tunib-ai / parallelformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
https://tunib-ai.github.io/parallelformers
Apache License 2.0

GPT2 parallelism does not work on the Tesla K80 #27

Closed: 0x7o closed this issue 2 years ago

0x7o commented 2 years ago

How to reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# Split the model across two GPUs with fp16 weights and detailed logging.
parallelize(model, num_gpus=2, fp16=True, verbose='detail')

inputs = tokenizer("Parallelformers is", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=15,
)

print(f"Output: {tokenizer.batch_decode(outputs)[0]}")

Problem

The model is distributed across the two GPUs as expected, but during generation the second GPU jumps to 100% utilization and never leaves that state; generation hangs and no output is produced. [screenshot]
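
A hang with one GPU pinned at 100% during generation usually points to a stuck collective operation rather than to the model itself. The following is a minimal standalone sanity check, independent of parallelformers, that verifies a basic NCCL all_reduce completes across both GPUs; if this script also hangs, the problem is in the GPU-to-GPU communication path on this machine rather than in the library:

# Standalone NCCL sanity check (not part of parallelformers): spawn one
# process per GPU and run a single all_reduce. If this hangs too, the
# communication layer on this machine is the problem, not the library.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)  # default op is SUM; should finish almost instantly
    print(f"rank {rank}: all_reduce result = {x.item()} "
          f"(expected {world_size}.0)")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)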

Environment

PyTorch version: 1.10.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-187-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: NVIDIA Tesla K80
GPU 1: NVIDIA Tesla K80

Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.10.1+cu113
[conda] numpy                     1.21.6                   pypi_0    pypi
[conda] torch                     1.10.1+cu113             pypi_0    pypi

hyunwoongko commented 2 years ago

We don't support the K80; it's a very old GPU.
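
For reference, the Tesla K80 is a Kepler-generation card with compute capability 3.7, while native fp16 arithmetic in CUDA requires compute capability 5.3 or higher, so fp16=True is a plausible culprit on this hardware. A minimal pre-flight check one could run before calling parallelize might look like the sketch below (the 5.3 threshold is the CUDA requirement for native half precision, not a documented parallelformers limit):

# Hypothetical pre-flight check: warn if any visible GPU is too old for
# fp16. Native half-precision math needs compute capability >= 5.3; the
# Tesla K80 reports 3.7. The threshold is an assumption about fp16
# support, not a documented parallelformers requirement.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    if (major, minor) < (5, 3):
        print(f"GPU {i} ({name}): compute capability {major}.{minor} "
              f"-- too old for fp16; try fp16=False or newer hardware")
    else:
        print(f"GPU {i} ({name}): compute capability {major}.{minor} OK")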