huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Inconsistent behavior on CPU vs. GPU #12072

Closed · mar-muel closed this issue 3 years ago

mar-muel commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): AutoModel

To reproduce

Steps to reproduce the behavior:

Hi all - I've been struggling with inconsistent behavior on CPU vs. GPU.

When running on CPU the following code works as expected:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def predict(model, tokenizer, test_str, device):
    # tokenizer returns a BatchEncoding (input_ids, attention_mask, ...)
    inputs = tokenizer(test_str, return_tensors='pt', padding=True).to(device)
    model.to(device)
    model.eval()
    with torch.no_grad():
        pred = model(**inputs)
    # move logits back to CPU so predictions are comparable across devices
    logits = pred.logits.cpu()
    return logits

device = 'cpu'
model_dir = 'test_dir'
model_type = 'roberta-base'
test_str = [
    'Hello! I am a test string!',
    ]

model = AutoModelForSequenceClassification.from_pretrained(model_type, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_type)

# save model
model.save_pretrained(model_dir)

pred1 = predict(model, tokenizer, test_str, device)
print(pred1)

model = AutoModelForSequenceClassification.from_pretrained(model_dir)
pred2 = predict(model, tokenizer, test_str, device)
print(pred2)

Output:

# The absolute values are random (the classification head is freshly
# initialized), but both predictions are identical
tensor([[-0.0238]])
tensor([[-0.0238]])
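
Side note: the absolute values vary from run to run because the num_labels=1 head is randomly initialized. Seeding torch before building the model should make them repeatable (assuming the head init draws from torch's global RNG):

import torch

# Seed the global RNGs before from_pretrained so the randomly
# initialized classification head is the same on every run
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)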

But when I switch to CUDA by changing the device:

device = 'cuda'

I get significantly different outputs:

tensor([[-0.3194]])
tensor([[-0.3414]])

Weirdly, the above doesn't happen if I increase the length of my test string:

test_str = [
    'Hello! I am a test string! Hello! I am a test string! Hello! I am a test string! Hello! I am a test string! ',
    ]
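
In case nondeterministic kernel selection plays a role (I haven't confirmed it does), these are the standard PyTorch switches for forcing deterministic behavior:

import os
import torch

# cuBLAS reads this env var at the first CUDA call, so set it early;
# required by use_deterministic_algorithms on CUDA >= 10.2
os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')

torch.backends.cudnn.benchmark = False      # don't auto-tune kernels
torch.backends.cudnn.deterministic = True   # only deterministic cuDNN kernels
torch.use_deterministic_algorithms(True)    # error on nondeterministic ops (torch >= 1.8)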

I'm pretty sure I'm missing something obvious - any help is appreciated! πŸ™

Expected behavior

I expect the output of the reloaded model to match the original model's output not only on CPU but also on GPU.

LysandreJik commented 3 years ago

Hello! This is weird; you indeed get significantly different outputs. Running your exact code sample above and only changing the device to cuda yields the same results for me:

tensor([[0.0769]])
tensor([[0.0769]])

I tried it a few times and always get the same results. I've added an additional statement to ensure we get the exact same output:

print(torch.allclose(pred1, pred2))

And we do!
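
For reference, torch.allclose defaults to rtol=1e-05 and atol=1e-08; printing the maximum absolute difference as well makes it easy to tell ordinary float noise from a real gap like the ~0.02 you're seeing:

# a looser tolerance plus the max difference separates float noise
# from a genuine mismatch
print(torch.allclose(pred1, pred2, rtol=1e-4, atol=1e-5))
print((pred1 - pred2).abs().max().item())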

I feel this may be a setup issue. Would you mind trying it on Colab and sharing the notebook if you get the same results, so that I can investigate?

mar-muel commented 3 years ago

Thanks a lot @LysandreJik. Yes, indeed there are no issues on Colab.

It turns out the problem only occurs with these PyTorch versions:

# pip freeze | grep torch
torch==1.8.1+cu111
torchaudio==0.8.1
torchvision==0.9.1+cu111

But using torch==1.8.1 works fine.

This is the output of my nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P0    70W / 149W |      0MiB / 11441MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I created my environment like this:

conda create -n ml python==3.8
conda activate ml
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install transformers
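
For comparison, this prints the exact torch/CUDA/cuDNN combination in use (standard torch attributes):

import torch

print(torch.__version__)               # e.g. 1.8.1+cu111
print(torch.version.cuda)              # CUDA toolkit the wheel was built against
print(torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))   # Tesla K80 here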

Would you mind checking whether you can reproduce with the above?

I'd really like to understand what's going on here πŸ˜…

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.