pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

mps: weird results given by Transformer CausalLM #84169

Closed Willian-Zhang closed 1 year ago

Willian-Zhang commented 2 years ago

🐛 Describe the bug

%env PYTORCH_ENABLE_MPS_FALLBACK=1
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
assert torch.backends.mps.is_built()
assert torch.backends.mps.is_available()

pretrain_model = "bigscience/bloom-560m"
device = "mps"

tokenizer = AutoTokenizer.from_pretrained(pretrain_model)
model = AutoModelForCausalLM.from_pretrained(pretrain_model).to(device)
generator = transformers.pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=torch.device(device))

r = generator("This is a conversation between A and B. \nA: Your should say something meaningful.\nB:", max_length=50, use_cache=True)
print(r[0]['generated_text'])

This should give something like the output below (which is what running with device = "cpu" actually produces):

This is a conversation between A and B. 
A: Your should say something meaningful.
B: I don't know what you mean. But I think you should say something meaningful.

However, with device = "mps" it actually gives:

This is a conversation between A and B. 
A: Your should say something meaningful.
B: is is is isThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThisThis

This behavior is not limited to bigscience/bloom-560m; all CausalLM models seem to produce similar results.

Just FYI

To get rid of the warning and rule out weird behavior from the MPS fallback, I added some code to the transformers source at site-packages/transformers/models/bloom/modeling_bloom.py. This does not affect the reported bug behavior.

# When the batch has a single, all-ones attention mask, build position_ids
# with arange instead of cumsum, sidestepping the MPS fallback path.
if attention_mask.shape[0] == 1 and not torch.any(attention_mask - 1):
    position_ids = torch.arange(attention_mask.shape[-1], dtype=torch.long,
                                device=attention_mask.device).expand(attention_mask.shape)
else:
    print(attention_mask)
    position_ids = attention_mask.cumsum(-1)

Versions

torch                   1.13.0.dev20220827
transformers            4.21.2

cc @ezyang @gchanan @zou3519 @kulinseth @albanD @malfet @DenisVieriu97 @razarmehr @abhudev

aljungberg commented 2 years ago

I had the same issue running neox on the M1.

https://github.com/zphang/minimal-gpt-neox-20b/issues/5

With mps, I got "...developed by EleutherAI. in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in" and with cpu on the same machine, I got "... developed by EleutherAI. It is a state-of-the-art language model...".

That implementation doesn't use transformers at all; it's just plain PyTorch. I tested on 1.13.0.dev20220803.
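
A minimal sketch of this kind of cpu-vs-mps comparison, using a stock nn.MultiheadAttention layer as a stand-in (the layer and shapes are placeholders, not the NeoX code):

import torch

# Run the same attention layer on cpu and mps with identical inputs and
# compare the outputs; a large mismatch points at an MPS kernel issue.
torch.manual_seed(0)
x = torch.randn(1, 8, 64)

layer = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True).eval()

with torch.no_grad():
    out_cpu, _ = layer(x, x, x)
    layer_mps = layer.to("mps")
    out_mps, _ = layer_mps(x.to("mps"), x.to("mps"), x.to("mps"))

print(torch.allclose(out_cpu, out_mps.cpu(), atol=1e-5))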

undefdev commented 1 year ago

I have the same issue with galactica-6.7b used with Hugging Face's transformers. Here is a minimal example to reproduce:

from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")

model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b").to("mps")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("mps")

outputs = model.generate(input_ids, max_new_tokens = 20)
print(tokenizer.decode(outputs[0]))

output with mps:

The Transformer architecture [START_REF]      results results results results results results results results results results results results results results results

________________________________________________________
Executed in   40.03 secs    fish           external
   usr time   25.76 secs    0.17 millis   25.76 secs
   sys time   24.76 secs    2.54 millis   24.76 secs

output when removing to("mps") (running in cpu mode):

The Transformer architecture [START_REF] Attention is All you Need, Vaswani[END_REF] is a sequence-to-sequence model that uses self

________________________________________________________
Executed in   56.61 secs    fish           external
   usr time   42.90 secs    0.17 millis   42.90 secs
   sys time   27.70 secs    2.38 millis   27.70 secs

torch version: 1.14.0.dev20221117

I'm using an Apple M1 Max MacBook with macOS 13.0.

marcj commented 1 year ago

Setting use_cache to False fixed it for me, e.g.:

outputs = model.generate(input_ids=input_ids, do_sample=False, use_cache=False, max_new_tokens=max_length)

SvenStahlmann commented 1 year ago

What could be the cause of this? I have the same problem in a different package, but my model does not have a generate function with use_cache.
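
For a model without a generate() call, one hedged workaround sketch (the model name below is just a placeholder) is to disable the KV cache on the Hugging Face config, so every forward pass runs cache-free:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

name = "bigscience/bloom-560m"  # placeholder; substitute whatever the package actually loads
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to("mps")

# Disabling the cache on the config affects all forward passes,
# whether or not generate() is used.
model.config.use_cache = False

input_ids = tokenizer("Hello world", return_tensors="pt").input_ids.to("mps")
with torch.no_grad():
    logits = model(input_ids).logits
print(logits.shape)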

kulinseth commented 1 year ago

What could be the cause of this? I have the same problem in a different package, but my model does not have a generate function with use_cache.

Is this still happening with the latest nightly?

SvenStahlmann commented 1 year ago

Hey, yes, I just tested it. It is still happening with pytorch-2.1.0.dev202 (installed today). The quality of the output is way worse when using "mps" compared to "cpu" on Mac.

kulinseth commented 1 year ago

Thanks @SvenStahlmann and @Willian-Zhang, we will investigate the issue.

DenisVieriu97 commented 1 year ago

@Willian-Zhang thanks for filing this issue. Could you please try the latest nightly? This should be fixed there: pip3 install --pre --force-reinstall torch --index-url https://download.pytorch.org/whl/nightly/cpu

This is a conversation between A and B.
A: Your should say something meaningful.
B: I don't know what you mean. But I think you should say something meaningful.
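
As a quick sanity check after installing the nightly (assuming an Apple-silicon Mac), something like the following confirms the new build and MPS availability:

import torch

print(torch.__version__)                  # should report the freshly installed nightly
print(torch.backends.mps.is_built())      # True if the wheel was built with MPS support
print(torch.backends.mps.is_available())  # True on Apple silicon with a recent macOS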
Willian-Zhang commented 1 year ago

I can confirm the problem is gone with torch-2.1.0.dev20230804 on macOS 13.5 (22G74).

@Willian-Zhang thanks for filing this issue. Could you please try the latest nightly? This should be fixed there: pip3 install --pre --force-reinstall torch --index-url https://download.pytorch.org/whl/nightly/cpu

This is a conversation between A and B.
A: Your should say something meaningful.
B: I don't know what you mean. But I think you should say something meaningful.