Closed: vgel closed this issue 4 days ago
Thanks for the detailed report @vgel! This is indeed a bug. I forgot that calling view modifies the tensor in place. Would you like to open a PR to fix this? As you tested, you just need to reshape the tensor to its original shape just after the quantize_fp8_per_row ops.
Sure, just opened a PR!
System Info

transformers 4.44.0
torch 2.4.0+cu121
fbgemm_gpu 0.8.0+cu121
Who can help?
@ArthurZucker (also maybe @SunMarc ?)
Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction
After some digging in pdb, I tracked it down to the quantized MLPs:
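The shape problem can be illustrated with a small stand-in that works on shape tuples only (all names here are hypothetical, not the actual transformers or fbgemm code): the per-row quantization path views a (batch, seq, hidden) input as (batch*seq, hidden), and if the output shape is then derived from the flattened input, batch and sequence stay squished together.

```python
# Hypothetical stand-in for the buggy forward path. quantize_fp8_per_row
# expects a 2-D (num_rows, hidden) input, so the layer views the input as
# (batch*seq, hidden); the bug is deriving the output shape from that
# flattened view instead of the original input shape.

def buggy_output_shape(input_shape, out_features):
    batch, seq, hidden = input_shape
    flattened = (batch * seq, hidden)        # what .view(-1, hidden) produces
    return flattened[:-1] + (out_features,)  # output derived from flattened view

def fixed_output_shape(input_shape, out_features):
    # A linear layer should map only the last dim: (..., in) -> (..., out).
    return input_shape[:-1] + (out_features,)

print(buggy_output_shape((2, 5, 16), 32))  # (10, 32) -- batch and seq squished
print(fixed_output_shape((2, 5, 16), 32))  # (2, 5, 32) -- expected
```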
I was able to patch it with this monkeypatch:
...which made model.generate work as expected.

Expected behavior
The quantized MLP layers should not squish batch size and sequence length together. I suspect these lines are at fault, but I'm not sure:
https://github.com/huggingface/transformers/blob/52cb4034ada381fe1ffe8d428a1076e5411a8026/src/transformers/integrations/fbgemm_fp8.py#L50-L52
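The fix the maintainer suggests (reshaping the tensor back to its original shape just after quantize_fp8_per_row) can be sketched in shape arithmetic; quantize_per_row_stub and the surrounding names are illustrative stand-ins, not the real fbgemm ops:

```python
def quantize_per_row_stub(input_shape):
    # Stand-in for the quantization step, which operates on a 2-D
    # (num_rows, hidden) view of the input.
    batch, seq, hidden = input_shape
    return (batch * seq, hidden)

def forward_shape_fixed(input_shape, out_features):
    quantized = quantize_per_row_stub(input_shape)
    # The suggested fix: restore the original leading dims right after
    # quantization, so downstream shape logic sees (batch, seq, hidden).
    restored = input_shape[:-1] + quantized[-1:]
    # The linear projection then maps only the last dimension.
    return restored[:-1] + (out_features,)

print(forward_shape_fixed((2, 5, 16), 32))  # (2, 5, 32)
```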