ojh31 closed this issue 1 month ago.
I can confirm the same error after upgrading to accelerate 0.33.0 and transformers 4.44.0
👋 Hi @ojh31, thank you for opening this issue!
I believe this issue is the same as in #32885. I'd like the fix to be slightly different from the one you proposed, mostly due to an ongoing refactor on our end. Have a look at my comment here
(Redirecting to the other thread to avoid multiple parallel discussions; I know this issue is older, I take issues in a LIFO queue 🤗 )
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.42.4

Who can help?
@gante @SunMarc @ArthurZucker

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Run accelerate launch --config_file=accelerate_config.yaml foo.py
foo.py:
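The contents of foo.py were not captured in this thread. A minimal sketch of the kind of script that can trigger the behaviour described below, assuming a LLaMA-style model, two processes, and prompts of different lengths so one rank finishes generating before the other (the model name, prompts, and generation settings are illustrative, not the reporter's actual values):

```python
# Hypothetical reconstruction -- the original foo.py was not captured here.
# Each rank gets a different prompt so that one peer hits EOS earlier while
# synced_gpus keeps both ranks inside the generation loop.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any LLaMA-style model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = accelerator.prepare(model)
model.eval()

# Different prompt per process so the ranks finish at different times.
prompts = ["Hello", "Write a long story about a ship lost at sea:"]
prompt = prompts[accelerator.process_index % len(prompts)]
inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)

with torch.no_grad():
    output_ids = accelerator.unwrap_model(model).generate(
        **inputs,
        max_new_tokens=128,
        synced_gpus=True,  # keep all ranks stepping through the loop together
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```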
accelerate_config.yaml:
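The config file was likewise lost in extraction; a plausible two-GPU example using standard accelerate config keys (all values are assumptions, not the reporter's actual configuration):

```yaml
# Hypothetical accelerate_config.yaml -- the original was not captured here.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2
machine_rank: 0
mixed_precision: bf16
rdzv_backend: static
same_network: true
use_cpu: false
```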
Expected behavior
Should generate text output, but instead raises a shape-mismatch error during generation.
Hypothesis: in transformers/generation/utils.py::GenerationMixin._sample(), during the while self._has_unfinished_sequences() loop, we continue when synced_gpus and this_peer_finished. This skips the concatenation of next_tokens onto input_ids, whereas the past_key_value cache keeps being updated in transformers/models/llama/modeling_llama.py::LlamaSdpaAttention.forward(), because the forward pass still runs before the continue. Therefore, when one process finishes generation before the other, the finished process keeps expanding its key-value cache but stops expanding its input tensors, leading to a shape mismatch. Maybe a simple fix would be to forcibly set past_key_value to None once this_peer_finished is set to True?
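For context, a simplified paraphrase of the relevant loop (not the exact library source; names follow transformers 4.42) showing where the two lengths drift apart:

```python
# Simplified paraphrase of GenerationMixin._sample() in transformers ~4.42,
# only meant to illustrate the drift described above.
while self._has_unfinished_sequences(this_peer_finished, synced_gpus, device=input_ids.device):
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

    # The forward pass still runs for finished peers (to keep collective ops in
    # sync across ranks) and it mutates the past_key_value cache in place
    # inside the attention layers.
    outputs = self(**model_inputs, return_dict=True)

    if synced_gpus and this_peer_finished:
        # Everything below is skipped, so input_ids stops growing even though
        # the cache just grew by one position in the forward pass above.
        continue

    next_tokens = ...  # pick tokens from outputs.logits
    input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
    model_kwargs = self._update_model_kwargs_for_generation(outputs, model_kwargs)
```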