meliksahturker opened 2 months ago
Thanks for opening a PR as well, will have a look!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This issue is not stale and the related PR still awaits merging.
Yep, waiting for a test to be added! 🤗
System Info
transformers version: 4.44.2

Who can help?
@RhuiDih @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
PR 31629 enabled packing with no cross-contamination, without requiring users to deal with attention masks, when using flash-attention-2. However, the prepare_fa2_from_position_ids function raises an error when training with a batch_size greater than 1.
Below is an end-to-end example to reproduce the error:
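The original snippet is not preserved in this thread; the following is a minimal sketch of such a reproduction, assuming a packed dataset whose position_ids restart at 0 at each sequence boundary. The model name, packed rows, and training arguments are illustrative placeholders, not the reporter's exact setup (flash-attn and a CUDA GPU are required):

```python
# Minimal sketch: pack two short sequences per row and train with
# per_device_train_batch_size=2 so the prepare_fa2_from_position_ids path is hit.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder; any FA2-capable model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # the code path changed by PR 31629
)

def make_packed_row():
    # Two sequences packed into one row; position_ids restarting at 0
    # mark the boundary, so no attention mask is needed.
    ids = tokenizer("Hello world. Goodbye world.")["input_ids"]
    first = len(ids) // 2
    return {
        "input_ids": ids,
        "labels": ids,
        "position_ids": list(range(first)) + list(range(len(ids) - first)),
    }

train_dataset = Dataset.from_list([make_packed_row() for _ in range(16)])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="repro",
        per_device_train_batch_size=2,  # fails; 1 trains fine
        max_steps=2,
        bf16=True,
        report_to="none",
    ),
    train_dataset=train_dataset,
)
trainer.train()
```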
The error:
When batch_size is set to 1, training runs without error. I tested on 8xH100 and 1xA100-40GB with different training strategies (e.g., "ddp", "deepspeed_stage_2") and hit the same error every time.
Expected behavior
Training should run without error for any batch_size value.