Open: asc-raynor opened this issue 1 week ago
Thanks all for the report and sorry for the delay, we're looking into it cc @muellerzr @SunMarc
Same issue, but with DPOTrainer (probably also with SFTTrainer, but I haven't tested). For me the error only occurs in multi-worker/multi-GPU/multi-node training with FSDP; with a single GPU there is no error. The issue is also not present in 4.45.2. I am wondering whether it is due to this change:
In v4.45.2 (in modeling_mistral.py):
hidden_states = outputs[0]
if labels is None and not is_torchdynamo_compiling():
    logger.warning_once(
        "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)"
    )
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
# TODO: remove the float() operation in v4.46
logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
In v4.46.2:
hidden_states = outputs[0]
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
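To illustrate the difference, here is a minimal standalone sketch (not the transformers code; the layer sizes and shapes are made up): the v4.45.2 path always upcasts logits to fp32, while the v4.46.2 path keeps them in the model's dtype, e.g. bf16.

import torch

# Hypothetical stand-in for lm_head on a bf16 model
lm_head = torch.nn.Linear(8, 16, dtype=torch.bfloat16)
hidden_states = torch.randn(1, 4, 8, dtype=torch.bfloat16)

logits_old = lm_head(hidden_states).float()  # v4.45.2 behavior: always torch.float32
logits_new = lm_head(hidden_states)          # v4.46.2 behavior: stays torch.bfloat16
print(logits_old.dtype, logits_new.dtype)    # torch.float32 torch.bfloat16

So under bf16 training, anything downstream of the logits now sees bf16 tensors instead of fp32.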
My traceback looks like this:
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/working_dir_files/_ray_pkg_92dffa2da1edbd43/fine_tune/main.py", line 87, in train_func
trainer.train()
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/transformers/trainer.py", line 2534, in _inner_training_loop
self.optimizer.step()
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/accelerate/optimizer.py", line 171, in step
self.optimizer.step(closure)
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 487, in wrapper
out = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
ret = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 220, in step
adamw(
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 782, in adamw
func(
File "/tmp/ray/session_2024-11-13_12-57-50_682472_12/runtime_resources/pip/885b4123dae986bae1106a4662ccedcbc5ae220d/virtualenv/lib/python3.11/site-packages/torch/optim/adamw.py", line 375, in _single_tensor_adamw
exp_avg.lerp_(grad, 1 - beta1)
RuntimeError: expected dtype float for `end` but got dtype c10::BFloat16
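The failing line can be reproduced in isolation. This is a minimal sketch of the assumed mechanism: AdamW's fp32 exp_avg state receives a bf16 gradient, and Tensor.lerp_ requires its end argument to match self's dtype.

import torch

exp_avg = torch.zeros(4, dtype=torch.float32)  # fp32 optimizer state
grad = torch.zeros(4, dtype=torch.bfloat16)    # bf16 gradient from the model
exp_avg.lerp_(grad, 0.1)
# RuntimeError: expected dtype float for `end` but got dtype c10::BFloat16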
The latest version of TRL (0.12.0) seems to have some issues, but version 0.11.3 works fine.
Same issue. I can't use FSDP with TRL anymore. Everything works again if I downgrade Accelerate, Transformers, and TRL to the versions from September. It is not related to PyTorch (I tried versions from 2.1 to 2.6). It might be related to something introduced in Transformers 4.46.2; to be confirmed.
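For anyone needing the temporary downgrade, a pin based on the versions reported in this thread (4.45.2 and TRL 0.11.3 come from the comments above; the matching Accelerate pin is not specified here) would look like:

pip install "transformers==4.45.2" "trl==0.11.3"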
Hi! Thanks for the bug report. This should be fixed via https://github.com/huggingface/transformers/pull/34645, can you install transformers via pip install git+https://github.com/huggingface/transformers? Thanks for your patience while we figure out ripple effects from the grad accum changes 🤗
System Info
- PyTorch: 2.2 and 2.4 (both tested)
- Transformers: 4.46.2
- Hardware: 4 * A6000 Ada
Who can help?
@muellerzr
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I used the FSDP training code from https://huggingface.co/docs/peft/accelerate/fsdp but got an "expected dtype float for `end` but got dtype c10::BFloat16" error. I changed the dtype (float16, float32, bfloat16) but the code still failed to run. What's the problem?

param:
Expected behavior
FSDP training should run without the dtype error.