Open mgerstgrasser opened 1 month ago
cc @muellerzr @SunMarc
Oh, to be clear, this is fixed by #30897. The underlying problem might still be there: saving the model "too soon" after preparing it for FSDP can cause FSDP to fail. But I don't know if that needs fixing, or if the fix is simply "don't do that".
System Info
transformers==4.41.0
Who can help?
@pacman100 @muellerzr
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
In an extraordinary case of "things that can't possibly be interacting yet somehow they are", it seems that in 4.41.0, logging to wandb breaks distributed training with FSDP!
Taking the trl SFT example as a basis:
This works for transformers==4.40.2, but crashes with transformers==4.41.0. Setting instead --report_to none, it still works in 4.41.0.
The error is pretty non-descript:
Expected behavior
I'd expect this to not crash.
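For anyone stuck on 4.41.0 in the meantime, here is a minimal sketch of a stopgap based only on the observations above: fall back to `none` for the `--report_to` value when the installed version is exactly 4.41.0, and leave it alone otherwise. The helper names `parse_version` and `safe_report_to` are hypothetical, not part of transformers, and the check deliberately covers only the one version reported as broken here, since this thread does not say which release includes #30897.

```python
import re

# Version reported in this issue as crashing with FSDP + wandb logging.
# Assumption: only this exact release is treated as affected.
AFFECTED_VERSION = (4, 41, 0)


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a version string like '4.41.0'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def safe_report_to(transformers_version: str, requested: str = "wandb") -> str:
    """Return the value to pass to --report_to (hypothetical helper).

    Falls back to 'none' on the affected version, since that setting
    was observed to avoid the crash; otherwise keeps the requested value.
    """
    if parse_version(transformers_version) == AFFECTED_VERSION:
        return "none"
    return requested
```

In a launch script, one could pass `transformers.__version__` as the first argument and forward the result to `--report_to`. This only sidesteps the crash by disabling wandb logging; it does nothing about the underlying FSDP save-timing problem.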