Closed eobrien2002 closed 6 months ago
RE: This is the error I get when I unwrap the model and then try to prepare it again with a new dataset:

```
  return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow
```
Any help would be greatly appreciated
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Info
Hi!
I am getting an error (seen below) when passing my model into accelerator.prepare() after I have already trained the model on one participant's worth of data. I've attached the start of the code where the error occurs.
The train() method (as seen in main() below) returns the model when it's done with the current participant's data. I am running multi-node, multi-GPU training on SLURM.
I tried a different approach where I only passed the model into accelerator.prepare() once, when it was first loaded; however, I then got an NCCL timeout error when starting on the next participant's data.
I'm open to suggestions for reconfiguring my training approach (note: each participant's dataset is roughly 4 GB). Essentially, my objective is to train the model on one participant's dataset at a time, while continuing to distribute the data across the GPUs and keeping the same model weights throughout.
Reproduction
Expected behavior
I expected accelerator.prepare() to take the model that is currently being trained and prepare it again with the new data. The issue may be that the model is already wrapped by the Accelerator. My model is fairly complex, with a UNet, VAE, an encoder, and a few additional layers built into it, and when I tried FSDP I got an error that the Accelerator could not find a transformer layer to wrap.