lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

Accelerate failing on multi-gpu rng synchronization #209

Closed LWprogramming closed 1 year ago

LWprogramming commented 1 year ago

I can do semantic but not coarse transformer training right now. Here's what the error message looks like:

File "/path/to/trainer.py", line 999, in train_step
data_kwargs = dict(zip(self.ds_fields, next(self.dl_iter)))
File "/path/to/trainer.py", line 78, in cycle
for data in dl:
File "/path/to/venv/site-packages/accelerate/data_loader.py", line 367, in iter
synchronize_rng_states(self.rng_types, self.synchronized_generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 100, in synchronize_rng_states
synchronize_rng_state(RNGType(rng_type), generator=generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 95, in synchronize_rng_state
generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state

This is in the trainer.py file. I don't think the dataloaders are constructed any differently, so I'm confused about whether this is expected (it also wasn't clear to me what generator means here vs. an rng type like cuda). Do you have ideas for why this might be failing only on coarse but not semantic?
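
For reference, the call that blows up is just torch.Generator.set_state; here's a tiny illustration of the round-trip accelerate is attempting (purely illustrative, not a repro of my setup):

```python
import torch

g = torch.Generator()
state = g.get_state()  # CPU ByteTensor holding the full Mersenne Twister state
g.set_state(state)     # valid round-trip, no error

# in the traceback above, accelerate calls generator.set_state(rng_state) with a
# state tensor that has been synchronized across processes; "Invalid mt19937 state"
# means the tensor arriving there no longer passes pytorch's validity check
```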

I found this issue with the same error message, but unfortunately it never got resolved, and I didn't find any similar issues besides that one.

lucidrains commented 1 year ago

@LWprogramming i can't tell at first glance; the code looks ok from a quick scan

i may be getting back to audio stuff / TTS later this week, so can help with this issue then

are you using Encodec?

LWprogramming commented 1 year ago

Yeah, using Encodec. Do you suspect that the codec might be the issue somehow?

I also noticed (after adding some more prints) that we see some weird behavior:

lucidrains commented 1 year ago

i don't really know, but probably good to rule out an external library as the issue

will get back to this either end of this week or next Monday. going all out on audio again soon

LWprogramming commented 1 year ago

OK, this is pretty baffling. I tried rearranging the order in which I train semantic, coarse, and fine (starting with coarse and then semantic) and it ran fine, and I was actually able to get samples! Still using my script; gotta run now but I'll take a look in a bit. Not sure why it reliably breaks immediately at the start of coarse if I do semantic, coarse, then fine in that order?

lucidrains commented 1 year ago

are you training them all at once?

LWprogramming commented 1 year ago

Yeah, the setup is something like (given some configurable integer save_every):

train semantic for save_every steps, then train coarse for save_every steps, then fine. Then try sampling, then do another save_every steps per trainer, and repeat. This way we can see what the samples are like as the transformers gradually train.
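
Roughly, the loop is the following (just a sketch of my script; the trainer objects, the train_steps helper, and generate_and_save_samples are placeholders, not the actual audiolm-pytorch API):

```python
# sketch of my interleaved loop; the names below are placeholders for my script
save_every = 500  # configurable

def train_steps(trainer, num_steps):
    for _ in range(num_steps):
        trainer.train_step()  # placeholder: one optimization step

for cycle in range(num_cycles):
    train_steps(semantic_trainer, save_every)
    train_steps(coarse_trainer, save_every)  # <- reliably crashes right here
    train_steps(fine_trainer, save_every)
    generate_and_save_samples(cycle)         # listen to intermediate output
```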

lucidrains commented 1 year ago

ohh! yeah that's the issue then

you can only train one network per training script

lucidrains commented 1 year ago

I can add some logic to prevent this issue in the future, with an informative error
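
something along these lines, as a rough sketch (not the final code that will land in trainer.py):

```python
# rough sketch of the guard: refuse to construct a second trainer in the same
# process, since each trainer prepares its own accelerator and dataloaders
_TRAINER_CREATED = False

def check_one_trainer():
    global _TRAINER_CREATED
    assert not _TRAINER_CREATED, (
        'only one trainer (semantic, coarse, or fine) can be instantiated per '
        'training script - run each stage as its own process'
    )
    _TRAINER_CREATED = True

class CoarseTransformerTrainer:
    def __init__(self, **kwargs):
        check_one_trainer()
        # ... the usual setup (accelerator, dataloader, optimizer) would follow
```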

LWprogramming commented 1 year ago

wait what haha

does accelerator do something weird that can only happen once per call?

(also, are you defining "training script" as a single python script? e.g. you can only prepare the accelerator once per execution of the thing I call with accelerate launch?)

lucidrains commented 1 year ago

you'd need the training script to be executed 3 times separately, once for each network, with each run terminating before the next starts. Then you put all the trained models together
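
i.e. something like the sketch below, picking one stage per invocation (the make_*_trainer helpers are placeholders for however you construct your trainers):

```python
# sketch: train exactly one network per run, e.g.
#   accelerate launch train.py --stage semantic
#   accelerate launch train.py --stage coarse
#   accelerate launch train.py --stage fine
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--stage', choices = ('semantic', 'coarse', 'fine'), required = True)
args = parser.parse_args()

# placeholders for your own trainer construction code
builders = dict(
    semantic = make_semantic_trainer,
    coarse = make_coarse_trainer,
    fine = make_fine_trainer
)

trainer = builders[args.stage]()
trainer.train()  # process exits when this stage finishes; then launch the next stage
```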

LWprogramming commented 1 year ago

Oh interesting, is this something that's built into accelerate, or is it specific to your code? I don't recall seeing any warnings about this in the huggingface docs; or, if it's your code, which part assumes that haha

lucidrains commented 1 year ago

@LWprogramming this is just how neural network training is generally done today, if you have multiple big networks to train

lucidrains commented 1 year ago

i can add the error message later today! this is a common gotcha, which i handled before over at imagen-pytorch (which also has multiple networks)

LWprogramming commented 1 year ago

Ahh ok! I'll have to rewrite some of my code haha

(for anyone looking at this in the future: I just talked to a friend of mine, and they pointed out that training multiple models in parallel either requires moving parameters on and off the gpu a lot more, or, if the models are small enough to fit in memory together, the batch size necessarily gets smaller. I guess I still don't know what exactly caused things to break, but it doesn't matter so much now.)

Thanks so much!

lucidrains commented 1 year ago

haha yea, we are still in the mainframe days of deep learning. A century from now, maybe it won't even matter