Closed: avacaondata closed this issue 3 years ago.
Hi there. It's just a logging problem in the reporting of the total batch size. If we do the math, from your 5,835,032 samples we get 91,172 batches at the per-device batch size of 64, 11,396 batches per core (divided by the number of cores), and 1,424 optimization steps (divided by the accumulation steps), which, multiplied by the 3 epochs, gives the 4,272 steps you see (the arithmetic is sketched below).
So the number of cores is indeed taken into account.
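For anyone double-checking, here is that arithmetic as a minimal Python sketch. The sample count, per-device batch size, core count, and epoch count come from this thread; the gradient accumulation value of 8 is an inference from the quoted step counts, not something stated explicitly.

```python
# Step arithmetic from the numbers quoted above.
num_samples = 5_835_032        # from the report
per_device_batch_size = 64     # from the report
num_cores = 8                  # from the report
grad_accum_steps = 8           # inferred: 11,396 / 1,424 is approximately 8
epochs = 3                     # from the report

batches = num_samples // per_device_batch_size            # 91,172
batches_per_core = batches // num_cores                   # 11,396
optimization_steps = batches_per_core // grad_accum_steps # 1,424
total_steps = optimization_steps * epochs                 # 4,272

print(batches, batches_per_core, optimization_steps, total_steps)
```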
Ahh, I see, my bad, I didn't calculate the number of steps correctly then (what a Data Scientist :P). Thank you very much @sgugger!
Environment info
transformers version: 4.2.2
Who can help
@patrickvonplaten, @LysandreJik, @sgugger
Information
Model I am using (Bert, XLNet ...): ALBERT base
The problem arises when using:
The problem occurs when I try to train with run_mlm_wwm.py through xla_spawn.py. I've checked that when xla_spawn.py calls run_mlm_wwm.py, xm.xrt_world_size() is 8, as it should be. However, when the Trainer starts to train, its total batch size is reported as only 64, when it should be 64 * num_cores = 512. I've printed out the parameters sent by xla_spawn.py and those received by run_mlm_wwm.py, and they coincide, so I don't understand why, at line 690 of the trainer,

```python
total_train_batch_size = self.args.train_batch_size * xm.xrt_world_size()
```

does not produce a total_train_batch_size of 512.
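For reference, a minimal sketch of the computation I expect at that line, written against the standard torch_xla API; this is not a verbatim copy of the trainer code:

```python
import torch_xla.core.xla_model as xm

# Expected total batch size when running under xla_spawn.py:
per_device_batch_size = 64  # value from the report
total_train_batch_size = per_device_batch_size * xm.xrt_world_size()
# With all 8 cores visible, xm.xrt_world_size() returns 8 and this
# evaluates to 512; the complaint is that only 64 is reported.
```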
This is the full call:
The model starts to train, but it doesn't take into account that it has 8 TPU cores:
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
Expected behavior
It's expected that xla_spawn.py runs the Python file passed to it in a multiprocessing fashion, distributing the batches and the model over the TPU cores; however, at some point xrt_world_size() seems to drop to 1, as if only one of the available devices were visible. A sketch of the expected launch pattern follows.
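For context, this is roughly the launch pattern xla_spawn.py is expected to follow, sketched with the public torch_xla multiprocessing API; the worker function here is illustrative, not the actual transformers code:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process should report the full world size (8 on an
    # 8-core TPU), not 1; the report is that this stops holding at
    # some point during training.
    print(f"process {index}: world size = {xm.xrt_world_size()}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8)
```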