huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Possible bug in `train_batch_size` #9340

Closed · EmilyAlsentzer closed this issue 3 years ago

EmilyAlsentzer commented 3 years ago

Environment info

Who can help

Trainer: @sgugger

Information

Model I am using (Bert, XLNet ...):

BERT

The problem arises when using:

I'm running a model on a toy dataset with only 2 examples and a batch size of 2. In the Trainer, `num_examples` is 2, but `total_train_batch_size` is 12, even though I do not have the `model_parallel` flag set to True (note that I do have 6 GPUs available on the machine). This doesn't seem to affect my run because `train_dataset_is_sized` is True, but it seems strange.
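For reference, here is a minimal sketch (not the exact Trainer source) of the arithmetic that produces 12 in this setup: the per-device batch size is multiplied by the number of visible GPUs and by the gradient accumulation steps.

```python
# Minimal sketch of the effective-batch-size arithmetic, assuming the
# usual DataParallel behaviour: the per-device batch is replicated on
# every visible GPU. The values below mirror this issue's setup.
per_device_train_batch_size = 2
n_gpu = 6                        # GPUs visible on the machine
gradient_accumulation_steps = 1

# Each optimizer step consumes per_device * n_gpu * accumulation examples.
train_batch_size = per_device_train_batch_size * max(1, n_gpu)
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)    # 12, even though the dataset has only 2 examples
```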

The task I am working on is:

a toy classification JSONL dataset with 2 examples

To reproduce

I think that this line has an unnecessary `not`. Should it be `if self.model_parallel` instead of `if not self.model_parallel`? Thanks!
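For context, here is a hypothetical reconstruction of the condition being questioned (the exact source line is linked in the original issue but not quoted here). The `not` is in fact intentional: the batch only multiplies across GPUs when the model is replicated, not when it is split.

```python
# Hypothetical reconstruction of the questioned condition, for
# illustration only; this is not the actual Trainer source.
def train_batch_size(per_device_batch_size: int, n_gpu: int,
                     model_parallel: bool) -> int:
    if not model_parallel:
        # Data parallelism: the model is replicated, so every GPU
        # processes its own per-device batch in the same step.
        return per_device_batch_size * max(1, n_gpu)
    # Model parallelism: one model split across GPUs, one batch total.
    return per_device_batch_size
```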

patrickvonplaten commented 3 years ago

I think @sgugger can best answer here when he's back from holiday :-)

sgugger commented 3 years ago

You're misunderstanding the `model_parallel` flag: it's not there to enable the use of several GPUs, as that is done automatically by the Trainer (you have to set `CUDA_VISIBLE_DEVICES` to just one GPU if you don't want the Trainer to use them all). That flag is there to split the model's layers across the various GPUs available (only supported for a few models).
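As a usage note, here is a minimal sketch of restricting the Trainer to one GPU via device visibility (the `output_dir` value is just a placeholder). The environment variable must be set before CUDA is initialized:

```python
import os

# Expose only GPU 0; this must happen before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",               # placeholder path
    per_device_train_batch_size=2,
)
# With a single visible device there is no replication:
print(args.train_batch_size)        # 2 instead of 12
```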

EmilyAlsentzer commented 3 years ago

Got it. I didn't realize that the Trainer automatically uses multiple GPUs when they're visible. Thanks!