huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Training GPT2 with run_clm.py exceeds the described memory amount #30969

Open CLL112 opened 1 month ago

CLL112 commented 1 month ago

System Info

Who can help?

@ArthurZucker and @younesbelkada

Information

Tasks

Reproduction

```bash
python run_clm.py \
  --model_name_or_path openai-community/gpt2 \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --do_train \
  --do_eval \
  --overwrite_output_dir \
  --output_dir /tmp/test-clm
```

Expected behavior

The example in the script mentions training on a K80 GPU with a batch size of 8, noting that the K80 has 24GB of memory. However, when I use an RTX 3090 with the batch size set to 4, it consumes 20GB of memory without my modifying any settings. Why is this the case?
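
To pin down the actual usage, it may help to watch GPU memory from a second shell while the script trains; a minimal sketch, assuming `nvidia-smi` and `watch` are available:

```bash
# Poll GPU memory once per second while run_clm.py runs in another shell.
watch -n 1 nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```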

CLL112 commented 1 month ago

When I set the batch size to 8, it results in an out-of-memory error.
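
As a workaround, gradient accumulation can keep the effective batch size at 8 while cutting per-step activation memory roughly in half. A sketch using the stock `run_clm.py` flags (`--gradient_accumulation_steps` is a standard `TrainingArguments` field; the memory saving is an expectation, not measured here):

```bash
python run_clm.py \
  --model_name_or_path openai-community/gpt2 \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 2 \
  --per_device_eval_batch_size 8 \
  --do_train \
  --do_eval \
  --overwrite_output_dir \
  --output_dir /tmp/test-clm
```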

ArthurZucker commented 1 month ago

Hey! Thanks for opening this; I won't really have time to dive in, the script is suuuuuper old! Leaving it as a good second issue for the community!

swapdewalkar commented 1 month ago

@ArthurZucker @amyeroberts Can I take this up?

amyeroberts commented 1 month ago

@swapdewalkar Sure!

pranav-bot commented 1 month ago

@ArthurZucker @amyeroberts If this hasn't been solved yet, can I take this up?

amyeroberts commented 1 month ago

Hi @pranav-bot, we prioritize based on PRs opened rather than claiming on issues, so you are free to tackle this if you'd like.

The reply above looks like an output from an LLM (listed, generic, non-specific solutions which don't address the problem). Please refrain from doing this within issues, as low quality replies add noise and make it harder for everyone to find the right answer.

pranav-bot commented 1 month ago

Sorry, I honestly did not mean to do that! @amyeroberts

@CLL112

So I tried a couple of things on Colab (Tesla T4 GPU); here's what I found:

The K80 is really two GPUs of 12GB each (24GB total, occupying a single PCIe slot), whereas the RTX 3090 is a single GPU with 24GB. `nvidia-smi` logs the K80's two halves as separate devices, so model parallelization may have an effect here.

Also, wikitext can't stream data; trying to do so gives the following error:

```
ValueError: The train_dataset does not implement __len__, max_steps has to be specified. The number of steps needs to be known in advance for the learning rate scheduler.
```

So this may not be a matter of memory leaks from unclosed data streamers; it might instead be the data generators that load the dataset causing high memory usage on the single device of the RTX 3090. It is probably not a case of unnecessary data copies or inefficient data loading either, because that should have affected both cards.

```
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80            Off | 0000:04:00.0     Off |                    0 |
| N/A   41C    P0    66W / 149W | 11022MiB / 11519MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80            Off | 0000:05:00.0     Off |                    0 |
| N/A   37C    P0    67W / 149W |   130MiB / 11519MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

Maybe running the model at half precision on an RTX 3090 would result in lower memory usage?
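
For anyone who wants to try the half-precision idea: `run_clm.py` parses the standard `TrainingArguments`, so passing `--fp16` should be enough. A minimal sketch of the original reproduction command with fp16 enabled (untested on this exact setup):

```bash
python run_clm.py \
  --model_name_or_path openai-community/gpt2 \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --fp16 \
  --do_train \
  --do_eval \
  --overwrite_output_dir \
  --output_dir /tmp/test-clm
```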

CLL112 commented 1 month ago

@pranav-bot Thanks. Was the batch size in your experiment set to 8? It only appears to be using 11GB of memory, which is a huge difference from the usage on an RTX 3090. After switching to fp16, I was able to set the batch size to 6, and it consumed 22GB of memory.
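
If memory is still tight even with fp16, gradient checkpointing is another standard `TrainingArguments` switch that trades extra compute for lower activation memory; a sketch combining both, untested here:

```bash
python run_clm.py \
  --model_name_or_path openai-community/gpt2 \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --fp16 \
  --gradient_checkpointing \
  --do_train \
  --do_eval \
  --overwrite_output_dir \
  --output_dir /tmp/test-clm
```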