pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Pretraining Cuda Out of Memory Issue #1932

Open muniefht opened 3 weeks ago

muniefht commented 3 weeks ago

I have a device containing 4 Nvidia L40 GPUs. I am trying to use the full_finetune_distributed recipe with the llama3_1/8B_full config. The dataset section of my config file is given below:

```yaml
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: "text"
  column: "text"
  packed: false
  split: "train"
  data_files: "pretrain-data-batch1-quartered/*.txt"
```

The data is all txt files. Initially I had planned to use 256M tokens to start the pretraining job, but I got a CUDA out-of-memory error. I have now reduced my files to 1/4th and I am still getting the same error on both full_finetune_distributed and lora_finetune_distributed. I have also reduced my batch size to 1, still with no success. I have the following questions in mind:

SalmanMohammadi commented 3 weeks ago

Hey @muniefht! Great to see you checking out torchtune.

How long are the samples in your dataset? If you haven't set a maximum sequence length in your tokenizer config, you might be filling up GPU memory with quite large sequences - particularly since the model you're using has a maximum sequence length of ~131k, which we'll use if the tokenizer doesn't have a maximum sequence length set.
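For illustration, capping the sequence length might look something like this in the tokenizer section of the config (the component and path below are assumptions based on the stock llama3_1 configs; point the path at your own checkpoint directory):

```yaml
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3.1-8B/original/tokenizer.model  # assumed location of tokenizer.model
  max_seq_len: 2048  # cap sequence length so one very long sample can't exhaust GPU memory
```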

> How many resources will I need to pretrain using either the full or LoRA-based recipe?

For reference, you can see some of our benchmarks for this model on different hardware setups here; they all use a maximum sequence length of 2048. Another thing to try would be enabling sample packing through dataset.packed. If there's significant variability in the length of samples in your dataset, this can boost performance a fair bit.
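As a sketch, both settings can also be passed as command-line overrides (recipe and config names are taken from this issue; the exact invocation, including the 4-GPU `--nproc_per_node` value, is an assumption about your setup):

```bash
tune run --nproc_per_node 4 full_finetune_distributed --config llama3_1/8B_full \
  tokenizer.max_seq_len=2048 \
  dataset.packed=True
```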

I'm not 100% sure on your questions about your data setup - these are some things that come immediately to mind. Let me know how you get on with these and we can dig a bit deeper if they don't help.

felipemello1 commented 3 weeks ago

@muniefht, if you are using nightlies and gradient accumulation, you will have OOM issues. This PR fixed it and will land today: https://github.com/pytorch/torchtune/pull/1917

I would also suggest the following:

- `dataset.packed=True`: improves speed greatly, and you won't have memory spikes, because the max_seq_len will be fixed. Requires setting `tokenizer.max_seq_len=X`.
- `compile=True`: speed and memory.
- `activation_checkpointing=True`: saves a lot of memory, but it's slower.
- `activation_offloading=True`: saves a lot of memory, but can be a bit slower. Sometimes it isn't.

If you can fit a high enough batch without using grad_accumulation, you can also set optimizer_in_bwd=True, which saves a lot of memory.

you can read more about these techniques here: https://pytorch.org/torchtune/main/tutorials/memory_optimizations.html
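Put together, and just as a hedged sketch (key names follow the flags above; depending on your torchtune version the shipped recipe configs may spell some of them as `enable_activation_checkpointing` / `enable_activation_offloading`), the config-level changes might look like:

```yaml
# Memory/speed settings discussed above; verify key names against your recipe's config.
dataset:
  packed: True                    # fixed-length packs, no per-sample length spikes
tokenizer:
  max_seq_len: 2048               # required when packing
compile: True                     # speed and memory
activation_checkpointing: True    # saves a lot of memory, but slower
activation_offloading: True       # saves a lot of memory, can be slightly slower
optimizer_in_bwd: True            # only if you don't need gradient accumulation
gradient_accumulation_steps: 1
```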

muniefht commented 3 weeks ago

@SalmanMohammadi My dataset is a collection of txt files. Some of them are quite long. I have computed statistics on the file lengths:

- Maximum file length: 768221
- Minimum file length: 6
- Average file length: 12169.60

I think the larger files are causing trouble? Maybe I need to split the content of the bigger files into multiple sub-files, since each file is being treated as a single sample/row/sequence and its length is causing the OOM?
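For reference, a minimal sketch of how such per-file length statistics could be computed (the directory name comes from the config above; measuring length in characters rather than tokens is an assumption):

```python
# Compute basic length statistics over the .txt files used for pretraining.
# Length is measured in characters here; swap in a tokenizer for token counts.
from pathlib import Path

lengths = [
    len(p.read_text(encoding="utf-8"))
    for p in Path("pretrain-data-batch1-quartered").glob("*.txt")
]

print(f"Maximum file length: {max(lengths)}")
print(f"Minimum file length: {min(lengths)}")
print(f"Average file length: {sum(lengths) / len(lengths):.2f}")
```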

muniefht commented 3 weeks ago

Update: I split the content of text files longer than 2048 into multiple sub-files. Training has now started; no other parameters were changed. My question is: will this impact model performance? The data, as you know, is just raw unstructured txt files.
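For context, a rough sketch of the kind of splitting described here (the 2048-character cutoff, the chunking scheme, and the output directory name are all assumptions):

```python
# Split any .txt file longer than a cutoff into fixed-size chunks written as separate files.
# Length is measured in characters, which only approximates the token count.
from pathlib import Path

CUTOFF = 2048
src = Path("pretrain-data-batch1-quartered")
dst = Path("pretrain-data-batch1-split")
dst.mkdir(exist_ok=True)

for path in src.glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    if len(text) <= CUTOFF:
        (dst / path.name).write_text(text, encoding="utf-8")
        continue
    # Write each CUTOFF-sized chunk as its own file so every sample stays short.
    for i in range(0, len(text), CUTOFF):
        (dst / f"{path.stem}_part{i // CUTOFF}.txt").write_text(
            text[i:i + CUTOFF], encoding="utf-8"
        )
```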

joecummings commented 3 weeks ago

This should have no effect on model training since in text completion everything is just predicting the next token!

I'd recommend looking at the parameter changes suggested by @felipemello1 b/c those will speed up training for you.

felipemello1 commented 3 weeks ago

The dataset itself shouldn't impact GPU memory, because we don't load the whole dataset onto the GPU; we only send each batch to the GPU right before the training step. So what I think is happening is that some of your sequences are very long, and when one of them is moved to the GPU you run out of memory (OOM).

If that's the case, then you don't have to manually change the dataset. What you can do is just set tokenizer.max_seq_len=2048 (or some other number).

Let me know if that makes sense. But FYI, just by using the parameters I mentioned, you can go from 70+ GiB to ~20 GiB, depending on the model size and batch size.