Closed: b7leung closed this issue 3 years ago.
Hi, it is possible to add some parameters to use fp16 instead of fp32, which saved me enough memory to train the inverse paraphraser model on a 16 GB P100 on Colab Pro. Try adding --fp16 and --fp16_opt_level "O3" to the command above.
You will need to install NVIDIA Apex for AMP support, which I found was best retrieved with git clone https://github.com/NVIDIA/apex. Check the README and the docs at https://nvidia.github.io/apex/amp.html; everything is already implemented in Kalpesh's code.
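For concreteness, the setup looked roughly like this for me. The pip flags below are the ones the Apex README recommended at the time of writing, and the training script name and other arguments are placeholders, not the repo's exact interface, so substitute whatever command you are already running:

```shell
# Install Apex with its CUDA/C++ extensions (check the Apex README for
# the currently recommended install command before copying this):
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..

# Then append the mixed-precision flags to your existing training command.
# "train_script.py" and the elided arguments are placeholders:
python train_script.py \
    ...your-existing-arguments... \
    --fp16 \
    --fp16_opt_level "O3"
```

Note that "O3" is full fp16 and can be numerically unstable for some models; if training diverges, "O1" or "O2" are more conservative opt levels worth trying.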
Good luck, it would be cool to compare notes as I am also currently training my inverse paraphraser models.
When I run paraphrase_many.py I get a CUDA out-of-memory error; I am not sure which parameter I should adjust.
@JonOnEarth did you try reducing the batch size with the --batch_size parameter? I haven't tried this myself, but it seems like a good starting point since the default is 64.
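Something like the following, i.e. halving the batch size until generation fits in memory (the other arguments are placeholders for whatever you already pass, not the script's exact interface):

```shell
# Smaller batches are slower but produce the same outputs at inference;
# drop from the default of 64 until the OOM goes away.
python paraphrase_many.py \
    ...your-existing-arguments... \
    --batch_size 8
```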
If batch size 1 doesn't fit, you should try a smaller model like gpt2-medium (it's not too much worse). Gradient checkpointing is also an option, but it will need more work. We trained all our models on a 24 GB GPU.
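To illustrate the gradient checkpointing idea (this is a generic PyTorch sketch, not the repo's code): torch.utils.checkpoint recomputes intermediate activations during the backward pass instead of storing them all, trading extra compute for a lower peak memory footprint:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A stand-in for a deep stack of transformer blocks.
layers = nn.Sequential(
    *[nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(8)]
)

x = torch.randn(4, 64, requires_grad=True)

# Split the stack into 2 segments; only the segment boundaries keep
# their activations, and everything in between is recomputed during
# the backward pass.
out = checkpoint_sequential(layers, 2, x, use_reentrant=False)
out.sum().backward()  # gradients flow as usual, with lower peak memory
```

With a real GPT-2 from HuggingFace Transformers, the equivalent one-liner is (if your installed version supports it) model.gradient_checkpointing_enable() before training.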
I'm trying to train an inverse paraphraser on my own custom dataset (I already followed these data preprocessing steps). My command is below; distributed training has been turned off. Even with a batch size of only 1, I still run out of memory on a GTX 1080 Ti (~11 GB). Is this expected, and are 2+ GPUs simply required? Or did I get something wrong? Is there anything else I can do to make training work on 1 GPU?