kimiyoung / transformer-xl

Apache License 2.0
3.6k stars 762 forks

Cuda out of memory #63

Open agemagician opened 5 years ago

agemagician commented 5 years ago

I have a machine with 6 Titan GPUs, each with 12 GB of memory. I changed the code to add my own dataset, but I always get CUDA out of memory:

Run training...
Experiment dir : /home/agemagician/Downloads/transformer-xl/pytorch/models/uniref50/base_v1-uniref50/20190419-023635
Loading cached dataset...
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    device=device, ext_len=args.ext_len)
  File "/home/agemagician/Downloads/transformer-xl/pytorch/data_utils.py", line 239, in get_iterator
    data_iter = LMOrderedIterator(self.train, *args, **kwargs)
  File "/home/agemagician/Downloads/transformer-xl/pytorch/data_utils.py", line 29, in __init__
    self.data = data.view(bsz, -1).t().contiguous().to(device)
RuntimeError: CUDA out of memory. Tried to allocate 40.00 GiB (GPU 0; 11.75 GiB total capacity; 0 bytes already allocated; 11.08 GiB free; 0 bytes cached)

It doesn't matter whether I reduce the model size or the target length, or even add batch chunking. Here is my bash file:

#!/bin/bash

if [[ $1 == 'train' ]]; then
    echo 'Run training...'
    python train.py \
        --cuda \
        --data /media/agemagician/Disk2/projects/protin/dataset/uniref50_transformer_xl \
        --dataset uniref50 \
        --n_layer 12 \
        --d_model 512 \
        --n_head 8 \
        --d_head 64 \
        --d_inner 2048 \
        --dropout 0.1 \
        --dropatt 0.0 \
        --optim adam \
        --lr 0.00025 \
        --warmup_step 10000 \
        --max_step 400000 \
        --tgt_len 200 \
        --mem_len 200 \
        --eval_tgt_len 128 \
        --batch_size 24 \
        --multi_gpu \
        --varlen \
        --gpu0_bsz 4 \
        --fp16 \
        --dynamic-loss-scale \
        --batch_chunk 4 \
        ${@:2}
elif [[ $1 == 'eval' ]]; then
    echo 'Run evaluation...'
    python eval.py \
        --cuda \
        --data /media/agemagician/Disk2/projects/protin/dataset/uniref50_transformer_xl \
        --dataset uniref50 \
        --tgt_len 80 \
        --mem_len 4096 \
        --clamp_len 820 \
        --same_length \
        --split test \
        ${@:2}
else
    echo 'unknown argument 1'
fi

It seems the script tries to load the whole data file into GPU memory at once.
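For what it's worth, the "Tried to allocate 40.00 GiB" in the traceback is consistent with that: `data.view(bsz, -1).t().contiguous().to(device)` copies the entire tokenized corpus to the GPU in one allocation, and for int64 token ids that is simply 8 bytes per token. A rough back-of-the-envelope check (my own estimate, not code from the repo):

```python
def corpus_gpu_footprint_gib(num_tokens, bytes_per_token=8):
    """Memory needed to hold the whole corpus on the GPU, in GiB.

    bytes_per_token=8 assumes int64 token ids, PyTorch's default
    integer dtype for index tensors.
    """
    return num_tokens * bytes_per_token / 1024**3

# ~5.4 billion tokens at 8 bytes each is about 40 GiB -- far beyond
# a single 12 GB card, which matches the error message above.
print(round(corpus_gpu_footprint_gib(5_400_000_000), 1))
```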

agemagician commented 5 years ago

I solved the problem by changing line number 29 in data_utils.py:

# Evenly divide the data across the bsz batches.
#self.data = data.view(bsz, -1).t().contiguous().to(device)
self.data = data.view(bsz, -1).t().contiguous().to('cpu')

Apparently, train.py passes the CUDA device down to the iterator, so the whole corpus tensor was being moved to the GPU, and that was the issue.