tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

I got Out of Memory Error during Training #108

Closed Avv22 closed 2 years ago

Avv22 commented 2 years ago

Hello,

We ran the TensorFlow 2.1 implementation on our machine, which has 16 GB of RAM and a 4 GB GPU, as specified in your documentation:

#!/usr/bin/env bash

DATA_DIR=$(pwd)/data/
data_dir=$1
data_name=$(basename "${data_dir}")
data=${data_dir}/${data_name}
test=${data_dir}/${data_name}.val.c2s
run_name=$2
model_dir=$(pwd)/models/python150k-${run_name}
save=$(pwd)/model
SEED=239
DESC=default
CUDA=1

mkdir -p "${model_dir}"
set -e
export CUDA_VISIBLE_DEVICES=1  # exported so the python process below inherits the GPU selection
python -u code2seq.py \
  --data="${data}" \
  --test="${test}" \
  --save="${save}" \
  --seed="${SEED}"

Then we run ./train_python150k.sh as follows:

$ ./train_python150k.sh $DATA_DIR $DESC $CUDA $SEED
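
As written, the script itself only reads its first two positional arguments ($1 as the data directory and $2 as the run name); the seed and the GPU index are set inside the script, so the trailing $CUDA and $SEED arguments are ignored. A hypothetical invocation that matches what the script actually consumes:

# Only the first two arguments are used by train_python150k.sh as quoted above.
./train_python150k.sh "$DATA_DIR" default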

We got the following error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[320,26350] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] 0%| | 0/337723 [00:20<?, ?it/s]

Edit: I even tried a smaller Python dataset, around 1 GB for both train and test, and got the same error as above. The tensor size is large, and the number of trainable parameters is around 5 million.

I changed the config.py file to the following (roughly halving the values of most variables); I am not sure whether this is recommended:

class Config:
    @staticmethod
    def get_default_config(args):
        config = Config(args)
        config.NUM_EPOCHS = 3000
        config.SAVE_EVERY_EPOCHS = 1
        config.PATIENCE = 10
        config.BATCH_SIZE = 256
        config.TEST_BATCH_SIZE = 128
        config.READER_NUM_PARALLEL_BATCHES = 1
        config.SHUFFLE_BUFFER_SIZE = 50000
        config.CSV_BUFFER_SIZE = 50 * 512 * 512  # ~12.5 MB
        config.MAX_CONTEXTS = 200
        config.SUBTOKENS_VOCAB_MAX_SIZE = 80000
        config.TARGET_VOCAB_MAX_SIZE = 14000
        config.EMBEDDINGS_SIZE = 64
        config.RNN_SIZE = 64 * 2  # Two LSTMs to embed paths, each of size 64
        config.DECODER_SIZE = 150
        config.NUM_DECODER_LAYERS = 1
        config.MAX_PATH_LENGTH = 8 + 1
        config.MAX_NAME_PARTS = 5
        config.MAX_TARGET_PARTS = 6
        config.EMBEDDINGS_DROPOUT_KEEP_PROB = 0.75
        config.RNN_DROPOUT_KEEP_PROB = 0.5
        config.BIRNN = True
        config.RANDOM_CONTEXTS = True
        config.BEAM_WIDTH = 0
        config.USE_MOMENTUM = True
        return config

The original config.py file has:

class Config:
    @staticmethod
    def get_default_config(args):
        config = Config(args)
        config.NUM_EPOCHS = 3000
        config.SAVE_EVERY_EPOCHS = 1
        config.PATIENCE = 10
        config.BATCH_SIZE = 128
        config.READER_NUM_PARALLEL_BATCHES = 1
        config.SHUFFLE_BUFFER_SIZE = 10000
        config.CSV_BUFFER_SIZE = 100 * 1024 * 1024  # 100 MB
        config.MAX_CONTEXTS = 100
        config.SUBTOKENS_VOCAB_MAX_SIZE = 190000
        config.TARGET_VOCAB_MAX_SIZE = 27000
        config.EMBEDDINGS_SIZE = 128
        config.RNN_SIZE = 128 * 2  # Two LSTMs to embed paths, each of size 128
        config.DECODER_SIZE = 320
        config.NUM_DECODER_LAYERS = 1
        config.MAX_PATH_LENGTH = 8 + 1
        config.MAX_NAME_PARTS = 5
        config.MAX_TARGET_PARTS = 6
        config.EMBEDDINGS_DROPOUT_KEEP_PROB = 0.75
        config.RNN_DROPOUT_KEEP_PROB = 0.5
        config.BIRNN = True
        config.RANDOM_CONTEXTS = True
        config.BEAM_WIDTH = 0
        config.USE_MOMENTUM = True
        return config
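
For context, the OOM tensor shape reported above, [320, 26350], lines up with these defaults: 320 matches DECODER_SIZE, and 26350 is close to the TARGET_VOCAB_MAX_SIZE cap, which suggests the failing allocation is a decoder-output (logits-sized) tensor. A rough, back-of-the-envelope size check, assuming float32:

# Approximate size of one [320, 26350] float32 tensor (an illustration, not taken from the repo).
rows, cols, bytes_per_float32 = 320, 26350, 4
print(rows * cols * bytes_per_float32 / 1e6, "MB")  # about 33.7 MB

A single allocation of roughly 34 MB failing usually means the 4 GB GPU is already almost full from the batch, the embeddings, and the RNN state, and this tensor is simply the one that tips it over.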

I run the training script with the default option, so I changed the default config shown above in config.py:

$ ./train_python150k.sh $DATA_DIR default $CUDA $SEED

Note: the model is still training, so I am not sure what the output will be. It has finished 1 epoch so far, so it seems my issue was the buffer/shuffle size. However, do you think halving the parameters would affect your model's training? If that is not recommended, could you please suggest acceptable, smaller parameter values, since your original config.py configuration gives me an OOM error?
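
Not speaking for the authors, but as a sketch of which fields in the config above most directly drive per-step GPU memory (the values below are hypothetical): batch size, contexts per example, decoder size, and the target-vocabulary cap multiply into the largest tensors, whereas SHUFFLE_BUFFER_SIZE and CSV_BUFFER_SIZE mostly consume host RAM rather than GPU memory.

# Hypothetical, memory-focused values; these lines would replace the corresponding
# assignments inside Config.get_default_config in config.py shown above.
config.BATCH_SIZE = 64                 # per-step activation memory scales with batch size
config.TEST_BATCH_SIZE = 64
config.MAX_CONTEXTS = 100              # match the original default above instead of raising it
config.DECODER_SIZE = 256              # smaller decoder shrinks the decoder-output tensors
config.TARGET_VOCAB_MAX_SIZE = 20000   # narrower output projection and logits

Lowering the vocabulary caps and model sizes can affect accuracy, so if you need results comparable to the paper, reducing only BATCH_SIZE (and TEST_BATCH_SIZE) is the least invasive first step.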

chaseleif commented 1 week ago

If anyone comes across OOM errors with "newer" versions of TF: a memory leak was introduced in TF shortly after 2.8.2. The tensorflow-addons package is deprecated and states that its latest supported TF is 2.14. With TF 2.14, I could see the used memory continuing to grow in nvidia-smi until training crashed with OOM. Switching to TF 2.8.2 fixed this issue for me.
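
For anyone reproducing that workaround, a minimal sketch (the exact packages and the compatible tensorflow-addons release are assumptions; check its compatibility table before installing):

# Pin TF to the version reported above to avoid the leak, then watch GPU memory during training.
pip install "tensorflow==2.8.2"
pip install tensorflow-addons   # pick a release whose compatibility table lists TF 2.8
watch -n 1 nvidia-smi           # memory use should stay roughly flat from epoch to epoch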