Closed Avv22 closed 2 years ago
If anyone comes across OOM errors with "newer" versions of TF, there was a memory leak that was introduced in TF shortly after 2.8.2.
The tensorflow-addons module is deprecated and states their latest supported TF is 2.14.
With TF 2.14, I could see the used memory continuing to grow in nvidia-smi until training crashed with OOM.
Switching to TF 2.8.2 fixed this issue for me.
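For anyone hitting the same thing, pinning the last known-good release is straightforward; this assumes a pip-based install (adjust accordingly for conda or Docker):

```shell
# Pin TensorFlow to the last release before the leak was observed
pip install "tensorflow==2.8.2"

# Confirm the installed version
python -c "import tensorflow as tf; print(tf.__version__)"
```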
Hello,
We run the TensorFlow 2.1 implementation on our machine, which has 16 GB of RAM and a 4 GB GPU, as specified in your documentation:
#!/usr/bin/env bash
Then run ./train_python150k.sh as follows:
$ ./train_python150k.sh $DATA_DIR $DESC $CUDA $SEED
We got the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[320,26350] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] 0%| | 0/337723 [00:20<?, ?it/s]
Edit: I even tried a smaller Python dataset, around 1 GB for both train and test, and got the same error as above. The tensor size is large. The number of trainable parameters is around 5 million.
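For a rough sense of scale, the failing allocation in the error above can be sized by hand; this sketch just multiplies the reported shape by 4 bytes per float32 element (interpreting the dims as batch × vocab is my assumption):

```python
# Size of the tensor from the OOM message: shape [320, 26350], dtype float.
rows, cols = 320, 26350
bytes_per_float32 = 4  # assuming float32, TF's default float dtype

tensor_bytes = rows * cols * bytes_per_float32
tensor_mib = tensor_bytes / (1024 ** 2)

print(f"{tensor_bytes} bytes = {tensor_mib:.1f} MiB")
```

A single tensor of this shape is only ~32 MiB, small next to 4 GB of GPU memory, which suggests the OOM comes from many such buffers (activations, gradients, optimizer state) being alive at once rather than from any one tensor.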
I changed the config.py file as follows (divided all values by 2 for each variable) relative to the original config.py; I am not sure whether this is recommended. I run the training script with the default option, so I changed the default section of config.py:

$ ./train_python150k.sh $DATA_DIR default $CUDA $SEED
Note: the model is still training, so I am not sure what the output will be. It has finished 1 epoch so far, so it seems my issue was the buffer/shuffle size. However, do you think halving the parameters would affect your model's training? If this is not recommended, could you please suggest acceptable reduced parameter values, since your original config.py configuration gives me an OOM error?
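Since the shuffle buffer seems to be the culprit, one way to pick a safer value is to estimate the buffer's memory footprint before training. The buffer size, sequence length, and bytes-per-token below are hypothetical illustration values, not numbers from this repo's config.py:

```python
def shuffle_buffer_mib(buffer_size: int, seq_len: int, bytes_per_token: int = 4) -> float:
    """Rough memory footprint of a tf.data shuffle buffer, in MiB.

    Assumes each buffered example is a dense sequence of seq_len
    4-byte tokens; real examples may be larger (labels, masks, padding).
    """
    return buffer_size * seq_len * bytes_per_token / (1024 ** 2)

# Hypothetical numbers: halving a 100k-example buffer of 512-token sequences
full = shuffle_buffer_mib(100_000, 512)
half = shuffle_buffer_mib(50_000, 512)
print(f"full buffer = {full:.0f} MiB, halved = {half:.0f} MiB")
```

Note that a tf.data shuffle buffer normally lives in host RAM, so shrinking it mainly relieves the 16 GB system-memory side; GPU memory pressure is usually driven by the batch size and model size instead.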