tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

Shuffle buffer causes OOM error on CPU (1.10.0) #1210

Closed · stefan-falk closed this issue 5 years ago

stefan-falk commented 5 years ago

I noticed that with 1.10.0 a shuffle buffer gets built up before training:

2018-11-09 11:48:04.525172: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 391 of 512
2018-11-09 11:48:14.233178: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 396 of 512
2018-11-09 11:48:29.700824: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 400 of 512
2018-11-09 11:48:33.617605: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 402 of 512
2018-11-09 11:48:50.017594: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 406 of 512
2018-11-09 11:48:56.350018: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 407 of 512

However, for one of my larger t2t problems this seems to cause an OOM error (CPU RAM). I am not sure whether this operation already happened before 1.10.0, but in any case I'd like to prevent this OOM error.

Why is a shuffle buffer being built up, and can I disable it, or at least control its size such that it fits into memory?


Error output:

2018-11-09 11:49:16.324220: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 413 of 512
2018-11-09 11:49:25.588304: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 415 of 512
2018-11-09 11:49:33.819391: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:97] Filling up shuffle buffer (this may take a while): 419 of 512
./train.sh: line 96:   712 Killed     t2t-trainer --generate_data --t2t_usr_dir=$USER_DIR --worker_gpu=$WORKER_GPU --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --problem=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR --train_steps=50000000 --save_checkpoints_secs=3600 --keep_checkpoint_max=5
stefan-falk commented 5 years ago

I noticed that --hp_batch_size=<value> or --hparams='batch_size=<value>' will trigger the creation of a shuffle buffer.

cbockman commented 5 years ago

Not on the t2t team, but I'm guessing you're hitting the internal shuffling:

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/data_reader.py#L155
https://github.com/tensorflow/tensor2tensor/blob/8bcbdccf85c0fc60f07945c469ff3213d2e0810d/tensor2tensor/data_generators/problem.py#L966
https://github.com/tensorflow/tensor2tensor/blob/8bcbdccf85c0fc60f07945c469ff3213d2e0810d/tensor2tensor/data_generators/problem.py#L559

Why is a shuffle buffer being built up, and can I disable it, or at least control its size such that it fits into memory?

One or more of those links should give you the tools to disable it.

That said, it is there because otherwise the data gets cycled through in a deterministic fashion, which is generally going to be subpar compared to shuffled data: 1) you'll be running through the data in the same order every epoch, and 2) your original data may have a hidden deterministic ordering.
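To illustrate (a toy sketch with plain tf.data, not t2t code, assuming TF 1.x graph mode): shuffle() keeps buffer_size elements resident in memory and samples uniformly from that window, so a bigger buffer means better mixing but proportionally more RAM.

import tensorflow as tf

# Toy dataset of 10 integers.
dataset = tf.data.Dataset.range(10)
# shuffle() holds `buffer_size` elements in memory and draws uniformly
# from that window; RAM cost is roughly buffer_size * bytes_per_element.
dataset = dataset.shuffle(buffer_size=4, seed=0)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    try:
        while True:
            print(sess.run(next_element))  # elements come out mixed
    except tf.errors.OutOfRangeError:
        pass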

If you really don't want to / can't do dataset shuffling, then you could instead shard the data into a very high number of shards, since (iirc) t2t will grab the shards in random order.
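For illustration, a hedged sketch of that route (the problem name is hypothetical; assumes your problem subclasses Text2TextProblem, where dataset_splits controls how many TFRecord shards are written at generation time):

from tensor2tensor.data_generators import problem, text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class MyShardedProblem(text_problems.Text2TextProblem):
  # generate_samples() etc. omitted; only the shard counts matter here.

  @property
  def dataset_splits(self):
    # Many small training shards instead of a few big ones, so that
    # reading the files in random order already mixes the data well.
    return [{
        "split": problem.DatasetSplit.TRAIN,
        "shards": 1000,
    }, {
        "split": problem.DatasetSplit.EVAL,
        "shards": 10,
    }]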

stefan-falk commented 5 years ago

@cbockman Thanks for answering!

I've already tried setting shuffle_buffer_size though (e.g. --hp_shuffle_buffer_size). For some reason it seems that there are two buffers being built up: one that uses size 1024 by default and another that uses 512 by default. I was not able to find the place in the code which sets that.

cbockman commented 5 years ago

One that uses size 1024 by default and another that uses 512 by default. I was not able to find the place in the code which sets that.

I linked this above; the second link I provided is where the 512 value is used. You can Ctrl-F to trace it back to https://github.com/tensorflow/tensor2tensor/blob/8bcbdccf85c0fc60f07945c469ff3213d2e0810d/tensor2tensor/data_generators/problem.py#L806, where it is specifically set.

stefan-falk commented 5 years ago

@cbockman Ah, thank you, I didn't realize. And now I also see that this behavior is new and actually came with 1.10.0!

What I do not understand is: 512 samples is just not that much, yet building the buffer consumes 64 GB of RAM and my entire swap device.

What is happening there?
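A rough back-of-the-envelope, assuming the buffer alone accounts for the memory: 64 GB / 512 elements ≈ 128 MB per buffered element. That would only add up if each element in this buffer is an entire padded batch rather than a single example, i.e. if whole batches are being shuffled.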

JanithT-Lboro commented 5 years ago

Hello @stefan-falk, did you ever find a solution to your problem? I am currently experiencing the same issue.

stefan-falk commented 5 years ago

@JanithT-Lboro I created a PR (https://github.com/tensorflow/tensor2tensor/pull/1231) which got accepted, and I think in the latest version you should be able to set the parameter, e.g. --hparams='batch_shuffle_size=0', to turn it off.

Another workaround, if you cannot upgrade to 1.12.0, would be to override tensor2tensor's Problem.input_fn() and pass a different value (or None, as I did) to the method:

def input_fn(self,
             mode,
             hparams,
             data_dir=None,
             params=None,
             config=None,
             force_repeat=False,
             prevent_repeat=False,
             dataset_kwargs=None,
             batch_shuffle_size=512):

    # TODO: In t2t < 1.11 we cannot disable batch_shuffle_size via hparams;
    # this override should no longer be necessary starting with v1.12.

    return super().input_fn(mode,
                            hparams,
                            data_dir=data_dir,
                            params=params,
                            config=config,
                            force_repeat=force_repeat,
                            prevent_repeat=prevent_repeat,
                            dataset_kwargs=dataset_kwargs,
                            batch_shuffle_size=None)
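For completeness, a sketch of how that override could be wired into a problem registered in the --t2t_usr_dir module (class and module names here are hypothetical):

# Hypothetical usr_dir module, picked up via --t2t_usr_dir.
from tensor2tensor.utils import registry

from my_problems import MyBaseProblem  # hypothetical existing Problem subclass


@registry.register_problem
class MyProblemNoBatchShuffle(MyBaseProblem):

    def input_fn(self, mode, hparams, **kwargs):
        # Force batch_shuffle_size=None regardless of what the caller passes
        # (workaround for t2t < 1.12).
        kwargs["batch_shuffle_size"] = None
        return super().input_fn(mode, hparams, **kwargs)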