I noticed that --hp_batch_size=<value> or --hparams='batch_size=<value>' will trigger the creation of a shuffle buffer.
Not on t2t team, but I'm guessing you're hitting the internal shuffling:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/data_reader.py#L155
https://github.com/tensorflow/tensor2tensor/blob/8bcbdccf85c0fc60f07945c469ff3213d2e0810d/tensor2tensor/data_generators/problem.py#L966
https://github.com/tensorflow/tensor2tensor/blob/8bcbdccf85c0fc60f07945c469ff3213d2e0810d/tensor2tensor/data_generators/problem.py#L559
Why is a shuffle buffer being built up, and can I disable it or at least control its size so that it fits into memory?
One or more of those links should give you the tools to disable.
That said, it is there because otherwise you end up with the data being cycled through in a deterministic fashion, which is generally going to be subpar compared to shuffled data: 1) you'll be running through the data in the same order on every pass, and 2) your original data may have a hidden deterministic ordering.
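Purely to illustrate the trade-off, here is a generic tf.data sketch (TF 2.x eager style, not t2t code; the buffer size is arbitrary). The shuffle buffer keeps buffer_size elements in host memory and samples uniformly from them, so a bigger buffer mixes better but costs proportionally more RAM:

import tensorflow as tf

# Generic illustration of buffer-based shuffling, unrelated to t2t internals.
dataset = tf.data.Dataset.range(10000)

# The buffer holds `buffer_size` elements in host memory and draws uniformly
# from it; buffer_size=1 would effectively disable shuffling.
shuffled = dataset.shuffle(buffer_size=1024, seed=42)

for example in shuffled.take(3):
    print(example.numpy())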
If you really don't want to / can't do dataset shuffling, then you could instead shard the data into a very high number of shards, since (iirc) t2t will grab the shards in random order.
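A rough sketch of that sharding alternative, assuming your problem derives from Text2TextProblem (the class name and shard counts here are made up for illustration):

from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class MyHeavilyShardedProblem(text_problems.Text2TextProblem):
  """Hypothetical problem that spreads its training data over many shards."""

  @property
  def dataset_splits(self):
    # Many small TFRecord shards instead of a few big ones, so that reading
    # the shard files in a different order already mixes the data somewhat.
    return [
        {"split": problem.DatasetSplit.TRAIN, "shards": 1000},
        {"split": problem.DatasetSplit.EVAL, "shards": 1},
    ]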
@cbockman Thanks for answering!
I've already tried to set shuffle_buffer_size though (e.g. --hp_shuffle_buffer_size). For some reason it seems that there are two buffers being built up: one uses a size of 1024 by default and another uses 512 by default. I was not able to find the place in the code which sets that.
One uses a size of 1024 by default and another uses 512 by default. I was not able to find the place in the code which sets that.
I linked you this above; the second link I provided is where the 512 batch shuffle is used. You can ctrl-f to trace it back to https://github.com/tensorflow/tensor2tensor/blob/8bcbdccf85c0fc60f07945c469ff3213d2e0810d/tensor2tensor/data_generators/problem.py#L806, where it is specifically set.
@cbockman Ah, thank you, I didn't realize that. And now I also see that this behavior is new and actually came with 1.10.0!
What I do not understand is this: 512 samples is just not that much, yet building the buffer consumes 64GB of RAM and my entire swap device.
What is happening there?
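For context, that 512-element buffer holds whole padded batches rather than single samples, so a back-of-envelope estimate of its footprint looks like the sketch below (all numbers are purely illustrative, not taken from this issue):

def shuffle_buffer_gb(batch_shuffle_size, avg_bytes_per_batch):
    """Approximate host RAM, in GB, needed to keep the batch shuffle buffer filled."""
    return batch_shuffle_size * avg_bytes_per_batch / 1e9


# 512 buffered batches of roughly 100 MB of padded dense features each would
# already need on the order of 50 GB of host memory.
print(shuffle_buffer_gb(512, 100e6))  # ~51.2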
Hello @stefan-falk, did you ever find a solution to your problem? I am currently experiencing the same issue.
@JanithT-Lboro I created a PR (https://github.com/tensorflow/tensor2tensor/pull/1231) which got accepted, and I think in the latest version you should be able to set the parameter, e.g. --hparams=batch_shuffle_size=0, to turn it off.
Another workaround, if you cannot upgrade to 1.12.0, would be to override tensor2tensor's Problem.input_fn() and pass a different value (or None, as I did) to the method:
def input_fn(self,
             mode,
             hparams,
             data_dir=None,
             params=None,
             config=None,
             force_repeat=False,
             prevent_repeat=False,
             dataset_kwargs=None,
             batch_shuffle_size=512):
    # TODO: In t2t < 1.11 we cannot disable batch_shuffle_size via hparams;
    # this override should no longer be necessary starting with v1.12.
    return super().input_fn(mode,
                            hparams,
                            data_dir=data_dir,
                            params=params,
                            config=config,
                            force_repeat=force_repeat,
                            prevent_repeat=prevent_repeat,
                            dataset_kwargs=dataset_kwargs,
                            batch_shuffle_size=None)
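For completeness, one way to wire that override into a problem class, as a sketch with hypothetical names (assuming Python 3 and a problem derived from e.g. Text2TextProblem):

from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class MyProblemNoBatchShuffle(text_problems.Text2TextProblem):
  """Same pipeline as the parent problem, but never builds the batch shuffle buffer."""

  def input_fn(self, *args, batch_shuffle_size=512, **kwargs):
    # Ignore whatever batch_shuffle_size was requested and disable the batch shuffle.
    return super().input_fn(*args, batch_shuffle_size=None, **kwargs)

With --t2t_usr_dir pointing at the module that contains this class, it should then be selectable as --problem=my_problem_no_batch_shuffle.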
I noticed that with 1.10.0 a shuffle buffer gets built up before training.
However, for one of my larger t2t problems this seems to cause an OOM error (CPU RAM). I am not sure if this operation happened before 1.10.0, but in any case I'd like to do something about this OOM error.
Why is a shuffle buffer being built up, and can I disable it or at least control its size so that it fits into memory?
Error output: