martinpopel opened this issue 6 years ago
In 1.5.1 I experience OOM when using `schedule=train`, even after 12h. When using `schedule=continuous_train_and_eval`, training doesn't crash, but the actual BLEU is lower than with `schedule=train`.
@mehmedes: The `continuous_train_and_eval` bug is reported in #556. It seems that `continuous_train_and_eval` starts training from a pseudo-random position in the data after each evaluation (according to @rsepassi it won't see the data as evenly as `schedule=train`). So it is possible that `continuous_train_and_eval` will crash as well once it reaches the position where `schedule=train` crashed, but it cannot be predicted when that will happen (theoretically it can take more than one epoch, considering that `continuous_train_and_eval` samples the training data with replacement).
If I train with `schedule=train` and the training crashes after a day or so, will T2T restart training from scratch if I restart training, or will it continue from where it crashed?
@mehmedes: the model weights and Adam momentum are stored in the checkpoint, so in this sense the training continues. However, according to https://github.com/tensorflow/tensor2tensor/issues/556#issuecomment-364306094 the position in the training data is not stored in the checkpoint (nor is there any attempt to reconstruct it from the `global_step` and `batch_size`), so the resumed training starts reading the data from scratch, which means choosing a random training file (but with a fixed random seed it should always be the same file).
So, restarting training wouldn't actually help, as it would only train up to the same point where it always crashes and not go beyond that?
Yes, my experience is that restarting does not help with OOM; it crashes after about the same time. Of course, there may be some non-determinism hidden somewhere. You can try increasing `worker_gpu_memory_fraction` from 0.95 to something higher (but lower than 1.0).
Does this continue to happen with `eval_drop_long_sequences=True`? I'm wondering if there's a super big eval batch that gets created and OOMs.
Oh, I forgot to mention: all my OOM failures reported here were with `--schedule=train`, so `eval_drop_long_sequences` makes no difference because there was no evaluation.
Interesting. And how is the OOM reported? Is the error consistent? Do you have a stack trace?
The OOM is reported like any other OOM: with a hundred-page memory dump (if really needed, I can find it in old logs and paste it somewhere). It is quite consistent, see e.g. this description: https://github.com/tensorflow/tensor2tensor/issues/637#issuecomment-370750170
A short update about this old issue: the memory complexity of the self-attention layer is quadratic in the length of the sequence. So e.g. with `batch_size=2000` (tokens), we can have very different memory requirements: a batch with two 1000-token sentences takes much more memory than a batch with 200 ten-token sentences.
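To put rough numbers on the comparison above, here is a back-of-the-envelope sketch. The assumption (not taken from the T2T code) is that self-attention memory grows roughly as `num_sentences * length**2`, since each sentence needs a length×length attention matrix per head and layer; constant factors are ignored:

```python
def attention_cells(num_sentences, tokens_per_sentence):
    """Relative memory cost of the attention matrices for one batch.

    Rough model only: cost ~ num_sentences * length^2, constants ignored.
    """
    return num_sentences * tokens_per_sentence ** 2

# Both batches contain batch_size = 2000 tokens in total.
long_batch = attention_cells(2, 1000)    # two 1000-token sentences
short_batch = attention_cells(200, 10)   # 200 ten-token sentences

print(long_batch)                  # 2000000
print(short_batch)                 # 20000
print(long_batch // short_batch)   # 100: same token count, ~100x the memory
```

So two batches with an identical token budget can differ by two orders of magnitude in attention memory, which is consistent with OOM striking only when an unlucky long-sentence batch comes along.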
I am not sure, but maybe the memory requirements are also affected by the amount of zero-padding in the batch.
Even if we set `max_length`, the memory requirements will vary between batches.
A possible solution would be to add code that creates batches smaller than `batch_size` when needed to prevent OOM (i.e. in the case of a small number of long sentences), but otherwise uses the full `batch_size`. The code would have to watch the available memory, and it would affect the batching and bucketing. It should also report how often this shrinking was needed, so users get feedback. I guess this is not so easy to implement and it would make the code more cryptic. So meanwhile, just follow the advice given above: set `batch_size` and `max_length` reasonably low to prevent OOM (by trying several combinations).
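The shrinking idea could be sketched roughly as follows. This is a hypothetical helper, not part of T2T; `mem_budget_cells` is an invented tuning knob standing in for the real memory watcher, capping the batch by the quadratic attention cost in addition to the usual token budget:

```python
def sentences_per_batch(batch_size_tokens, bucket_length, mem_budget_cells):
    """How many sentences of a given length fit in one batch, respecting
    both the token-based batch_size and a quadratic-memory budget.

    mem_budget_cells (hypothetical knob): max attention-matrix cells
    (sentences * length^2) allowed per batch.
    """
    by_tokens = batch_size_tokens // bucket_length        # usual token rule
    by_memory = mem_budget_cells // (bucket_length ** 2)  # quadratic cap
    return max(1, min(by_tokens, by_memory))

# With batch_size=2000 tokens and a budget of 200_000 cells:
print(sentences_per_batch(2000, 10, 200_000))    # 200 - token-limited
print(sentences_per_batch(2000, 1000, 200_000))  # 1   - memory-limited
```

For short sentences the token budget dominates and nothing changes; for long sentences the memory cap kicks in and the batch shrinks, which is exactly the behavior proposed above.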
Why not create a random batch with the maximum memory requirements and use it as the first training batch? A fail-fast approach: if that first batch leads to OOM, the user just reduces the batch size; if that first batch fits, every other batch should fit too, shouldn't it?
I think that's what fairseq does.
Good idea. That seems easier to implement: all the sentences in the dummy batch should have `max_length` tokens (on both the source and target side). If no `max_length` is specified, then there should be just a single sentence with `batch_size` tokens (which is perhaps too pessimistic, but there is no other way to prevent OOM; if needed, there can be a non-zero default `max_length`).
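A minimal sketch of the worst-case-batch construction described above (shapes only; actually running a training step on such a dummy batch at startup is left out, since that part is framework-specific):

```python
def worst_case_batch_shape(batch_size, max_length=None):
    """(num_sentences, sentence_length) of the most memory-hungry batch
    the bucketing scheme could produce, per the reasoning above."""
    if max_length is not None:
        # Every sentence padded to max_length on both source and target.
        num_sentences = max(1, batch_size // max_length)
        return (num_sentences, max_length)
    # Without max_length, the worst case is one batch_size-token sentence.
    return (1, batch_size)

print(worst_case_batch_shape(2000, max_length=100))  # (20, 100)
print(worst_case_batch_shape(2000))                  # (1, 2000)
```

Running one forward/backward pass on a dummy batch of this shape before real training would surface the OOM within minutes rather than after days.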
If I set `batch_size` too high, the training fails with an Out Of Memory error. This is expected. The annoying point is that the failure can happen e.g. after 3 hours of training (15k steps) or even after two and a half days of training (150k steps). Thus it is difficult to find the highest possible batch size (for a given model, optimizer and max_length, which are the other hyperparameters influencing memory consumption I am aware of).

When I set `batch_size` much too high, the training fails immediately (i.e. within two minutes, after building the graph on GPU and saving the first checkpoint). However, when I set `batch_size` just a bit too high, the failure usually comes in the first few thousand steps, but sometimes much later (the 3 hours, 15000 steps). Moreover, the number of steps before the OOM failure seems non-deterministic, and it is not always true that a higher `batch_size` fails earlier. It is also not true that the maximal safe batch size can be approximated as the maximal batch size that fails immediately minus max_length (which was my original hypothesis): it can be much lower. BTW: in https://github.com/tensorflow/tensor2tensor/issues/444#issuecomment-362917578 I report the empirical maximum batch size for a given model, optimizer and max_length.

My guess is that there is something wrong with the batching and bucketing. I was too lazy to check the source code carefully. Maybe it is allowed to exceed `batch_size` (the first sentence which exceeds the given number of subwords is kept in the batch, although it should be excluded). Maybe the memory is consumed by the queue of sentences for bucketing (although I would say these are in CPU RAM, not GPU memory). Anyway, it would be nice if the training failed either immediately or never.
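The suspected overshoot can be illustrated with a toy greedy packer (hypothetical, not the actual T2T input pipeline): if the sentence that crosses the token budget is kept in the current batch, that batch exceeds `batch_size`; the fix is to close the batch before adding the sentence:

```python
def pack_batches(sentence_lengths, batch_size, keep_overflow):
    """Greedy token-based batching. With keep_overflow=True, the sentence
    that crosses the budget stays in the current batch (the suspected bug);
    with keep_overflow=False, the batch is closed first (the fix)."""
    batches, current, tokens = [], [], 0
    for length in sentence_lengths:
        if tokens + length > batch_size and current:
            if keep_overflow:
                current.append(length)   # batch now exceeds batch_size
                batches.append(current)
                current, tokens = [], 0
                continue
            batches.append(current)      # close the batch before adding
            current, tokens = [], 0
        current.append(length)
        tokens += length
    if current:
        batches.append(current)
    return batches

lengths = [900, 900, 900]
buggy = pack_batches(lengths, 2000, keep_overflow=True)
fixed = pack_batches(lengths, 2000, keep_overflow=False)
print([sum(b) for b in buggy])  # [2700] - exceeds the 2000-token budget
print([sum(b) for b in fixed])  # [1800, 900] - never exceeds it
```

With long sentences the overshoot can be large (here 35% over budget), which would explain why OOM appears only when a particular unlucky sequence of long sentences happens to arrive.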