tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

OOM Error after long training #581

Open martinpopel opened 6 years ago

martinpopel commented 6 years ago

If I set batch_size too high, the training fails with an Out Of Memory (OOM) error. This is expected. The annoying point is that the failure can happen e.g. after 3 hours of training (15k steps) or even after two and a half days of training (150k steps). This makes it difficult to find the highest possible batch size (for a given model, optimizer and max_length, which are the other hyperparameters I am aware of that influence memory consumption).

When I set batch_size much too high, the training fails immediately (i.e. within two minutes, after building the graph on the GPU and saving the first checkpoint). However, when I set batch_size just a bit too high, the failure usually comes in the first few thousand steps, but sometimes much later (the 3 hours / 15000 steps mentioned above). Moreover, the number of steps before the OOM failure seems non-deterministic, and it is not always true that a higher batch_size fails earlier. It is also not true that the maximal safe batch size can be approximated as the maximal batch size that fails immediately minus max_length (which was my original hypothesis) - it can be much lower. BTW: in https://github.com/tensorflow/tensor2tensor/issues/444#issuecomment-362917578 I report the empirical maximum batch size for a given model, optimizer and max_length.

My guess is that there is something wrong with the batching and bucketing, though I have not checked the source code carefully. Maybe batches are allowed to exceed batch_size (the first sentence which exceeds the given number of subwords is kept in the batch, although it should be excluded). Maybe the memory is consumed by the queue of sentences for bucketing (although I would expect that queue to live in CPU RAM, not GPU memory). Anyway, it would be nice if the training failed either immediately or never.
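
To illustrate the first suspicion: a greedy token-based packer that keeps the sentence which crosses the budget can overshoot batch_size by up to max_length tokens. This is only a toy sketch of the suspected behavior, not the actual tensor2tensor bucketing code; pack_batch is a hypothetical name.

```python
# Toy sketch of the suspected overshoot; NOT the actual tensor2tensor code.
def pack_batch(sentences, batch_size_tokens):
    """Greedily pack sentences until the token budget is reached."""
    batch, total_tokens = [], 0
    for sent in sentences:
        # If the sentence that crosses the budget is kept anyway,
        # the batch can exceed batch_size_tokens by up to max_length.
        batch.append(sent)
        total_tokens += len(sent)
        if total_tokens >= batch_size_tokens:
            break
    return batch

batch = pack_batch([["tok"] * 70, ["tok"] * 50], batch_size_tokens=100)
print(sum(len(s) for s in batch))  # 120 tokens, i.e. 20 over the budget
```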

mehmedes commented 6 years ago

In 1.5.1 I experience OOM when using schedule=train, even after 12h. When using schedule=continuous_train_and_eval the training doesn't crash, but the actual BLEU is lower than with schedule=train.

martinpopel commented 6 years ago

@mehmedes: The continuous_train_and_eval bug is reported in #556. It seems that continuous_train_and_eval starts training from a pseudo-random position in the data after each evaluation (according to @rsepassi it won't see the data as evenly as schedule=train). So it is possible that continuous_train_and_eval will crash as well once it gets to the position where schedule=train crashed, but it cannot be predicted when this will happen (theoretically it can take more than one epoch, considering that continuous_train_and_eval samples the training data with replacement).

mehmedes commented 6 years ago

If I train with schedule=train and the training crashes after 1 day or so, will T2T restart training from scratch if I restart training or will it continue from where it crashed?

martinpopel commented 6 years ago

@mehmedes: the model weights and the Adam momentum are stored in the checkpoint, so in this sense the training continues. However, according to https://github.com/tensorflow/tensor2tensor/issues/556#issuecomment-364306094 the position in the training data is not stored in the checkpoint (nor is there any attempt to reconstruct it from the global_step and batch_size), so the resumed training starts reading the data from scratch, which means choosing a random training file (but with a fixed random seed it should always be the same file).
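
You can verify what a checkpoint actually contains (the weights plus their Adam slots, but no data-position counter) with standard TF 1.x utilities; the train_dir path below is just a placeholder:

```python
import tensorflow as tf  # TF 1.x

# Print (name, shape) of every variable stored in the latest checkpoint.
# Weights appear together with their Adam slots (".../Adam", ".../Adam_1"),
# but there is no variable tracking the position in the training data.
ckpt = tf.train.latest_checkpoint("/path/to/train_dir")  # placeholder path
for name, shape in tf.train.list_variables(ckpt):
    print(name, shape)
```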

mehmedes commented 6 years ago

So, restarting training wouldn't actually help, as it would only train up to the same point where it always crashes and not get beyond that?

martinpopel commented 6 years ago

Yes, my experience is that restarting does not help with OOM, it crashes after about the same time. Of course, there may be some non-determinism hidden somewhere. You can try increasing worker_gpu_memory_fraction from 0.95 to something higher (but lower than 1.0).
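
If I understand it correctly, worker_gpu_memory_fraction maps to TensorFlow's per-process GPU memory fraction; a minimal TF 1.x sketch of the underlying setting (the 0.98 value is just an example):

```python
import tensorflow as tf  # TF 1.x

# Let the process claim 98% of GPU memory instead of the ~95% default.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.98)
config = tf.ConfigProto(gpu_options=gpu_options)
with tf.Session(config=config) as sess:
    pass  # training would run with this session config
```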

rsepassi commented 6 years ago

Does this continue to happen with eval_drop_long_sequences=True? I'm wondering if there's a super big eval batch that gets created and OOMs.

martinpopel commented 6 years ago

Oh, I forgot to mention that all the OOM failures I reported here happened with --schedule=train, so eval_drop_long_sequences makes no difference because there was no evaluation.

rsepassi commented 6 years ago

Interesting. And how is the OOM reported? Is the error consistent? Do you have a stack trace?

martinpopel commented 6 years ago

The OOM is reported as any other OOM - with a memory dump hundreds of pages long (if really needed, I can find it in old logs and paste it somewhere). It is quite consistent, see e.g. this description: https://github.com/tensorflow/tensor2tensor/issues/637#issuecomment-370750170

martinpopel commented 5 years ago

A short update on this old issue: the memory complexity of the self-attention layer is quadratic in the length of the sequence. So e.g. with batch_size=2000 (tokens), we can have very different memory requirements: a batch with two 1000-token sentences takes much more memory than a batch with 200 ten-token sentences. I am not sure, but maybe the memory requirements are also affected by the amount of zero-padding in the batch. Even if we set max_length, the memory requirements will vary between batches.
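
A back-of-the-envelope illustration of that difference, assuming the attention memory per sentence scales with length squared (per head and layer; constants are ignored here):

```python
# Self-attention materializes a (length x length) matrix per sentence
# (per head and layer), so memory grows quadratically with sentence length.
def attention_cells(n_sentences, length):
    return n_sentences * length ** 2

# Both batches contain 2000 tokens, yet their attention footprints differ by 100x.
print(attention_cells(2, 1000))  # 2,000,000 cells
print(attention_cells(200, 10))  # 20,000 cells
```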

A possible solution would be to add code that creates batches smaller than batch_size when needed to prevent the OOM (i.e. in the case of a small number of long sentences), but otherwise uses the full batch_size. That code would have to watch the available memory and it would affect the batching and bucketing. It should also report how often this shrinking was needed, so that users get feedback. I guess this is not so easy to implement and it would make the code more cryptic. So meanwhile, just follow the advice given above: set batch_size and max_length reasonably low to prevent OOM (by trying several combinations).
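
A rough sketch of a simpler, static variant of this idea, which does not watch the actual GPU memory but just caps the quadratic term per length bucket (all the names and budgets below are made up for illustration):

```python
def sentences_per_batch(batch_size_tokens, bucket_max_length, quadratic_budget):
    """Shrink the batch for long buckets so n_sentences * length^2 stays bounded."""
    by_tokens = max(1, batch_size_tokens // bucket_max_length)
    by_memory = max(1, quadratic_budget // bucket_max_length ** 2)
    return min(by_tokens, by_memory)

# With a 2000-token budget and a quadratic budget of 200,000 "cells":
print(sentences_per_batch(2000, 10, 200000))    # 200 short sentences fit
print(sentences_per_batch(2000, 1000, 200000))  # but only 1 long sentence does
```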

noe commented 5 years ago

Why not create a random batch with the maximum memory requirements and use it as the first training batch? Fail-fast approach: if that first batch leads to OOM, the user just reduces the batch size; if that first batch fits, every other batch should fit too, shouldn't it?

I think that's what fairseq does.

martinpopel commented 5 years ago

Good idea. That seems easier to implement - all the sentences in the dummy batch should have max_length tokens (on both the source and the target side). If no max_length is specified, then there should be just a single sentence with batch_size tokens (which is perhaps too pessimistic, but there is no other way to prevent OOM; if needed there can be a non-zero default max_length).
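
A minimal sketch of that fail-fast check in TF 1.x, with a tiny stand-in model instead of the real Transformer (the vocabulary size, hidden size and hyperparameter values below are arbitrary, and the real model's memory profile is of course different):

```python
import numpy as np
import tensorflow as tf  # TF 1.x

# Fail-fast sketch: before real training, run one step on a worst-case batch
# where every sentence is padded to max_length, so an OOM shows up immediately.
max_length = 256                 # assumed hparams.max_length
batch_size_tokens = 2000         # token budget per batch
n_sentences = max(1, batch_size_tokens // max_length)
vocab_size, hidden = 32000, 512  # arbitrary stand-in sizes

inputs = tf.placeholder(tf.int64, [None, max_length])
targets = tf.placeholder(tf.int64, [None, max_length])
emb = tf.get_variable("emb", [vocab_size, hidden])
states = tf.nn.embedding_lookup(emb, inputs)
logits = tf.layers.dense(states, vocab_size)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits))
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

dummy = np.ones((n_sentences, max_length), dtype=np.int64)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # If this worst-case step does not fit in GPU memory, we fail now
    # instead of hours into training.
    sess.run(train_op, {inputs: dummy, targets: dummy})
```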