krasserm / fairseq-image-captioning

Transformer-based image captioning extension for pytorch/fairseq
Apache License 2.0
312 stars 55 forks

WARNING: attempting to recover from OOM in forward/backward pass #23

Open pzhren opened 3 years ago

pzhren commented 3 years ago

Hi, I encountered an error during the self-critical sequence training (SCST) stage: `WARNING: attempting to recover from OOM in forward/backward pass`. Is this because the GPU memory is not enough? It feels strange, because sometimes training runs normally.

krasserm commented 3 years ago

Is this because the GPU memory is not enough?

Yes, this is the reason. The settings documented in the README are appropriate for 2 GTX 1080 cards (8 GB each).
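For context, this warning is printed when fairseq's trainer catches a CUDA out-of-memory error during the forward/backward pass and tries to skip the offending batch instead of crashing. A minimal sketch of that pattern (the function names here are illustrative, not fairseq's actual API):

```python
def run_step(batch, forward_backward):
    """Attempt one forward/backward pass; on CUDA OOM, warn and skip the batch.

    `forward_backward` stands in for the model's training step; in fairseq the
    trainer additionally zeroes gradients and frees cached GPU memory before
    continuing with the next batch.
    """
    try:
        return forward_backward(batch)
    except RuntimeError as e:
        if "out of memory" in str(e):
            print("WARNING: attempting to recover from OOM in forward/backward pass")
            return None  # signal the caller to skip this batch
        raise  # unrelated errors are re-raised


# Simulated demo: a step that raises an OOM-style error is skipped,
# a normal step returns its result.
def oom_step(batch):
    raise RuntimeError("CUDA out of memory")

assert run_step([1, 2], oom_step) is None
assert run_step([1, 2], lambda b: sum(b)) == 3
```

If the warning appears only occasionally, recovery usually succeeds and training continues; frequent OOMs mean the batch size should be reduced.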

pzhren commented 3 years ago

In fact, I used 3 GPUs with 11 GB each. The strange thing is that sometimes training runs normally, and sometimes it reports insufficient memory.

krasserm commented 3 years ago

Did you pre-train the model with CE loss before running SCST?

pzhren commented 3 years ago

Yes. I passed --max-sentences 2 and it ran normally, but I'm worried that it will affect performance. Will it have a significant impact? Besides, why not use .checkpoint/checkpoint_best.pt? Isn't that the best weight?

krasserm commented 3 years ago

Convergence improves with higher --max-sentences values (but also requires more memory). A value of 5 should work fine on 11 GB cards.
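For what it's worth, in fairseq --max-sentences is a per-GPU limit, and --update-freq accumulates gradients over several forward/backward passes before each optimizer step, so memory can be traded for a larger effective batch. A rough sketch of the arithmetic, assuming those fairseq semantics:

```python
def effective_batch_size(max_sentences, num_gpus, update_freq=1):
    """Sentences contributing to one optimizer update.

    --max-sentences is applied per GPU; --update-freq accumulates gradients
    over that many passes before each parameter update.
    """
    return max_sentences * num_gpus * update_freq


# e.g. --max-sentences 5 on 3 GPUs yields the same effective batch size as
# --max-sentences 5 with --update-freq 3 on a single GPU:
assert effective_batch_size(5, 3) == effective_batch_size(5, 1, update_freq=3)
```

So if --max-sentences 5 OOMs, a smaller per-GPU value combined with a higher --update-freq can keep the effective batch size (and thus convergence behavior) roughly comparable at lower peak memory.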

Regarding checkpoint_best.pt, this is the checkpoint with the best CE validation loss, but not necessarily the best CIDEr score (or any other evaluation metric). Checkpoint selection based on a user-defined metric should be automated, but I had other priorities in the past. I hope I can resume work on it soon. Pull requests are welcome too, of course!
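Until that is automated, a manual workaround is to evaluate each saved checkpoint on the validation set and keep the one with the best metric. A minimal sketch, where the `scores` dict stands in for running the repo's scoring script once per checkpoint (the paths and values below are hypothetical):

```python
def select_best_checkpoint(scores, higher_is_better=True):
    """Pick the checkpoint path with the best user-defined metric score.

    `scores` maps checkpoint path -> metric value (e.g. CIDEr on the
    validation set, computed externally for each checkpoint).
    """
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)


# Hypothetical CIDEr scores per epoch checkpoint:
scores = {"checkpoint10.pt": 0.98, "checkpoint12.pt": 1.04, "checkpoint_best.pt": 1.01}
assert select_best_checkpoint(scores) == "checkpoint12.pt"
```

For a loss-like metric (lower is better), pass `higher_is_better=False`.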

pzhren commented 3 years ago

I see, thank you.

pzhren commented 3 years ago

[screenshot of GPU memory usage] In fact, we found that while SCST was running, the memory load on one of the GPUs suddenly became too high. There is a serious load imbalance between the GPUs. Do you have a good solution?

pzhren commented 3 years ago

Don't worry, I found that memory usage gradually increases during training. This is the running state at --max-sentences 3. [screenshot of GPU memory usage]
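One way to confirm the gradual growth and the per-GPU imbalance is to log memory usage over time with nvidia-smi, e.g. (sampling every 5 seconds):

```shell
nvidia-smi --query-gpu=timestamp,index,memory.used --format=csv -l 5
```

Comparing the per-index columns over a few SCST updates shows whether one GPU's usage climbs faster than the others before the OOM.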

krasserm commented 3 years ago

What is the frequency of OOMs when you run with --max-sentences 5 or 8?

pzhren commented 3 years ago

Almost every time. The strange thing is that it only reports a memory error after SCST has run for one or two updates.