Open pzhren opened 3 years ago
Hi, I encountered some errors during the self-critical sequence training stage: `WARNING: attempting to recover from OOM in forward/backward pass`. Is this because the GPU memory is not enough? It feels very strange, because sometimes it runs normally.
Yes, this is the reason. The settings documented in the README are appropriate for 2 GTX 1080 cards (8 GB each).
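For reference, the recovery that warning refers to typically follows a catch-and-retry pattern like the sketch below. This is not the project's actual code; `train_step` and the halving logic are placeholders standing in for the real forward/backward pass:

```python
def train_step(batch):
    """Stand-in for a forward/backward pass; raises the way CUDA does on OOM."""
    if len(batch) > 4:  # pretend anything above 4 samples exhausts GPU memory
        raise RuntimeError("CUDA out of memory")
    return sum(batch)  # dummy "loss" so the example is checkable

def run_with_oom_recovery(batch):
    """Catch an OOM, warn, and retry on smaller chunks of the same batch."""
    try:
        return train_step(batch)
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise  # only recover from OOMs, re-raise anything else
        print("WARNING: attempting to recover from OOM in forward/backward pass")
        half = len(batch) // 2
        # Re-run the two halves separately; a real trainer would also free
        # cached memory and rescale/accumulate the gradients here.
        return run_with_oom_recovery(batch[:half]) + run_with_oom_recovery(batch[half:])

loss = run_with_oom_recovery(list(range(8)))  # 8 samples -> retried as two chunks of 4
```

Whether a given step OOMs depends on the longest sequence in that batch, which is why the error can appear only intermittently.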
In fact, I used 3 GPUs, each with 11 GB. The strange thing is that sometimes it runs normally, and sometimes it reports that GPU memory is insufficient.
Yes. I passed --max-sentences 2 and it ran normally, but I'm worried it will affect performance. Do you know if the impact will be significant? Also, why not use .checkpoint/checkpoint_best.pt? Isn't that the best weight?
Convergence improves with higher --max-sentences values (but also requires more memory). A value of 5 should work fine on 11 GB cards.
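If 5 still OOMs occasionally, another option (assuming this training script supports fairseq's --update-freq flag) is to keep --max-sentences small and accumulate gradients over several micro-batches instead; the averaged gradient is the same as with one large batch, while peak memory stays at the micro-batch level. A toy illustration with a fake per-sample gradient:

```python
def grad(sample):
    """Toy per-sample gradient; in reality this is one backprop pass."""
    return 2.0 * sample

samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# One big batch of 6 (high peak memory): average gradient over all samples.
big_batch_grad = sum(grad(s) for s in samples) / len(samples)

# Three micro-batches of 2 (low peak memory), gradients summed before a
# single optimizer step -- what --max-sentences 2 --update-freq 3 would do.
acc = 0.0
for i in range(0, len(samples), 2):
    micro = samples[i:i + 2]
    acc += sum(grad(s) for s in micro)
accumulated_grad = acc / len(samples)

assert abs(big_batch_grad - accumulated_grad) < 1e-9
```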
Regarding checkpoint_best.pt: this is the checkpoint with the best CE validation loss, but not necessarily the best CIDEr score (or any other evaluation metric). Checkpoint selection based on a user-defined metric should be automated, but I had other priorities in the past. I hope to resume work on it soon. Pull requests are welcome too, of course!
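Until that is automated, a manual workaround is to evaluate every saved checkpoint on the validation set and pick the best by the metric you care about. A minimal sketch; the checkpoint names and `evaluate_cider` are placeholders, not part of this repo:

```python
def evaluate_cider(checkpoint):
    """Placeholder: in practice, run generation + CIDEr scoring for one checkpoint."""
    scores = {"checkpoint10.pt": 1.05, "checkpoint11.pt": 1.12, "checkpoint12.pt": 1.09}
    return scores[checkpoint]

checkpoints = ["checkpoint10.pt", "checkpoint11.pt", "checkpoint12.pt"]

# Pick the checkpoint with the highest CIDEr rather than the lowest CE loss.
best = max(checkpoints, key=evaluate_cider)
print(best)
```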
I see, thank you.
In fact, we found that while SCST is running, the memory load on one of the GPUs suddenly becomes too high; there is a serious load imbalance between the GPUs. Do you have a good solution?
Never mind, I found that memory usage gradually increases during the run. This is the running state at --max-sentences 3.
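To quantify the imbalance over time, you can poll `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` and compare per-GPU usage. A sketch; the sample output string below is an illustrative reading (in MiB) for three cards, not real measurements:

```python
# Hypothetical output of:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
sample_output = "10890\n4512\n4498\n"

used = [int(line) for line in sample_output.split() if line]  # MiB per GPU
imbalance = max(used) / min(used)
print(f"per-GPU MiB: {used}, imbalance ratio: {imbalance:.2f}")
```

Logging this once per SCST step would show whether one rank's usage spikes right before the OOM.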
What is the frequency of OOMs when you run with --max-sentences 5 or 8?
I hit it almost every time; the strange thing is that the memory error is reported only after SCST has already run for one or two steps.