"as memory used during validation should be freed before training resumes"
Indeed, but not the other way around; training memory is not freed before validation starts. This means that the validation step can push the GPU memory over its limits if training already takes up a lot of memory. Same goes for the scoring step.
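Concretely, here is a rough sketch of the kind of thing the *_on_cpu options do (assuming a PyTorch-style net and a hypothetical score() helper; this is not the exact platalea code):

```python
import torch

def score_on_cpu(net, data):
    # Illustrative only: run the scoring pass on the CPU so it does not
    # allocate GPU memory on top of what training already holds.
    device = next(net.parameters()).device  # remember the training device
    net.to('cpu')
    with torch.no_grad():
        results = score(net, data)  # hypothetical scoring helper
    net.to(device)  # move the model back before training resumes
    return results
```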
If you want to see for yourself, you could run an experiment with and without the options and check out the GPU memory logs in wandb to see what's happening.
Does this answer your questions? If so, feel free to close.
What do you mean by "training memory"?
I tried to replicate the issue by running the basic experiment for a few steps but don't see anything significant (sometimes the memory increases by 2 MiB). Do you have a good example I can use to replicate the issue?
I did some more tests with the transformer model but don't observe any memory increase during validation. It would be nice if you could provide an example where this happens, so that I can try to track down the reason.
The setup I used to perform my tests:
python -m 'platalea.experiments.flickr8k.transformer' --epochs 1 --trafo_d_model=512 --trafo_encoder_layers=2 --trafo_heads=2 --trafo_feedforward_dim=512
I went back into my chaotic logs and found that run dandy-puddle-30 was the last run before I introduced --score-on-cpu (I introduced it because dandy-puddle-30 crashed with CUDA out of memory), and run brisk-haze-32 was the first run with --score-on-cpu. They have the same parameters, except for --score-on-cpu.
See this report for graphs: https://wandb.ai/spokenlanguage/platalea_transformer/reports/Memory-saved-by-score-on-cpu--Vmlldzo1MjM4NzE
Note that both runs were started with the command:
/home/egpbos/platalea/platalea/experiments/flickr8k/transformer.py -c /home/egpbos/transformer_experiments/0.yml --flickr8k_root=/home/egpbos/flickr8k_linked/ --trafo_d_model=512 --trafo_heads=4 --cyclic_lr_max=4.5e-05 --cyclic_lr_min=7.5e-06 --trafo_dropout=0.5 --epochs=32 --device=cuda:2
This is because I hadn't actually built the --score-on-cpu option yet for run 32; I just hacked in the functionality without making it optional/configurable. It should be reproducible, though, by adding the option.
Can you reproduce this?
Thank you @egpbos, I will try that.
I finally managed to reproduce the issue with commit f10d1dd. The problem comes from the introduction of the trafo_dropout parameter in commit f21580e, which controls the dropout rate (forced to 0 before that), without calling net.eval() and net.train() around the scoring function. This was later corrected in commit 216449d, hence my inability to reproduce the issue in the current version.
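For reference, a minimal sketch of the pattern that fixes it (assuming a PyTorch nn.Module called net and a hypothetical score() helper, not the exact platalea code):

```python
import torch

def evaluate(net, data):
    # Illustrative only: disable dropout during scoring and restore
    # training mode afterwards.
    net.eval()                      # dropout layers become no-ops
    with torch.no_grad():           # no gradients needed while scoring
        results = score(net, data)  # hypothetical scoring helper
    net.train()                     # re-enable dropout before training resumes
    return results
```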
I personally don't see a good use case for the parameters score_on_cpu and validate_on_cpu, and am thus in favor of simplifying the code by removing them. What do you think, @egpbos?
Oh, interesting! If indeed you no longer see differences between runs with the *_on_cpu options and without, then by all means feel free to remove them. They also slow down validation and scoring significantly (2-3x slower total runtime), so it was a pretty heavy trade-off anyway. Good riddance!
I just noticed some additions in basic.py which rely on two config variables, validate_on_cpu and score_on_cpu. According to @cwmeijer, this was used to save some memory. It is not clear to me why that would be the case, as memory used during validation should be freed before training resumes. @egpbos, do you have more details on this? In addition to satisfying my own curiosity, I would like to be sure we are not missing an issue in our memory management.