"as memory used during validation should be freed before training resumes"
Indeed, but not the other way around; training memory is not freed before validation starts. This means that the validation step can push the GPU memory over its limits if training already takes up a lot of memory. Same goes for the scoring step.
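Concretely, here is a rough sketch of the kind of thing the *_on_cpu options do (assuming a PyTorch-style net and a hypothetical score() helper; this is not the exact platalea code):

```python
import torch

def score_on_cpu(net, data):
    # Illustrative only: run the scoring pass on the CPU so it does not
    # allocate GPU memory on top of what training already holds.
    device = next(net.parameters()).device  # remember the training device
    net.to('cpu')
    with torch.no_grad():
        results = score(net, data)  # hypothetical scoring helper
    net.to(device)  # move the model back before training resumes
    return results
```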
If you want to see for yourself, you could run an experiment with and without the options and check out the GPU memory logs in wandb to see what's happening.
Does this answer your questions? If so, feel free to close.
What do you mean by "training memory"?
I tried to replicate the issue by running the basic experiment for a few steps but don't see anything significant (sometimes the memory increases by 2 MiB). Do you have a good example I can use to replicate the issue?
I did some more tests with the transformer model but don't observe any memory increase during validation. It would be nice if you could provide an example where this happens, so that I can try to track down the reason.
The setup I used to perform my tests:
python -m 'platalea.experiments.flickr8k.transformer' --epochs 1 --trafo_d_model=512 --trafo_encoder_layers=2 --trafo_heads=2 --trafo_feedforward_dim=512
I went back into my chaotic logs and found that run dandy-puddle-30 was the last run before I introduced --score-on-cpu (I introduced it because dandy-puddle-30 crashed with CUDA out of memory), and run brisk-haze-32 was the first run with --score-on-cpu. They have the same parameters, except for --score-on-cpu.
See this report for graphs: https://wandb.ai/spokenlanguage/platalea_transformer/reports/Memory-saved-by-score-on-cpu--Vmlldzo1MjM4NzE
Note that both runs were started with the command:
/home/egpbos/platalea/platalea/experiments/flickr8k/transformer.py -c /home/egpbos/transformer_experiments/0.yml --flickr8k_root=/home/egpbos/flickr8k_linked/ --trafo_d_model=512 --trafo_heads=4 --cyclic_lr_max=4.5e-05 --cyclic_lr_min=7.5e-06 --trafo_dropout=0.5 --epochs=32 --device=cuda:2
This is because I hadn't actually built the --score-on-cpu option yet for run 32; I just hacked in the functionality without making it optional/configurable. It should be reproducible, though, by adding the option.
Can you reproduce this?
Thank you @egpbos, I will try that.
I finally managed to reproduce the issue with commit f10d1dd. The problem comes from the introduction of the trafo_dropout parameter in commit f21580e, which controls the dropout rate (forced to 0 before that), without calling net.eval() and net.train() around the scoring function. This was later corrected in commit 216449d, hence my inability to reproduce the issue in the current version.
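For reference, a minimal sketch of the pattern that fixes it (assuming a PyTorch nn.Module called net and a hypothetical score() helper, not the exact platalea code):

```python
import torch

def evaluate(net, data):
    # Illustrative only: disable dropout during scoring and restore
    # training mode afterwards.
    net.eval()                      # dropout layers become no-ops
    with torch.no_grad():           # no gradients needed while scoring
        results = score(net, data)  # hypothetical scoring helper
    net.train()                     # re-enable dropout before training resumes
    return results
```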
I personally don't see a good use case for the parameters score_on_cpu and validate_on_cpu, and am thus in favor of simplifying the code by removing them. What do you think, @egpbos?
Oh, interesting! If indeed you no longer see differences between runs with the *_on_cpu options and without, then by all means feel free to remove them. They also slow down validation and scoring significantly (2-3x slower total runtime), so it was a pretty heavy trade-off anyway. Good riddance!
I just noticed some additions in basic.py which rely on two config variables, validate_on_cpu and score_on_cpu. According to @cwmeijer, this was used to save some memory. It is not clear to me why that would be the case, as memory used during validation should be freed before training resumes. @egpbos, do you have more details on this? In addition to satisfying my own curiosity, I would like to be sure we are not missing an issue in our memory management.