neuralhydrology / neuralhydrology

Python library to train neural networks with a strong focus on hydrological applications.
https://neuralhydrology.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Performance of cudalstm slower after upgrade to v1.7.0 and recent pytorch images #147

Open shamshaw opened 9 months ago

shamshaw commented 9 months ago

Hi, I was wondering if you all have done any benchmarking of the cudalstm model with recent package updates, running on more recent PyTorch images? I know there are a lot of variables to consider in the computing environment, and I'm sure this could be an issue on our end, but on our project we have been seeing a significant increase in training time after updating neuralhydrology and PyTorch.

For training a cudalstm model on ~400 sites we have seen a shift like this:

- neuralhydrology v1.4.0 on a PyTorch 1.11 / CUDA 11.3 / Python 3.8 image: ~40 iterations/sec
- neuralhydrology v1.7.0 on a PyTorch 2.0 / CUDA 11.8 / Python 3.10 image: ~10 iterations/sec

All other model configuration settings and the hardware size (running in AWS SageMaker) remain the same. I'll keep investigating our setup, but I thought I'd double check that you haven't seen this type of change in training times. Thanks!

kratzert commented 9 months ago

Hi Scott,

thanks for reporting this. I'm not using NH myself as extensively as in the past and only use it here and there for experiments that we intend to publish (for reproducibility reasons), so I cannot say that I have carefully benchmarked anything recently. That being said, I trained ~100k models last month for an upcoming paper using the most recent version at the time, and I didn't have the feeling that it was much slower. I am also not aware of any change on our side in the last versions that would have any effect on the training loop. That being said, I'll see if I can spin up a GCP VM and benchmark the speed of different versions.

Also, just to double check: I assume this was not something you noticed for one random run but a general trend? I just want to exclude the possibility that data loading/processing was slowed down for one run due to some heavy CPU load from another program, which could also affect the training speed (i.e. the dataloader is not able to provide mini-batches fast enough to keep the GPU busy).
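To illustrate the kind of check this points at, here is a minimal, hypothetical diagnostic sketch (not neuralhydrology code): it times how many mini-batches per second a DataLoader can deliver on its own, with no GPU work at all. If that rate is close to the ~10 it/s seen during training, data loading is the likely bottleneck. The dataset shapes and loader settings below are placeholders.

```python
# Hypothetical diagnostic, not neuralhydrology code: measure how many mini-batches
# per second the DataLoader can deliver on its own (no GPU work at all).
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def main():
    # Placeholder data shaped like (samples, seq_length, n_features) -> (samples, 1);
    # in practice you would plug in the real dataset from the training run.
    dataset = TensorDataset(torch.randn(10_000, 365, 5), torch.randn(10_000, 1))
    loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=4)

    start = time.perf_counter()
    n_batches = sum(1 for _ in loader)
    elapsed = time.perf_counter() - start
    print(f"DataLoader alone: {n_batches / elapsed:.1f} batches/sec")


if __name__ == "__main__":
    main()
```

Comparing this number between the old and new SageMaker images (with identical `num_workers`) would separate a data-loading slowdown from a GPU/compute slowdown.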

shamshaw commented 9 months ago

Yes, this is something we have seen across a number of experiments, and it has been consistent when trying different numbers of dataloader workers, with TensorBoard logging enabled, and with some experimentation with other configuration settings. I don't believe it is a performance (RAM/CPU/GPU) bottleneck, as we've tried it on various sized computing instances, including ones with more powerful hardware than we had been using before. But we still haven't ruled out that something in the AWS SageMaker computing environments we are using, some configuration setting or similar, is causing the difference in training speed. Appreciate you looking into the benchmarking!

kratzert commented 9 months ago

Are these numbers from a public dataset? E.g., could we share a config for, let's say, CAMELS US, both run the same config on different machines, and then compare numbers? I can attach some random NH config once I am home, or if you have one at hand, feel free to drop it here.

shamshaw commented 9 months ago

The above was from a project dataset that isn't public yet, but I just uploaded CAMELS_US to our account and can test using that dataset. I could run other NH configs, but here's one I just used to test against the CAMELS US dataset (I changed the extension to .txt so it would upload here): camels_us_multi_basin.yml.txt
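For reference, a minimal sketch of how either side could launch training from that shared config, assuming neuralhydrology's documented `start_run` entry point (the CLI equivalent would be `nh-run train --config-file ...`). The filename below refers to the attached config with the .txt suffix removed, and a CUDA device is assumed since the config uses the cudalstm model.

```python
# Minimal sketch (assumptions noted above): launch a training run from the
# shared config via neuralhydrology's Python entry point.
from pathlib import Path

from neuralhydrology.nh_run import start_run

if __name__ == "__main__":
    # Assumes camels_us_multi_basin.yml sits next to this script and that a
    # CUDA device is available (the config uses the cudalstm model).
    start_run(config_file=Path("camels_us_multi_basin.yml"))
```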

shamshaw commented 6 months ago

Hey folks, just following up on this: the training performance difference I was seeing does not seem to be directly related to neuralhydrology versions. I was able to do some further testing and got similar training iteration speeds between versions 1.4.0 and 1.9.1. FWIW, I'm still trying to figure out why training speeds change so much when the only difference is switching to a newer PyTorch image in AWS SageMaker (no changes to dataset or code), but I think that's independent of neuralhydrology. Having some benchmarks to compare against would still be a helpful point of reference, but feel free to close this issue. Thanks

kratzert commented 5 months ago

Thanks for reporting this feedback. Happy to hear I don't need to deep-dive into solving new performance issues in NH :joy: Regarding the benchmarking table: that actually sounds like a nice idea. What kind of information would you put in such a table to make it interesting for users? E.g., let's say we pick the default cudalstm: batch size, hidden size, sequence length, and GPU model would need to be reported to get some estimate of the it/s speed, no? Maybe also the num_workers argument for the dataloader?
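As a rough illustration of what one entry in such a table would measure, here is a hypothetical micro-benchmark of a plain `torch.nn.LSTM` (which is what the cudalstm model wraps) plus a linear head, timing forward+backward it/s for a given batch size, hidden size, and sequence length. The specific sizes are arbitrary placeholders, and this is not neuralhydrology's actual training loop.

```python
# Hypothetical micro-benchmark, not NH's training loop: it/s of forward+backward
# passes of an LSTM + linear head for a given batch size, hidden size, seq length.
import time

import torch

batch_size, seq_length, n_inputs, hidden_size = 256, 365, 5, 256
device = "cuda" if torch.cuda.is_available() else "cpu"

lstm = torch.nn.LSTM(n_inputs, hidden_size, batch_first=True).to(device)
head = torch.nn.Linear(hidden_size, 1).to(device)
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))

# Synthetic batch reused every iteration, so data loading plays no role here.
x = torch.randn(batch_size, seq_length, n_inputs, device=device)
y = torch.randn(batch_size, 1, device=device)

n_iters = 100
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(n_iters):
    optimizer.zero_grad()
    out, _ = lstm(x)
    loss = torch.nn.functional.mse_loss(head(out[:, -1]), y)
    loss.backward()
    optimizer.step()
if device == "cuda":
    torch.cuda.synchronize()
print(f"{n_iters / (time.perf_counter() - start):.1f} it/s")
```

A table could then list batch size, hidden size, sequence length, GPU model, num_workers, and the measured it/s per neuralhydrology/PyTorch version.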