keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

[DEV] Study of the memory usage of keras tests in travis.CI #12021

Closed gabrieldemarmiesse closed 3 years ago

gabrieldemarmiesse commented 5 years ago

It seems that the Travis timeout is back again. If we want to tackle it, and we think it's a memory issue, we should discuss it with hard numbers and avoid guesses where possible, which is why I made this small study.

After doing a lot of printing and parsing of travis logs, I managed to obtain what I think is useful information, even if I don't know how to reduce the memory usage.

How I gathered those numbers:

By adding a warning after each test (using a pytest fixture at the top level), I was able to display in the Travis logs the current memory consumption of each process running the tests. With a bit of parsing, I've put together the plots below, along with the test functions that showed a large memory increase after their execution.

Note that I ran gc.collect() before each measurement of memory usage to avoid measuring memory which was going to be freed afterwards.

We have two processes running for each build, which is why we have two pids for each build.

Note that at the time of measurement the test function has already finished executing, so the memory it allocated should already have been freed.
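The measurement described above could look roughly like the sketch below. It is not the code from the study: the `MEMUSAGE` message format and the helper names are made up for illustration, and reading `/proc/self/statm` assumes a Linux machine such as the Travis workers.

```python
import gc
import os
import warnings

# Assumption: Linux only. /proc/self/statm reports sizes in memory pages.
_PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def current_rss_mb():
    """Return the resident set size of the current process in MB."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * _PAGE_SIZE / (1024.0 * 1024.0)


def report_memory(test_name):
    """Collect garbage, then warn with the pid and the current memory usage."""
    gc.collect()  # avoid counting memory that is about to be freed anyway
    message = "MEMUSAGE pid=%d test=%s rss_mb=%.2f" % (
        os.getpid(), test_name, current_rss_mb())
    warnings.warn(message)  # shows up in the log if warnings are enabled
    return message


# In a top-level conftest.py this could be wired in with an autouse fixture:
#
#   @pytest.fixture(autouse=True)
#   def memory_after_test(request):
#       yield                               # the test body runs here
#       report_memory(request.node.name)
```

Because the pid is part of every message, the two processes of a build can be separated again when the log is parsed.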

There are 12 plots: 3 backends × 2 Python versions × 2 processes per build.

Maybe related issues:

11288

10340

11461

10100

10071

11984

@fchollet @Dref360 @taehoonlee @farizrahman4u

Can we conclude something from those numbers? Does it look strange? I don't know much about python memory management (just the basic reference counting).

[Plot: tensorflow_27_gc_pid_4987]
Test number 98 leaked 78.0 MB. The name is test_TensorBoard[batch]
Test number 273 leaked 79.0 MB. The name is test_dropout[GRU]
Test number 274 leaked 51.0 MB. The name is test_dropout[LSTM]
Test number 284 leaked 93.0 MB. The name is test_implementation_mode[GRU]
Test number 285 leaked 50.0 MB. The name is test_implementation_mode[LSTM]

[Plot: tensorflow_27_gc_pid_4990]
Test number 4 leaked 33.0 MB. The name is test_sequential_temporal_sample_weights
Test number 18 leaked 30.0 MB. The name is test_saving_model_with_long_weights_names
Test number 212 leaked 20.0 MB. The name is test_model_methods
Test number 240 leaked 336.0 MB. The name is test_convolutional_recurrent
Test number 344 leaked 102.0 MB. The name is test_Bidirectional

[Plot: tensorflow_36_gc_pid_5043]
Test number 2 leaked 10.58984375 MB. The name is test_masking_is_all_zeros
Test number 4 leaked 25.25390625 MB. The name is test_sequential_temporal_sample_weights
Test number 18 leaked 18.25 MB. The name is test_saving_model_with_long_weights_names
Test number 243 leaked 170.3515625 MB. The name is test_convolutional_recurrent
Test number 304 leaked 7.171875 MB. The name is test_implementation_mode[LSTM]

[Plot: tensorflow_36_gc_pid_5046]
Test number 2 leaked 11.18359375 MB. The name is test_masking
Test number 104 leaked 54.80078125 MB. The name is test_TensorBoard[batch]
Test number 108 leaked 18.1328125 MB. The name is test_TensorBoard_multi_input_output
Test number 325 leaked 11.9921875 MB. The name is test_builtin_rnn_cell_layer[LSTMCell]
Test number 344 leaked 43.7890625 MB. The name is test_Bidirectional

[Plot: theano_27_gc_pid_5025]
Test number 1 leaked 38.0 MB. The name is test_model_trainability_switch
Test number 11 leaked 12.0 MB. The name is test_sequential_model_saving_2
Test number 92 leaked 20.0 MB. The name is test_TensorBoard[batch]
Test number 93 leaked 11.0 MB. The name is test_TensorBoard[epoch]
Test number 141 leaked 20.0 MB. The name is test_rnn

[Plot: theano_27_gc_pid_5028]
Test number 0 leaked 39.0 MB. The name is test_layer_trainability_switch
Test number 4 leaked 21.0 MB. The name is test_sequential_temporal_sample_weights
Test number 11 leaked 11.0 MB. The name is test_functional_model_saving
Test number 91 leaked 12.0 MB. The name is test_in_top_k
Test number 164 leaked 88.0 MB. The name is test_convolutional_recurrent

[Plot: theano_3.6_gc_pid_5091]
Test number 1 leaked 26.234375 MB. The name is test_model_trainability_switch
Test number 2 leaked 7.52734375 MB. The name is test_masking
Test number 11 leaked 8.48828125 MB. The name is test_sequential_model_saving_2
Test number 99 leaked 12.07421875 MB. The name is test_in_top_k
Test number 174 leaked 57.9140625 MB. The name is test_convolutional_recurrent

[Plot: theano_3.6_gc_pid_5094]
Test number 0 leaked 25.21875 MB. The name is test_layer_trainability_switch
Test number 4 leaked 16.9453125 MB. The name is test_sequential_temporal_sample_weights
Test number 97 leaked 30.34765625 MB. The name is test_TensorBoard[batch]
Test number 143 leaked 7.71484375 MB. The name is test_gradient
Test number 146 leaked 13.90625 MB. The name is test_rnn

[Plot: cntk_27_gc_pid_6848]
Test number 18 leaked 4.0 MB. The name is test_saving_model_with_long_layer_names
Test number 43 leaked 5.0 MB. The name is test_orthogonal[CONV]
Test number 96 leaked 34.0 MB. The name is test_TensorBoard[batch]
Test number 100 leaked 5.0 MB. The name is test_TensorBoard_convnet
Test number 167 leaked 17.0 MB. The name is test_convolutional_recurrent

[Plot: cntk_27_gc_pid_6851]
Test number 4 leaked 9.0 MB. The name is test_sequential_temporal_sample_weights
Test number 97 leaked 27.0 MB. The name is test_in_top_k
Test number 288 leaked 5.0 MB. The name is test_dropout[LSTM]
Test number 315 leaked 6.0 MB. The name is test_Bidirectional
Test number 340 leaked 10.0 MB. The name is test_plot_sequential_embedding

[Plot: cntk_36_gc_pid_6968]
Test number 2 leaked 3.2265625 MB. The name is test_masking
Test number 4 leaked 4.234375 MB. The name is test_sequential_sample_weights
Test number 49 leaked 5.31640625 MB. The name is test_orthogonal[CONV]
Test number 102 leaked 32.3359375 MB. The name is test_TensorBoard[batch]
Test number 106 leaked 4.4765625 MB. The name is test_TensorBoard_convnet

[Plot: cntk_36_gc_pid_6971]
Test number 0 leaked 2.75 MB. The name is test_layer_trainability_switch
Test number 4 leaked 10.35546875 MB. The name is test_sequential_temporal_sample_weights
Test number 42 leaked 5.02734375 MB. The name is test_orthogonal[FC]
Test number 97 leaked 19.2890625 MB. The name is test_in_top_k
Test number 209 leaked 15.64453125 MB. The name is test_convolutional_recurrent

Dref360 commented 5 years ago

Tensorboard seems to be causing problems. Interesting!!! conv_recurrent is causing problems for 6 runs. top_k is also causing problems for 4 runs.

Great work! I will be able to investigate next week. If anyone wants to investigate this weekend, please let me know and PM me on Slack.
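The counts quoted in this comment can be checked by tallying how many of the 12 plots list each test among their top five. The snippet below is only an illustrative tally; the test names are copied from the lists in the first post, and the short dictionary keys are abbreviations of the plot names.

```python
from collections import Counter

# Top-5 test names per plot, copied from the first post.
top_leaks = {
    "tf_27_4987": ["test_TensorBoard[batch]", "test_dropout[GRU]", "test_dropout[LSTM]",
                   "test_implementation_mode[GRU]", "test_implementation_mode[LSTM]"],
    "tf_27_4990": ["test_sequential_temporal_sample_weights",
                   "test_saving_model_with_long_weights_names", "test_model_methods",
                   "test_convolutional_recurrent", "test_Bidirectional"],
    "tf_36_5043": ["test_masking_is_all_zeros", "test_sequential_temporal_sample_weights",
                   "test_saving_model_with_long_weights_names",
                   "test_convolutional_recurrent", "test_implementation_mode[LSTM]"],
    "tf_36_5046": ["test_masking", "test_TensorBoard[batch]",
                   "test_TensorBoard_multi_input_output",
                   "test_builtin_rnn_cell_layer[LSTMCell]", "test_Bidirectional"],
    "theano_27_5025": ["test_model_trainability_switch", "test_sequential_model_saving_2",
                       "test_TensorBoard[batch]", "test_TensorBoard[epoch]", "test_rnn"],
    "theano_27_5028": ["test_layer_trainability_switch",
                       "test_sequential_temporal_sample_weights",
                       "test_functional_model_saving", "test_in_top_k",
                       "test_convolutional_recurrent"],
    "theano_36_5091": ["test_model_trainability_switch", "test_masking",
                       "test_sequential_model_saving_2", "test_in_top_k",
                       "test_convolutional_recurrent"],
    "theano_36_5094": ["test_layer_trainability_switch",
                       "test_sequential_temporal_sample_weights",
                       "test_TensorBoard[batch]", "test_gradient", "test_rnn"],
    "cntk_27_6848": ["test_saving_model_with_long_layer_names", "test_orthogonal[CONV]",
                     "test_TensorBoard[batch]", "test_TensorBoard_convnet",
                     "test_convolutional_recurrent"],
    "cntk_27_6851": ["test_sequential_temporal_sample_weights", "test_in_top_k",
                     "test_dropout[LSTM]", "test_Bidirectional",
                     "test_plot_sequential_embedding"],
    "cntk_36_6968": ["test_masking", "test_sequential_sample_weights",
                     "test_orthogonal[CONV]", "test_TensorBoard[batch]",
                     "test_TensorBoard_convnet"],
    "cntk_36_6971": ["test_layer_trainability_switch",
                     "test_sequential_temporal_sample_weights", "test_orthogonal[FC]",
                     "test_in_top_k", "test_convolutional_recurrent"],
}

# Count each test once per plot, then rank by the number of plots it appears in.
plots_per_test = Counter(
    name for names in top_leaks.values() for name in set(names))

for name, plots in plots_per_test.most_common(4):
    print("%-45s %d plots" % (name, plots))
```

On these lists, test_convolutional_recurrent appears in 6 plots and test_in_top_k in 4, matching the comment. test_TensorBoard[batch] and test_sequential_temporal_sample_weights each appear in 6 as well.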

Dref360 commented 5 years ago

Do you observe the same behaviour if you disable pytest-xdist?

RaphaelMeudec commented 5 years ago

I'll try to dig into it this weekend and will keep you updated. @gabrieldemarmiesse could you push your work to a gist or a branch?

gabrieldemarmiesse commented 5 years ago

@RaphaelMeudec @Dref360 Thanks for the help. If you want the code I used to do this, you can find it in this branch https://github.com/gabrieldemarmiesse/keras/tree/investigate_timeout

The two important files are conftest.py, which prints the warnings, and plot_mem_usage.py, which parses the Travis logs and draws the plots.

You need to activate Travis for your fork of keras. It's free. Then download the logs as text files and put them in the logdir directory. Run plot_mem_usage.py and you should be good to go. Do not forget to re-enable the Python warnings in the .travis.yml for cntk. Warnings are currently disabled there, which hides the warnings that print the memory usage.
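For readers who only want the idea behind the parsing step, here is a minimal sketch. It is not the contents of plot_mem_usage.py: the `MEMUSAGE pid=... test=... rss_mb=...` line format, the function names and the sample log are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical log line format; the real warnings may look different.
LINE_RE = re.compile(r"MEMUSAGE pid=(\d+) test=(\S+) rss_mb=([\d.]+)")


def parse_log(text):
    """Return {pid: [(test_name, rss_mb), ...]} in log order."""
    usage = defaultdict(list)
    for match in LINE_RE.finditer(text):
        pid, name, rss = match.groups()
        usage[int(pid)].append((name, float(rss)))
    return dict(usage)


def biggest_increases(measurements, top=5):
    """Return the `top` largest memory increases as (number, name, delta_mb).

    The first measurement has no predecessor here, so numbering of the
    reported tests starts at 1. Results are sorted by test number.
    """
    deltas = []
    for number in range(1, len(measurements)):
        name, rss = measurements[number]
        previous_rss = measurements[number - 1][1]
        deltas.append((rss - previous_rss, number, name))
    largest = sorted(deltas, reverse=True)[:top]
    return sorted((number, name, delta) for delta, number, name in largest)


sample_log = """\
MEMUSAGE pid=4987 test=test_a rss_mb=100.0
MEMUSAGE pid=4990 test=test_b rss_mb=90.0
MEMUSAGE pid=4987 test=test_c rss_mb=178.0
MEMUSAGE pid=4987 test=test_d rss_mb=179.5
MEMUSAGE pid=4990 test=test_e rss_mb=95.0
"""

for pid, measurements in sorted(parse_log(sample_log).items()):
    for number, name, delta in biggest_increases(measurements):
        print("pid %d: Test number %d leaked %s MB. The name is %s"
              % (pid, number, delta, name))
```

Grouping by pid first matters because the two processes of a build write interleaved lines to the same log.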

Dref360 commented 5 years ago

Any update @RaphaelMeudec? You can PM me on Slack (same username) so that we can work on this together.

RaphaelMeudec commented 5 years ago

@Dref360 I had some work, so I haven't started yet. I will reach out to you when I start investigating.

Dref360 commented 5 years ago

Hello,

I made a script to figure out where the memory is leaking. So far, I've found that some TensorFlow ops are not being released. Maybe we're not releasing the memory correctly.

Gist : https://gist.github.com/Dref360/c2898d3e09d5286a6970203acf8a2b5f

Skip to line 126. Everything before that is just a copy-paste from test_TensorBoard, which is a reasonably good test for reproducing the leak.

linecache is used by tracemalloc. Modify the number of frames at line 146, and modify the filters at line 160 to pinpoint modules (TF and threading are the two most important).
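As a rough illustration of the tracemalloc approach (this is not the gist itself), the sketch below takes a snapshot before and after a deliberately leaky function and prints the traceback of the largest remaining allocation. The `leaked` list and `run_leaky_test` are stand-ins for a real test, and the commented `*tensorflow*` pattern is only an example of an inclusive filter.

```python
import linecache
import tracemalloc

tracemalloc.start(10)  # keep up to 10 frames per allocation traceback

leaked = []  # stand-in for objects that a test fails to release


def run_leaky_test():
    """Allocate 5 MB and keep a reference to it, simulating a leak."""
    leaked.append(bytearray(5 * 1024 * 1024))


filters = [
    # Exclude allocations made by the tooling itself.
    tracemalloc.Filter(False, tracemalloc.__file__),
    tracemalloc.Filter(False, linecache.__file__),
    # To pinpoint one module, an inclusive filter could be added instead,
    # e.g. tracemalloc.Filter(True, "*tensorflow*").
]

before = tracemalloc.take_snapshot().filter_traces(filters)
run_leaky_test()
after = tracemalloc.take_snapshot().filter_traces(filters)

# Differences are sorted by absolute size, so the first entry is the largest.
top_stats = after.compare_to(before, "traceback")
biggest = top_stats[0]

print("%.1f MB still allocated from:" % (biggest.size_diff / (1024.0 * 1024.0)))
for line in biggest.traceback.format():
    print(line)

tracemalloc.stop()
```

Raising the frame count passed to `tracemalloc.start` gives deeper tracebacks at the cost of more overhead.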

lichuanx commented 5 years ago

Interesting and useful work, waiting for solution.