Closed gabrieldemarmiesse closed 3 years ago
Tensorboard seems to be causing problems. Interesting!!! conv_recurrent is causing problems for 6 runs. top_k is also causing problems for 4 runs.
Great work! I will be able to investigate next week. If anyone wants to investigate this weekend, please let me know and PM me on Slack.
Do you observe the same behaviour if you disable pytest-xdist?
I'll try to dig into it this weekend and will keep you updated. @gabrieldemarmiesse, could you push your work to a gist or a branch?
@RaphaelMeudec @Dref360 Thanks for the help. If you want the code I used to do this, you can find it in this branch https://github.com/gabrieldemarmiesse/keras/tree/investigate_timeout
The two important files are conftest.py, which prints the warnings, and plot_mem_usage.py, which parses the Travis logs and plots the results.
You need to activate Travis for your fork of keras. It's free. Then download the logs as text files and put them in the logdir directory. Run plot_mem_usage.py and you should be good to go. Do not forget to re-enable the Python warnings for CNTK in .travis.yml. They are currently disabled, which hides the warnings that print the memory usage.
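For anyone who wants to redo the parsing step without the branch, here is a minimal sketch of the idea. It is not the actual plot_mem_usage.py: the log line format (`pid=... test=... rss_mb=...`) and the function names `parse_log` and `biggest_increases` are assumptions made for illustration, and the plotting part is left out.

```python
import re
from collections import defaultdict

# Hypothetical log line format, e.g. "pid=4987 test=test_rnn rss_mb=512.25".
# The real warnings in the Travis logs may look different.
LINE_RE = re.compile(r"pid=(\d+) test=(\S+) rss_mb=([\d.]+)")


def parse_log(text):
    """Group (test_name, rss_mb) measurements by process id, in log order."""
    per_pid = defaultdict(list)
    for match in LINE_RE.finditer(text):
        pid, name, rss = match.groups()
        per_pid[int(pid)].append((name, float(rss)))
    return dict(per_pid)


def biggest_increases(measurements, top=5):
    """Return (test_index, test_name, increase_mb) for the largest jumps.

    The first measurement has no baseline, so it is never reported.
    """
    deltas = []
    previous = None
    for index, (name, rss) in enumerate(measurements):
        if previous is not None:
            deltas.append((rss - previous, index, name))
        previous = rss
    deltas.sort(reverse=True)
    return [(index, name, delta) for delta, index, name in deltas[:top]]
```

Calling `biggest_increases(parse_log(open(path).read())[pid])` for each pid would give a per-process list of the tests after which memory grew the most.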
Any update @RaphaelMeudec? You can PM me on Slack (same username) so that we can work on this together.
@Dref360 Had some work so I haven't started yet. Will reach out to you when I start investigating
Hello,
I made a script to figure out where the memory is leaking. So far, I've found that some tensorflow ops are not being released. Maybe we're doing something wrong when releasing the memory.
Gist : https://gist.github.com/Dref360/c2898d3e09d5286a6970203acf8a2b5f
Skip to line 126; everything before that is just a copy-paste of test_TensorBoard, which is a reasonably good test for reproducing the leak.
linecache is used by tracemalloc. Modify the number of frames at line 146. Modify the filters at line 160 to pinpoint modules (TF and threading are the two most important).
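The general approach can be sketched with the standard library alone. This is not the gist itself: the helper name `top_growth` and its parameters are made up for illustration. It takes one tracemalloc snapshot before and one after a workload, keeps only the traces matching the given filename patterns, and returns the tracebacks whose allocated size grew the most.

```python
import gc
import linecache
import tracemalloc


def top_growth(workload, nframe=10, include=(), limit=5):
    """Run workload() and return the tracebacks that grew the most.

    nframe  -- number of stack frames stored per allocation
    include -- filename patterns to keep, e.g. ("*tensorflow*",);
               empty means keep everything
    """
    # Hide allocations made by tracemalloc itself and by linecache,
    # which tracemalloc uses to fetch source lines.
    filters = [
        tracemalloc.Filter(False, tracemalloc.__file__),
        tracemalloc.Filter(False, linecache.__file__),
    ]
    filters += [
        tracemalloc.Filter(True, pattern, all_frames=True)
        for pattern in include
    ]

    tracemalloc.start(nframe)
    try:
        before = tracemalloc.take_snapshot().filter_traces(filters)
        workload()
        gc.collect()  # drop garbage that would be freed anyway
        after = tracemalloc.take_snapshot().filter_traces(filters)
    finally:
        tracemalloc.stop()

    # Each entry is a StatisticDiff with size_diff, count_diff, traceback.
    return after.compare_to(before, "traceback")[:limit]
```

For example, `top_growth(run_one_test, include=("*tensorflow*", "*threading*"))` would restrict the report to those two modules, where `run_one_test` is a hypothetical callable wrapping the test body. Printing `stat.traceback.format()` for each result shows where the surviving allocations were made.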
Interesting and useful work, waiting for solution.
It seems that the Travis timeout is back again. If we want to tackle it, and we think it's a memory issue, we should discuss it with hard numbers and avoid guesses where possible, which is why I made this small study.
After doing a lot of printing and parsing of travis logs, I managed to obtain what I think is useful information, even if I don't know how to reduce the memory usage.
How I gathered those numbers:
By adding a warning after each test (using a pytest fixture at the top level), I was able to display the current memory consumption of each process running the tests in the Travis logs. With a bit of parsing, I put together those plots, as well as the list of functions that showed a large memory increase after their execution.
Note that I ran gc.collect() before each measurement of memory usage to avoid measuring memory that was about to be freed anyway. Two processes run the tests for each build, which is why there are two pids per build.
Note that at the time of measurement, the test function has finished executing, so the memory it allocated should already have been freed.
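A minimal sketch of this kind of measurement, for illustration only. It is not the conftest.py from the branch: the helper names and the message format are assumptions, and it reads /proc/self/statm, so it only works on Linux (which is what Travis runs).

```python
import gc
import os
import warnings


def current_rss_mb():
    """Resident set size of this process in MB (Linux only, reads /proc)."""
    with open("/proc/self/statm") as statm:
        resident_pages = int(statm.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024.0 * 1024.0)


def memory_report(test_name):
    """Collect garbage first, then describe the memory still in use."""
    gc.collect()
    return "pid={} test={} rss_mb={:.2f}".format(
        os.getpid(), test_name, current_rss_mb())


def warn_memory_usage(test_name):
    """Emit the report as a warning so that it shows up in the CI log."""
    warnings.warn(memory_report(test_name))


# Possible wiring in a top-level conftest.py (assumes pytest is installed):
#
# import pytest
#
# @pytest.fixture(autouse=True)
# def log_memory(request):
#     yield  # the test runs here
#     warn_memory_usage(request.node.name)
```

Because the fixture is autouse and the report is emitted after the `yield`, every test gets one measurement taken once its body has returned.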
There are 12 plots: 3 backends × 2 Python versions × 2 processes per build.
Maybe related issues:
11288
10340
11461
10100
10071
11984
@fchollet @Dref360 @taehoonlee @farizrahman4u
Can we conclude something from those numbers? Does it look strange? I don't know much about python memory management (just the basic reference counting).
tensorflow_27_gc_pid_4987
Test number 98 leaked 78.0 MB. The name is test_TensorBoard[batch]
Test number 273 leaked 79.0 MB. The name is test_dropout[GRU]
Test number 274 leaked 51.0 MB. The name is test_dropout[LSTM]
Test number 284 leaked 93.0 MB. The name is test_implementation_mode[GRU]
Test number 285 leaked 50.0 MB. The name is test_implementation_mode[LSTM]

tensorflow_27_gc_pid_4990
Test number 4 leaked 33.0 MB. The name is test_sequential_temporal_sample_weights
Test number 18 leaked 30.0 MB. The name is test_saving_model_with_long_weights_names
Test number 212 leaked 20.0 MB. The name is test_model_methods
Test number 240 leaked 336.0 MB. The name is test_convolutional_recurrent
Test number 344 leaked 102.0 MB. The name is test_Bidirectional

tensorflow_36_gc_pid_5043
Test number 2 leaked 10.58984375 MB. The name is test_masking_is_all_zeros
Test number 4 leaked 25.25390625 MB. The name is test_sequential_temporal_sample_weights
Test number 18 leaked 18.25 MB. The name is test_saving_model_with_long_weights_names
Test number 243 leaked 170.3515625 MB. The name is test_convolutional_recurrent
Test number 304 leaked 7.171875 MB. The name is test_implementation_mode[LSTM]

tensorflow_36_gc_pid_5046
Test number 2 leaked 11.18359375 MB. The name is test_masking
Test number 104 leaked 54.80078125 MB. The name is test_TensorBoard[batch]
Test number 108 leaked 18.1328125 MB. The name is test_TensorBoard_multi_input_output
Test number 325 leaked 11.9921875 MB. The name is test_builtin_rnn_cell_layer[LSTMCell]
Test number 344 leaked 43.7890625 MB. The name is test_Bidirectional

theano_27_gc_pid_5025
Test number 1 leaked 38.0 MB. The name is test_model_trainability_switch
Test number 11 leaked 12.0 MB. The name is test_sequential_model_saving_2
Test number 92 leaked 20.0 MB. The name is test_TensorBoard[batch]
Test number 93 leaked 11.0 MB. The name is test_TensorBoard[epoch]
Test number 141 leaked 20.0 MB. The name is test_rnn

theano_27_gc_pid_5028
Test number 0 leaked 39.0 MB. The name is test_layer_trainability_switch
Test number 4 leaked 21.0 MB. The name is test_sequential_temporal_sample_weights
Test number 11 leaked 11.0 MB. The name is test_functional_model_saving
Test number 91 leaked 12.0 MB. The name is test_in_top_k
Test number 164 leaked 88.0 MB. The name is test_convolutional_recurrent

theano_3.6_gc_pid_5091
Test number 1 leaked 26.234375 MB. The name is test_model_trainability_switch
Test number 2 leaked 7.52734375 MB. The name is test_masking
Test number 11 leaked 8.48828125 MB. The name is test_sequential_model_saving_2
Test number 99 leaked 12.07421875 MB. The name is test_in_top_k
Test number 174 leaked 57.9140625 MB. The name is test_convolutional_recurrent

theano_3.6_gc_pid_5094
Test number 0 leaked 25.21875 MB. The name is test_layer_trainability_switch
Test number 4 leaked 16.9453125 MB. The name is test_sequential_temporal_sample_weights
Test number 97 leaked 30.34765625 MB. The name is test_TensorBoard[batch]
Test number 143 leaked 7.71484375 MB. The name is test_gradient
Test number 146 leaked 13.90625 MB. The name is test_rnn

cntk_27_gc_pid_6848
Test number 18 leaked 4.0 MB. The name is test_saving_model_with_long_layer_names
Test number 43 leaked 5.0 MB. The name is test_orthogonal[CONV]
Test number 96 leaked 34.0 MB. The name is test_TensorBoard[batch]
Test number 100 leaked 5.0 MB. The name is test_TensorBoard_convnet
Test number 167 leaked 17.0 MB. The name is test_convolutional_recurrent

cntk_27_gc_pid_6851
Test number 4 leaked 9.0 MB. The name is test_sequential_temporal_sample_weights
Test number 97 leaked 27.0 MB. The name is test_in_top_k
Test number 288 leaked 5.0 MB. The name is test_dropout[LSTM]
Test number 315 leaked 6.0 MB. The name is test_Bidirectional
Test number 340 leaked 10.0 MB. The name is test_plot_sequential_embedding

cntk_36_gc_pid_6968
Test number 2 leaked 3.2265625 MB. The name is test_masking
Test number 4 leaked 4.234375 MB. The name is test_sequential_sample_weights
Test number 49 leaked 5.31640625 MB. The name is test_orthogonal[CONV]
Test number 102 leaked 32.3359375 MB. The name is test_TensorBoard[batch]
Test number 106 leaked 4.4765625 MB. The name is test_TensorBoard_convnet

cntk_36_gc_pid_6971
Test number 0 leaked 2.75 MB. The name is test_layer_trainability_switch
Test number 4 leaked 10.35546875 MB. The name is test_sequential_temporal_sample_weights
Test number 42 leaked 5.02734375 MB. The name is test_orthogonal[FC]
Test number 97 leaked 19.2890625 MB. The name is test_in_top_k
Test number 209 leaked 15.64453125 MB. The name is test_convolutional_recurrent