keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Potential memory leak with SymbolicTensors #19058

Open alxhoff opened 5 months ago

alxhoff commented 5 months ago

Hi,

I am posting here as I am unsure whether this is a TensorFlow or a Keras problem, and I am slowly getting more and more desperate for a solution. My memory consumption steadily grows over iterations of my neural architecture search code, in which hundreds, if not thousands, of Keras models are created and trained. I created an issue with TensorFlow here last week, but looking into the code now, I am wondering if it is on the Keras end. I don't have a good knowledge of TensorFlow or Keras internals, so I am unsure at the moment which, if either, is responsible.

I believe the problem is with one or more of the SymbolicTensors created when a Conv2D layer is built; they seem to persist even after the model has ceased to be used, and I have been unable to release them using garbage collection.

In the TensorFlow issue I have detailed my versions and provided a minimal code example that reproduces the problem.
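For readers who cannot open the linked issue, the loop being described looks roughly like the following. This is a hypothetical sketch, not the actual repro: the model architecture, the random data, and the psutil-based memory measurement are all illustrative.

```python
import gc
import os

import numpy as np
import psutil
import tensorflow as tf

def build_model():
    # Small Conv2D model; the SymbolicTensors mentioned above are created here.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

x = np.random.rand(64, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=(64,))

process = psutil.Process(os.getpid())
for i in range(100):
    model = build_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x, y, epochs=1, verbose=0)

    # Try to release everything the model allocated.
    del model
    tf.keras.backend.clear_session()
    gc.collect()

    rss_mb = process.memory_info().rss / 1e6
    print(f"iteration {i}: RSS = {rss_mb:.1f} MB")
```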

Any insight would be very much appreciated.

SuryanarayanaY commented 5 months ago

Hi @alxhoff ,

I have tested the code with Keras 3 on both the TensorFlow and Torch backends and observed memory leakage with both, although at a slower pace than reported for tf.keras.

Attached gist for reference.

Escalating the issue to Dev team.
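For context, switching backends in Keras 3 is typically done through the KERAS_BACKEND environment variable before Keras is imported; a minimal sketch (the backend chosen here is just an example):

```python
import os
os.environ["KERAS_BACKEND"] = "torch"  # or "tensorflow", "jax"; must be set before importing keras

import keras
print(keras.backend.backend())  # prints the active backend, e.g. "torch"
```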

alxhoff commented 5 months ago

Thank you @SuryanarayanaY!

haifeng-jin commented 2 months ago

Sorry, I found that I do not have enough time to get to this issue. Unassigning myself and putting it into another round of issue triage.

jeffcarp commented 2 months ago

Thanks for the detailed repro. I think this is related to tf.function being created repeatedly in the training loop. When I run the repro on my local machine, I get this message in the logs:

WARNING:tensorflow:5 out of the last 5 calls to <function TensorFlowTrainer.make_train_function.<locals>.one_step_on_iterator at 0x7f50d42bce00> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
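For illustration, the anti-pattern that case (1) in the warning refers to is roughly the following; this is a generic sketch, not code from the repro. Creating a new tf.function inside the loop forces a fresh trace on every iteration, whereas defining it once outside lets the traced graph be reused.

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0]])

# Anti-pattern: a new tf.function is created (and traced) on every iteration,
# which triggers the retracing warning after a few calls.
for _ in range(5):
    @tf.function
    def step(v):
        return v * 2.0
    step(x)

# Fix: define the tf.function once, outside the loop, so the trace is reused.
@tf.function
def step_once(v):
    return v * 2.0

for _ in range(5):
    step_once(x)
```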

When I switch to eager execution (which skips tf.function), the memory usage grows much slower:

(Attached image: memory-usage plot showing much slower growth under eager execution.)

@alxhoff can you try re-running with model.compile(..., run_eagerly=True) and see if that helps?
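A minimal sketch of that change; the model, optimizer, loss, and data below are placeholders, and the only relevant part is the run_eagerly flag:

```python
import numpy as np
import keras

model = keras.Sequential([keras.layers.Dense(10)])
model.compile(
    optimizer="adam",
    loss="mse",
    run_eagerly=True,  # run each train step eagerly instead of wrapping it in a tf.function
)
model.fit(np.random.rand(8, 4), np.random.rand(8, 10), epochs=1, verbose=0)
```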