mrtnlowry opened 8 months ago
@mrtnlowry, could you please provide the Keras 3 code you are running so we can reproduce the reported behavior? Also, are you using TensorFlow 2.15 only as the backend, or do you also use it directly anywhere in the code?
Apologies for the delay. It took a little time to linearise the code for easier reading. It now runs the same model but with synthesized data, and it uses TensorFlow 2.15 only as the backend. Unfortunately it doesn't actually learn, but that's another problem I need to solve.
I've encountered the same issue, apparently confined to the call to model.fit() in the replay() function in file agent.py at https://github.com/DeepNeuralAI/RL-DeepQLearning-Trading/blob/master/src/.
replay() is called from line 80 in methods.py in the same folder, and the app runs without issue if that call is commented out.
Update: replacing the Keras model class in the DDQN project with a makeshift version of my own (i.e., one that doesn't use the Keras API) seems, for the moment, to resolve the apparent memory leak, but it's early days.
Update: I'm now satisfied that the "issue" I encountered is not a bug, and certainly not a leak, but a documented feature:
https://www.tensorflow.org/api_docs/python/tf/keras/backend/clear_session
For large jobs where only the weights need to be preserved, a workaround is to periodically save the model to disk, clear the Keras session, reload the saved model, and resume training.
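A minimal sketch of that save / clear_session / reload loop (the function name, checkpoint path, and interval here are illustrative, not taken from the project's code):

```python
import tensorflow as tf

def fit_with_periodic_reset(model, x, y, epochs, reset_every=10,
                            path="checkpoint.keras"):
    """Train in chunks, clearing Keras global state between chunks."""
    for start in range(0, epochs, reset_every):
        model.fit(x, y, epochs=min(reset_every, epochs - start), verbose=0)
        model.save(path)                  # architecture + weights to disk
        tf.keras.backend.clear_session()  # release graph/global state
        model = tf.keras.models.load_model(path)
    return model
```

If the model is compiled with a custom loss, it would need to be supplied via `custom_objects` (or registered as serializable) when reloading.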
Just to confirm: I still consider the issue in my case to be a memory leak. I admit some of the code I uploaded is quite inefficient, but it still runs to completion in under 15 minutes and under 1 GB of memory with run_eagerly=False. With run_eagerly=True it never completes, as it rapidly runs out of memory. Running on Windows 11 and monitoring with Task Manager, it's easy to see memory being requested with each batch but never released.
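For anyone trying to reproduce: the only difference between the two behaviours above is the run_eagerly flag passed to compile(). A toy skeleton (model and data invented here, not the uploaded code):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(1),
])
# run_eagerly=True executes each batch as ordinary Python rather than a
# compiled tf.function; run_eagerly=False (the default) is the fast path.
model.compile(optimizer="adam", loss="mse", run_eagerly=True)

x = np.random.rand(32, 8).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(x, y, epochs=1, batch_size=8, verbose=0)
```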
Raising this issue again here as it still seems to be present in the current code base.
The data in my model is tiny (<100MB), yet when I try to train the model it very quickly uses ALL 32GB of memory on my device. This can happen even in the first epoch, at which point Python crashes. Sometimes GC kicks in and relieves the problem for a while, but eventually OOM happens. The model is compiled with run_eagerly=True, has a custom loss function, and uses Adam as the optimizer. I've included the MemoryUsageCallback suggested in the original issue to see when the "leak" occurs. The memory being requested each time amounts to a tensor of 2048 * 8381 at dtype=float32 -> roughly 64MB, so not huge.
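For reference, a minimal sketch of such a memory-usage callback (not the exact one from the original issue; it assumes psutil is installed). In real use it would subclass tf.keras.callbacks.Callback so an instance can be passed to model.fit(callbacks=[...]); the measuring logic itself is framework-free:

```python
import os
import psutil  # assumed available; stands in for watching Task Manager

class MemoryUsageCallback:
    """Records the process's resident set size (RSS) after each epoch.

    In real use, derive from tf.keras.callbacks.Callback; the hook
    signature below is the same one Keras invokes during fit().
    """

    def __init__(self):
        self._process = psutil.Process(os.getpid())
        self.history = []  # one MiB reading per epoch

    def on_epoch_end(self, epoch, logs=None):
        rss_mib = self._process.memory_info().rss / 2**20
        self.history.append(rss_mib)
        print(f"epoch {epoch}: {rss_mib:.1f} MiB resident")
```

Plotting (or just eyeballing) `history` across epochs makes it easy to show whether memory is released between batches of training or grows monotonically.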
Here's an example run:
Hope someone can find the cause.