keras-team / keras

High memory consumption with model.fit in TF 2.15 + Keras 3.0.2 #19071

Open mrtnlowry opened 8 months ago

mrtnlowry commented 8 months ago

Raising this issue again here as it still seems to be present in the current code base.

The data in my model is tiny (<100MB), yet when I try to train the model it very quickly uses ALL 32GB of memory on my device. This can happen even in the first epoch, at which point Python crashes. Sometimes GC kicks in and relieves the problem for a while, but eventually OOM happens. The model is compiled with run_eagerly=True, has a custom loss function, and uses Adam as the optimizer. I've included the MemoryUsageCallback suggested in the original issue to show when the "leak" occurs. The memory being requested amounts to a tensor of shape 2048 x 8381 with dtype=float32, i.e. roughly 64MB, so not huge.
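
For reference, the callback is essentially a print of the process's resident memory at epoch boundaries; a minimal sketch (using psutil, and not necessarily identical to the version from the original issue) looks like this:

```python
import os

import psutil
import keras


class MemoryUsageCallback(keras.callbacks.Callback):
    """Print the resident memory of the current process at epoch boundaries (sketch)."""

    def _report(self, label):
        rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
        print(f"Memory usage {label}: {rss_gb}GB")

    def on_epoch_begin(self, epoch, logs=None):
        self._report("on epoch begin")

    def on_epoch_end(self, epoch, logs=None):
        self._report("on epoch end")
```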

Here's an example run:

D:\Data\Python\main-tf>python train_models.py

Memory usage before data generation: 0.2597808837890625GB
  Loading  data/train/geo_prior_train_SA_only.csv
  Number of unique classes 8381
  subsampling (up to) 1000 per class for the training set
  final training set size: 829215
  Shuffling data ...
Memory usage after data generation: 0.3338737487792969GB

Memory usage before model generation: 0.3338737487792969GB
Memory usage  after model generation: 0.35327911376953125GB
**Epoch 0**
Memory usage on epoch begin: 0.35416412353515625GB
Epoch 1/6
405/405 ━━━━━━━━━━━━━━━━━━━━ 0s 913ms/step - loss: 2.5723
Shuffling data ...
Memory usage on epoch end:   24.248931884765625GB

**Epoch 1**
Memory usage on epoch begin: 24.249061584472656GB
Epoch 2/6
227/405 ━━━━━━━━━━━━━━━━━━━━ 2:44 924ms/step - loss: 2.8133

2024-01-19 12:32:59.494962: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:212 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[2048,8381] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
Traceback (most recent call last):
  File "D:\Data\Python\main-tf\train_models.py", line 101, in <module>
    model.fit(train_dataset, batch_size=params['batch_size'], epochs=params['num_epochs'], callbacks=mem_use_callback)
  File "C:\Users\Martin\anaconda3\envs\inat_geomodel\Lib\site-packages\keras\src\utils\traceback_utils.py", line 123, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "D:\Data\Python\main-tf\models.py", line 145, in compute_loss
    loss_pos = ops.scatter_update(loss_pos, indices, values)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.ResourceExhaustedError: {{function_node __wrapped__TensorScatterUpdate_device_/job:localhost/replica:0/task:0/device:CPU:0}} OOM when allocating tensor with shape[2048,8381] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:TensorScatterUpdate] name:
2024-01-19 12:33:00.159349: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]
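
To clarify what is being allocated: the failing line writes per-example values into a dense [batch_size, num_classes] float32 tensor via keras.ops.scatter_update. A stripped-down sketch of the pattern (placeholder indices and values only, not my actual compute_loss from models.py) is:

```python
from keras import ops

batch_size, num_classes = 2048, 8381

# A dense per-example, per-class tensor: 2048 * 8381 * 4 bytes ≈ 64 MB per call.
loss_pos = ops.zeros((batch_size, num_classes), dtype="float32")

# One (row, class) index pair per example; zeros stand in for the true class indices.
rows = ops.arange(batch_size, dtype="int64")
cols = ops.zeros((batch_size,), dtype="int64")
indices = ops.stack([rows, cols], axis=1)           # shape (batch_size, 2)
values = ops.ones((batch_size,), dtype="float32")   # stand-in for per-example loss terms

# This is the call that raises RESOURCE_EXHAUSTED when run eagerly.
loss_pos = ops.scatter_update(loss_pos, indices, values)
```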

Hope someone can find the cause.

sachinprasadhs commented 8 months ago

@mrtnlowry, could you please provide the Keras 3 code you are using so we can reproduce the reported behavior? Also, are you using TensorFlow 2.15 only as the backend, or do you also use it directly in any part of the code?

mrtnlowry commented 8 months ago

Apologies for the delay. It took a little time to linearise the code for easier reading. It now runs the same model but with synthesized data, and it uses only TensorFlow 2.15 as the backend. Unfortunately it doesn't actually learn, but that's another problem I need to solve.

Linearised4GitHub.zip

bcnichols commented 8 months ago

I've encountered the identical issue, apparently confined to the call to model.fit() in the replay() function in agent.py at https://github.com/DeepNeuralAI/RL-DeepQLearning-Trading/blob/master/src/.

replay() is called from line 80 in methods.py in the same folder and the app runs without issue if the call is commented out.

Updated: replacing the Keras model class in the DDQN project with a makeshift version of my own (i.e., one that doesn't use the Keras API) seems, for the moment, to resolve the apparent memory leak, but it's early days.

bcnichols commented 8 months ago

Updated to add: I'm now satisfied that the "issue" I encountered is not a bug, and in particular not a leak, but a documented feature:

https://www.tensorflow.org/api_docs/python/tf/keras/backend/clear_session

For large jobs where only the weights need to be preserved, a workaround is to periodically save the model to disk, clear the Keras session, reload the saved model, and resume, as in the sketch below.
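
A minimal sketch of that loop (assuming an already compiled model and a train_dataset; the chunk sizes and checkpoint path are placeholders):

```python
import keras

num_chunks = 4        # placeholder: split the total training run into chunks
epochs_per_chunk = 2  # placeholder: epochs per chunk

for _ in range(num_chunks):
    model.fit(train_dataset, epochs=epochs_per_chunk)
    model.save("checkpoint.keras")            # persist weights (and optimizer state)
    keras.backend.clear_session()             # drop accumulated session/graph state
    model = keras.models.load_model("checkpoint.keras")  # resume from the checkpoint
```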

mrtnlowry commented 8 months ago

Just to confirm: I still consider the issue in my case to be a memory leak. I admit some of the code I uploaded is quite inefficient, but it still runs to completion in <15min and in <1GB of memory when run_eagerly=False. With run_eagerly=True it never completes, as it rapidly runs out of memory. Running on Windows 11 and monitoring with Task Manager, it's easy to see memory being requested with each batch but never released.
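
For reference, the only difference between the two runs is the run_eagerly flag passed at compile time; schematically (with a stand-in model and a built-in loss in place of my actual ones):

```python
import keras
from keras import layers

# Stand-in model and loss; the real ones live in models.py.
model = keras.Sequential([layers.Dense(8381, activation="softmax")])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # stand-in for the custom loss
    run_eagerly=False,                       # True reproduces the unbounded memory growth
)
```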