keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.

Memory leak when using tf.keras.Model and Model.fit() in a loop. clear_session() does not help #286

Open alessiomora opened 1 year ago

alessiomora commented 1 year ago

System information.

Describe the problem. Memory usage steadily increases when a tf.keras.Model and Model.fit() are used in a loop, eventually saturating memory and raising an Out Of Memory exception. clear_session() does not help. The same code with TF version 2.9.2 has almost constant memory usage and works as expected.


Describe the current behavior. Memory usage steadily increases when a tf.keras.Model and Model.fit() are used in a loop, eventually saturating memory and raising an Out Of Memory exception.

Describe the expected behavior. Memory usage remains roughly constant across iterations.

Standalone code to reproduce the issue.

import tensorflow as tf
import time

class MyModel(tf.keras.Model):

  def __init__(self):
    super().__init__()
    self.dense1 = tf.keras.layers.Dense(1000, activation=tf.nn.relu)
    self.dense2 = tf.keras.layers.Dense(10000, activation=tf.nn.softmax)
    self.dense3 = tf.keras.layers.Dense(10000, activation=tf.nn.softmax)
    self.dense4 = tf.keras.layers.Dense(1000, activation=tf.nn.softmax)

  def call(self, inputs):
    x = self.dense1(inputs)
    x = self.dense2(x)
    x = self.dense3(x)
    x = self.dense4(x)
    return x

for r in range(0, 10000):
    model = MyModel()
    ds = tf.data.Dataset.from_tensor_slices((tf.random.uniform((64*4, 1000)), tf.ones((64*4))))
    model.compile(optimizer='sgd', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

    model.fit(ds.batch(64))
    tf.keras.backend.clear_session()
    time.sleep(3)
    print("round: ", r)


sushreebarsa commented 1 year ago

@alessiomora Thanks for reporting this issue. I was able to replicate it on TF v2.11 and the latest nightly, but the issue is not reproducible on TF v2.9. Could you please check the attached gist and confirm the same? Thank you!

alessiomora commented 1 year ago

@sushreebarsa Yes, I confirm. As I highlighted in the issue above, the problem is not present with TF v2.9.2, so the issue is indeed not reproducible on TF v2.9.2. Thank you very much for your help.

rchao commented 1 year ago

@sampathweb does this appear to have a similar symptom to the previous memory leak issues we've seen?

rchao commented 1 year ago

This may be similar to a recent memory leak in evaluation. Just a quick check: if you run 10,000 epochs in a single fit() call instead of looping over Model.fit 10,000 times, do you still see the memory leak?
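For example, something along these lines (reusing the MyModel class and dataset from the original report; the only change is a single fit() call doing all the epochs):

model = MyModel()
ds = tf.data.Dataset.from_tensor_slices((tf.random.uniform((64*4, 1000)), tf.ones((64*4))))
model.compile(optimizer='sgd', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# One fit() call with many epochs instead of re-creating the model and
# calling fit() 10,000 times in a Python loop.
model.fit(ds.batch(64), epochs=10000)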

maxreiss123 commented 1 year ago

Hey, I'm currently dealing with the same problem. In my case, the memory leak is caused by tf.data.Dataset.

alessiomora commented 1 year ago

Hi all, thank you for your help. Is there a solution to this behaviour?

maxreiss123 commented 1 year ago

Unfortunately, I have not found a solution yet. In my case, I use a workaround based on a batch script: when the Python program terminates, all of its memory is released. So instead of running the for-loop in Python, you can write a for-loop in a batch script that calls the script containing your fit() method. (You just need to find out the maximum number of iterations the leak allows before it crashes your program.)
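The same effect can be sketched with a small Python driver instead of a batch file; train_once.py here is a hypothetical script that builds the model, runs a single fit(), and exits:

import subprocess

# Each run happens in a fresh Python process, so all memory is released
# when that process exits; only this small driver loop stays alive.
for r in range(10000):
    subprocess.run(["python", "train_once.py"], check=True)
    print("round: ", r)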

alessiomora commented 1 year ago

Hi @sushreebarsa, thank you for your help. Any news on the issue? I do believe that the memory leak is caused by model.fit().

Thanks.

Metcoler commented 1 year ago

Hey @alessiomora, I had a similar issue. I wanted to call model.fit() inside a loop because my dataset was too large. What worked for me was to do a little cleanup with Python's del and gc.collect(), combined with tf.keras.backend.clear_session().

I am using TensorFlow 2.12 on Windows under WSL2, as recommended in: https://www.tensorflow.org/install/pip

My code:

import gc
import tensorflow as tf

# Build and compile your model just once, before the loop

for i in range(1, 1001):
    # Get the next part of your dataset
    input_set, output_set = DataSet.get_training_data()

    # Train the model with your batch_size and num_epochs
    history = model.fit(input_set, output_set,
                        validation_data=(input_set, output_set),
                        epochs=num_epochs, batch_size=batch_size)

    # Clean up memory after use
    del history
    del input_set
    del output_set
    tf.keras.backend.clear_session()
    gc.collect()  # force a garbage-collection pass

    # Save a checkpoint once in a while
    if i % 100 == 0:
        model.save(f"./checkpoints/AI_checkpoint_{i//60}.h5")

Before I added the memory-cleanup section, the program used to run two or three iterations and then crash because memory was full (sometimes system RAM, sometimes GPU memory). This worked for me; I hope it works for you too.

alessiomora commented 1 year ago

Hi @Metcoler, thank you for the suggestion. However, the memory still seems to steadily increase, and eventually an OOM appears. I am sure there is a problem in the .fit() implementation, as reported in other GitHub issues and Stack Overflow questions.

DanielWicz commented 1 year ago

This may be related to:

https://github.com/tensorflow/tensorflow/issues/50765

miguelalba96 commented 1 year ago

I'm facing similar issues with the GCP pre-built container image for TF 2.12 GPU (europe-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest), on a system with 32 vCPUs, 208 GB of RAM and 4 NVIDIA Tesla V100s. This is the chart of RAM usage:

[Screenshot: RAM usage chart, 2023-09-11]

The spikes are the moments when validation is performed. My data pipeline consists of loading multiple TFRecords with images and labels; record sizes range from 300 MB to at most 1.8 GB.

Unfortunately I cannot disclose my full code, but this is the order of my tf.data.Dataset operations:

AUTOTUNE = tf.data.AUTOTUNE

dataset = tf.data.Dataset.list_files(tfrecord_files, shuffle=True)
# cycle_length here is 100 for the training dataset, None for validation;
# the parser loads the TFRecords and parses their content
dataset = dataset.interleave(parser.get_parsed_examples(fn),
                             cycle_length=cycle_length,
                             num_parallel_calls=AUTOTUNE)
dataset = dataset.repeat()
dataset = dataset.batch(batch_size, drop_remainder=True)
# transformation function that normalizes the data
dataset = dataset.map(transform_func, num_parallel_calls=AUTOTUNE)
dataset = dataset.prefetch(AUTOTUNE)

Then on model.fit (which I call only once, not in a loop), you can see on the RAM chart that the training intervals are memory efficient (the dataset is huge, yet memory stays constant and drops at the end of consumption). However, there are validation spikes that increase exponentially in memory. Any hints or ideas about what might be happening?

I already tried cleaning up memory with gc.collect() at the end of validation, using a callback's on_test_end.
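Roughly like this (a sketch; the class name is just a placeholder, and train_ds, val_ds and num_epochs stand for my actual datasets and settings):

import gc
import tensorflow as tf

class GcAfterValidation(tf.keras.callbacks.Callback):
    # Force a garbage-collection pass once evaluation/validation ends.
    def on_test_end(self, logs=None):
        gc.collect()

model.fit(train_ds, validation_data=val_ds, epochs=num_epochs,
          callbacks=[GcAfterValidation()])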

thank you for your help!

binczakmartin commented 1 year ago

I had switched from TensorFlow.js to Python; this made me switch back to JavaScript 😅

AndreasKarasenko commented 6 months ago

Is there an update on this? This problem still appears in tf 2.14 and has been reported many times.

DushyantSahoo commented 4 months ago

Any update on this?

gff77 commented 3 months ago

any updates?

ghsanti commented 2 months ago

tf.data.Dataset does not seem to be the source of the leak; this code has no issues:

import tensorflow as tf

d = 128 * 4
for r in range(0, 10000):
    ds = tf.data.Dataset.from_tensor_slices((tf.random.uniform((d, 1000)), tf.ones((d))))
    ds.batch(64)

Rather, it seems that, the way the OP wrote the code, new weights are generated on every iteration without the old ones being garbage collected. The memory increases by a multiple of the number of weights on each iteration.

As further evidence, one can train on plain NumPy arrays instead of a tf.data.Dataset:

x = np.random.standard_normal((64,1000))
y = np.ones((64,))

and observe the same behaviour.
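That is, a variant of the OP's loop with the tf.data pipeline swapped for those arrays (a sketch; MyModel is the class from the original report) still shows the leak:

import numpy as np
import tensorflow as tf

x = np.random.standard_normal((64, 1000))
y = np.ones((64,))

for r in range(0, 10000):
    model = MyModel()  # same model class as in the original report
    model.compile(optimizer='sgd', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.fit(x, y, batch_size=64)
    tf.keras.backend.clear_session()
    print("round: ", r)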

The same happens with Sequential; the number of weights and everything else is the same as expected, and in both cases there is a memory leak.

Apparently, it's a TensorFlow issue (see the link just below this comment).