keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

High memory consumption with model.fit in TF 2.x #15887

Closed jvishnuvardhan closed 11 months ago

jvishnuvardhan commented 2 years ago

Moved from the TensorFlow repository: https://github.com/tensorflow/tensorflow/issues/40942

@gdudziuk opened this issue in the TF repo.

System information

Have I written custom code: Yes
OS Platform and Distribution: CentOS Linux 7
Mobile device: Not verified on mobile devices
TensorFlow installed from: binary, via pip install tf-nightly
TensorFlow version: 2.5.0-dev20200626
Python version: 3.6.8
CUDA/cuDNN version: 10.1 / 7
GPU model and memory: Tesla V100 32 GB

Describe the current behavior

Model training with the Keras API consumes a high amount of system memory. The memory used by model.fit appears to be proportional to the size of the training data provided as numpy arrays, with a proportionality constant of approximately 1. In other words, if the numpy arrays x and y are, say, 8 GB in total, then model.fit(x,y,...) will use another 8 GB (plus some overhead), so the total memory usage of model.fit is twice the data size plus overhead.

The same applies to the validation data. If validation data are passed to model.fit as numpy arrays via the validation_data argument, model.fit appears to use additional memory equal to the size of the validation arrays.

The described effect is also present if I wrap the numpy arrays containing the data in TF Datasets.

In the code attached below, one may change the variable K to vary the size of the data and test the behavior described above. It is straightforward to estimate the data size (e.g. with K=5000 the data arrays in the code below should be ca. 7.32 GB in total). The whole Python process running this code uses approximately twice that much RAM, plus some overhead independent of the data size. One may comment out the line containing model.fit to confirm that this is the point at which the high memory consumption starts.
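
For orientation, the expected size of the data arrays can be computed directly. A quick sketch (the shapes and dtypes match the test script below):

import numpy as np

K, N = 5000, 512
# x arrays are uint16 (2 bytes per pixel), y arrays are bool (1 byte per pixel)
bytes_per_image_pair = N * N * (np.dtype(np.uint16).itemsize + np.dtype(bool).itemsize)
total_bytes = 2 * K * bytes_per_image_pair  # training + validation
print('{:.2f} GiB'.format(total_bytes / 2**30))  # ~7.32 GiB for K=5000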

Describe the expected behavior

It would be reasonable to expect the memory usage of the test code to be approximately the data size plus some overhead independent of the data size (not twice the data size plus overhead).

A bit of history

This is a continuation of issue #35030, concerning TF 2.0 and 2.1. I opened that issue in December 2019, and now @karmel has stated that it is very long and asked me to test whether the problem persists in TF-nightly and to open a new issue if necessary. So yes, the problem persists, and here I open a new issue.

The problem first appeared in the release 2.0.0-rc0. In earlier releases, up to and including 2.0.0-b1, the memory usage of the test code below was ca. the size of the data arrays plus an overhead independent of the data size. Starting from 2.0.0-rc0 it became twice the data size plus overhead, and this remained true at least until 2.1.0.

Next, in 2.2.0, the situation changed a bit:

- When using numpy arrays to pass data to model.fit, there was a memory leak of about 0.5 x the data size per epoch. In other words, if the data arrays were ca. 8 GB in total, the memory usage grew by ca. 4 GB each epoch.
- When the data arrays were wrapped in TF Datasets and then passed to model.fit, the behavior in TF 2.2 was the same as in 2.1 and 2.0, namely the memory usage was twice the data size plus overhead.

Now, in the nightly release 2.5.0-dev20200626, we are back to the previous situation: the memory usage is twice the data size plus overhead, regardless of whether numpy arrays or Datasets are used to pass the data to model.fit.

An important note on reproducibility

The issue has turned out not to be reproducible in Colab! In #35030 I reported the issue for my local machine, and some other participants also managed to reproduce it locally, but those who tried to reproduce it in Colab had no success. Similarly, the results I report now are not from Colab.

Also, for some reason the issue cannot be captured when using libmemusage.so to measure the memory usage. To capture the issue, I use ps au in a Linux terminal or the Python module psutil.
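
Concretely, the psutil measurement boils down to reading the resident set size (RSS) of the current process. A minimal sketch (the same call is used in the callback code further below):

import os
import psutil

rss = psutil.Process(os.getpid()).memory_info().rss  # resident set size in bytes
print('RSS: {:.2f} GiB'.format(rss / 2**30))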

Standalone code to reproduce the issue

Since this issue is in fact a continuation of #35030, I use the same test code here.

import tensorflow as tf
import numpy as np

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Lambda, Conv2D

print("Tensorflow version: {}".format(tf.__version__),flush=True)

K = 5000 # Number of images
N = 512  # Image size

MAX_SIGNAL = 5000 # The values of the training data range from 0 to this

def build_model():
  '''Create a simple test model.'''

  inputs = Input((N,N,1))
  s = Lambda(lambda x: x / MAX_SIGNAL) (inputs)
  s = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(s)
  outputs = s

  return Model(inputs=[inputs], outputs=[outputs])

# Generate some random data
x_train = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_train = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
x_val   = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_val   = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
# In total, the above arrays should be 7 680 000 kB

model = build_model()

optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.BinaryCrossentropy()

model.compile(optimizer=optimizer, loss=loss)
model.fit(x=x_train, y=y_train, validation_data=(x_val,y_val), batch_size=8, epochs=10)
The above is meant to reproduce the issue with data passed to model.fit as numpy arrays. To test the behavior with TF datasets, replace the last line with the following:

ds_train = tf.data.Dataset.from_tensor_slices((x_train,y_train)).batch(8)
ds_val = tf.data.Dataset.from_tensor_slices((x_val,y_val)).batch(8)
model.fit(ds_train, validation_data=ds_val, epochs=10)
gdudziuk commented 2 years ago

Thank you very much for doing this. Let me also post the slightly modified test code with built-in memory measurements, which may be more convenient:

import tensorflow as tf
import numpy as np
import psutil
import os

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Lambda, Conv2D
from tensorflow.keras.callbacks import Callback

print("Tensorflow version: {}".format(tf.__version__),flush=True)

K = 5000 # Number of images
N = 512  # Image size

MAX_SIGNAL = 5000 # The values of the training data range from 0 to this

class MemoryUsageCallback(Callback):
  '''Monitor memory usage on epoch begin and end.'''

  def on_epoch_begin(self,epoch,logs=None):
    print('**Epoch {}**'.format(epoch))
    print('Memory usage on epoch begin: {}'.format(psutil.Process(os.getpid()).memory_info().rss))

  def on_epoch_end(self,epoch,logs=None):
    print('Memory usage on epoch end:   {}'.format(psutil.Process(os.getpid()).memory_info().rss))

def build_model():
  '''Create a simple test model.'''

  inputs = Input((N,N,1))
  s = Lambda(lambda x: x / MAX_SIGNAL) (inputs)
  s = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(s)
  outputs = s

  return Model(inputs=[inputs], outputs=[outputs])

# Generate some random data
x_train = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_train = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
x_val   = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_val   = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
# In total, the above arrays should be 7 680 000 kB

model = build_model()

callbacks = [MemoryUsageCallback()]
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.BinaryCrossentropy()

model.compile(optimizer=optimizer, loss=loss)
model.fit(x=x_train, y=y_train, validation_data=(x_val,y_val), batch_size=8, epochs=10, callbacks=callbacks, verbose=0)

The above is meant to reproduce the issue with data passed to model.fit as numpy arrays. To test the behavior with TF datasets, replace the last line with the following:

ds_train = tf.data.Dataset.from_tensor_slices((x_train,y_train)).batch(8)
ds_val = tf.data.Dataset.from_tensor_slices((x_val,y_val)).batch(8)
model.fit(ds_train, validation_data=ds_val, epochs=10, callbacks=callbacks, verbose=0)
gdudziuk commented 2 years ago

Also, for anybody investigating the root of the problem: be sure to check issue tensorflow/tensorflow#35030, where @mihaimaruseac has tracked down the point at which the bug was introduced.

qlzh727 commented 2 years ago

Triage notes: This is a long-standing performance issue, and we should have someone look into this.

rchao commented 2 years ago

Hello @gdudziuk, it'd be helpful if you could help us with the following: could you confirm whether this is a regression from previous versions, and whether the issue also occurs with a custom training loop?

Thanks!

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

gdudziuk commented 2 years ago

Not stale. I will try the custom training loop next week.

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

gdudziuk commented 2 years ago

I have checked that the issue is still there in TF 2.8 with Python 3.8.

gdudziuk commented 2 years ago

Now, let me answer @rchao's questions.

First, yes, this is a regression from previous versions. As stated in the initial post of this issue, the problem first occurred in 2.0.0-rc0. In TF 1.x and the pre-releases 2.0.0-a0, 2.0.0-b0 and 2.0.0-b1 everything was fine.

Interestingly, in TF 2.2 the situation changed a bit for numpy-based data, but not for TF Datasets (see the section "A bit of history" in the initial post for details), only to revert to the previous bad behavior in the following releases. This suggests that the relevant code is occasionally touched during regular development.

Note: If you are tired of reading the lengthy initial post, try reading the initial post of tensorflow/tensorflow#40942. It is the same but retains some formatting that was lost during the copy-paste.

Please also be sure to check tensorflow/tensorflow#35030, in particular this answer https://github.com/tensorflow/tensorflow/issues/35030#issuecomment-571835841 and this one https://github.com/tensorflow/tensorflow/issues/35030#issuecomment-571845119. Therein @mihaimaruseac tracks down the moment at which the bug was introduced, his main conclusions being:

So, it looks like there was one memory consumption error introduced between 2019/08/02 and 2019/08/03 (for about 4GB) and another one somewhere in between 2019/08/15 and 2019/09/01

and

The second memory increase happens on 2019/09/16. It seems in August we introduced both of these bugs.

gdudziuk commented 2 years ago

Second, the custom training loop. I had never used custom training loops in TF before, but it turned out to be quite straightforward to adapt the examples from the official tutorials.

So yes, unfortunately, the issue also occurs with a custom training loop. The memory usage is slightly lower than with model.fit, but it looks like only the general overhead is smaller; the core problem, that the memory usage is about 2x the data size, is still present.

gdudziuk commented 2 years ago

The code with the custom training loop:

import tensorflow as tf
import numpy as np
import psutil
import os

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Lambda, Conv2D
from tensorflow.keras.callbacks import Callback

print("Tensorflow version: {}".format(tf.__version__),flush=True)

K = 5000 # Number of images
N = 512  # Image size

MAX_SIGNAL = 5000 # The values of the training data range from 0 to this

class MemoryUsageCallback(Callback):
  '''Monitor memory usage on epoch begin and end.'''

  def on_epoch_begin(self,epoch,logs=None):
    print('**Epoch {}**'.format(epoch))
    print('Memory usage on epoch begin: {}'.format(psutil.Process(os.getpid()).memory_info().rss))

  def on_epoch_end(self,epoch,logs=None):
    print('Memory usage on epoch end:   {}'.format(psutil.Process(os.getpid()).memory_info().rss))

def build_model():
  '''Create a simple test model.'''

  inputs = Input((N,N,1))
  s = Lambda(lambda x: x / MAX_SIGNAL) (inputs)
  s = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(s)
  outputs = s

  return Model(inputs=[inputs], outputs=[outputs])

# Generate some random data
x_train = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_train = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
x_val   = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_val   = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
# In total, the above arrays should be 7 680 000 kB

model = build_model()

callbacks = [MemoryUsageCallback()]
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.BinaryCrossentropy()

ds_train = tf.data.Dataset.from_tensor_slices((x_train,y_train)).batch(8)
ds_val = tf.data.Dataset.from_tensor_slices((x_val,y_val)).batch(8)

epochs = 10
for epoch in range(epochs):

    for callback in callbacks:
        callback.on_epoch_begin(epoch)

    for step, (x_batch_train, y_batch_train) in enumerate(ds_train):

        with tf.GradientTape() as tape:
            logits = model(x_batch_train, training=True)
            loss_value = loss(y_batch_train, logits)

        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

    for callback in callbacks:
        callback.on_epoch_end(epoch)
gdudziuk commented 2 years ago

And the output is below (TF 2.8, Python 3.8). The memory usage is slightly lower than with model.fit (compare to the results posted in tensorflow/tensorflow#40942) but still ca. 2x the data size.

**Epoch 0**
Memory usage on epoch begin: 16050802688
Memory usage on epoch end:   16578965504
**Epoch 1**
Memory usage on epoch begin: 16578965504
Memory usage on epoch end:   16533483520
**Epoch 2**
Memory usage on epoch begin: 16533483520
Memory usage on epoch end:   16560668672
**Epoch 3**
Memory usage on epoch begin: 16560668672
Memory usage on epoch end:   16601567232
**Epoch 4**
Memory usage on epoch begin: 16601567232
Memory usage on epoch end:   16568754176
**Epoch 5**
Memory usage on epoch begin: 16568754176
Memory usage on epoch end:   16574918656
**Epoch 6**
Memory usage on epoch begin: 16574918656
Memory usage on epoch end:   16541495296
**Epoch 7**
Memory usage on epoch begin: 16541495296
Memory usage on epoch end:   16532393984
**Epoch 8**
Memory usage on epoch begin: 16532393984
Memory usage on epoch end:   16557969408
**Epoch 9**
Memory usage on epoch begin: 16557969408
Memory usage on epoch end:   16557981696
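
For scale: the four data arrays total 7 680 000 kB, i.e. ca. 7.3 GiB, so an RSS of roughly 16.5 GB (ca. 15.4 GiB) corresponds to about 2 x 7.3 GiB plus ca. 0.8 GiB of overhead.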
google-ml-butler[bot] commented 2 years ago

Closing as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] commented 2 years ago

Are you satisfied with the resolution of your issue?

gdudziuk commented 2 years ago

How is it possible that this issue has been marked as stale? I have answered @rchao's questions and was waiting for a response.

jvishnuvardhan commented 2 years ago

Reopened it. Sorry for the inconvenience. Removed the label. Thanks

kingmacth commented 2 years ago

Has anyone continued to update this issue? I also encountered this problem.

iancolwell commented 1 year ago

Has there been any resolution/progress/update to this issue?

deansher commented 1 year ago

I am struggling with an excessive-memory-use problem in Model.fit that fits well with this one. It appears that the total size of the Dataset I process in Model.fit drives total CPU memory consumption, even though both the training and validation Datasets should be scanned sequentially beyond the modest prefetch and shuffle buffers.
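
For reference, the kind of input pipeline I mean is roughly the following (a simplified sketch, not my actual code; x_train and y_train stand for the training arrays from the snippets above, and the buffer sizes are illustrative):

import tensorflow as tf

# Only the shuffle and prefetch buffers should need to stay resident, not the whole dataset.
ds_train = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(1000)
            .batch(8)
            .prefetch(tf.data.AUTOTUNE))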

Although various TensorFlow team members have put substantial effort into this issue and its predecessors, it seems as though, overall, TensorFlow lives with it as either unsolvable or not that important?

nilsgumpfer commented 1 year ago

A possible workaround that was suitable in my case was to add a gc.collect() call to the on_epoch_end() of a custom callback, such as the one above:

import gc
import os
import psutil

from tensorflow.keras.callbacks import Callback

class MemoryUsageCallbackExtended(Callback):
  '''Monitor memory usage on epoch begin and end, collect garbage'''

  def on_epoch_begin(self,epoch,logs=None):
    print('**Epoch {}**'.format(epoch))
    print('Memory usage on epoch begin: {}'.format(psutil.Process(os.getpid()).memory_info().rss))

  def on_epoch_end(self,epoch,logs=None):
    print('Memory usage on epoch end:   {}'.format(psutil.Process(os.getpid()).memory_info().rss))
    gc.collect()
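
For example, it can be attached like this (assuming the model and data from the earlier snippets in this thread):

callbacks = [MemoryUsageCallbackExtended()]
model.fit(x=x_train, y=y_train, validation_data=(x_val,y_val), batch_size=8, epochs=10, callbacks=callbacks, verbose=0)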

Hope this helps until they manage to fix this internally

tommydino93 commented 1 year ago

Facing the same issue with TF 2.4 and Python 3.8.

RAM usage increases linearly across epochs. I tried using gc.collect() as suggested by @nilsgumpfer, but the problem persists.

Any update on this? Thanks in advance!

rchao commented 1 year ago

We have identified a possible root cause of the memory leaks and have informed the core TensorFlow team so they can take a look. In particular, it relates to the feature of running eager ops as functions. If you can build TensorFlow yourself, you can try whether setting this line to False helps in any way. If it doesn't, this may have other causes.

tommydino93 commented 1 year ago

Hi @rchao. Thanks for your answer. I couldn't find that line in my TF 2.4 version. See the screenshot below:

[screenshot: screenshot_tf_24]

tilakrayal commented 11 months ago

Hello, Thank you for reporting an issue.

We're currently in the process of migrating the new Keras 3 code base from keras-team/keras-core to keras-team/keras. Consequently, this issue may not be relevant to the Keras 3 code base. After the migration is successfully completed, feel free to reopen this issue at keras-team/keras if you believe it remains relevant to the Keras 3 code base. If instead this issue is a bug or security issue in legacy tf.keras, you can report a new issue at keras-team/tf-keras, which hosts the TensorFlow-only, legacy version of Keras.

To know more about Keras 3, please take a look at https://keras.io/keras_core/announcement/. Thank you!

google-ml-butler[bot] commented 11 months ago

Are you satisfied with the resolution of your issue?