tensorflow / tensorflow

Potential memory leak with SymbolicTensor #62783

Open alxhoff opened 5 months ago

alxhoff commented 5 months ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

2.15

Custom code

Yes

OS platform and distribution

Manjaro Linux

Mobile device

No response

Python version

3.11.5

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

12.3.52

GPU model and memory

No response

Current behavior?

Hello,

I am working on a TinyML NAS framework, so over the course of a run hundreds, if not thousands, of models are created and trained. I have run into a problem that starves my system of memory after a couple of days of execution. Using tracemalloc, I can see that the main contributor appears to be the symbolic tensors created when a Conv2D layer is built. Maybe I am missing something basic about garbage collection in my code, but over time the demo code below eventually consumes all system memory.

I have also tried tf.keras.backend.clear_session() and gc.collect(), but neither helps.

Any help would be appreciated.

Cheers

Standalone code to reproduce the issue

import gc
import linecache
import os
import sys
import tracemalloc
import numpy as np
import tensorflow as tf
from tensorflow import keras

EPOCHS = 5
BS = 512
TEST_LOOPS = 10000

def start_tracemalloc():
    tracemalloc.start()

def display_top(snapshot, key_type="lineno", limit=5):
    snapshot = snapshot.filter_traces(
        (
            tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
            tracemalloc.Filter(False, "<unknown>"),
        )
    )
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print(
            "#%s: %s:%s: %.1f KiB" % (index, filename, frame.lineno, stat.size / 1024)
        )
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print("    %s" % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))

def display_snapshot():
    snapshot = tracemalloc.take_snapshot()
    display_top(snapshot)

def create_model() -> keras.models.Model:
    inputs = keras.layers.Input(shape=(28, 28, 1))
    x = keras.layers.Conv2D(32, kernel_size=(3, 3), padding="valid")(inputs)
    x = keras.layers.MaxPooling2D(pool_size=(2, 2), strides=None)(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(128, activation=tf.nn.relu)(x)
    x = keras.layers.Dropout(0.2)(x)
    outputs = keras.layers.Dense(10, activation=tf.nn.softmax)(x)

    model = keras.models.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    return model

def train_model(model, train_images, train_labels):
    model.fit(train_images, train_labels, epochs=EPOCHS, batch_size=BS)

def main() -> int:
    mnist = keras.datasets.mnist
    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()

    train_images = train_images.reshape(train_images.shape[0], train_images.shape[1], train_images.shape[2], 1)
    train_images = train_images.astype(np.float32) / 255.0

    start_tracemalloc()

    for i in range(TEST_LOOPS):
        tf.keras.backend.clear_session()
        gc.collect()

        model = create_model()
        train_model(model, train_images, train_labels)
        display_snapshot()
        del model

    return 0

if __name__ == '__main__':
    sys.exit(main())

Relevant log output

Epoch 1/5
...
Top 5 lines
#1: <frozen abc>:123: 1856.2 KiB
#2: python3.11/linecache.py:137: 606.4 KiB
    lines = fp.readlines()
#3: framework/ops.py:245: 197.9 KiB
    return pywrap_tf_session.PyTensor.__new__(
#4: <frozen importlib._bootstrap_external>:729: 158.6 KiB
#5: framework/ops.py:1211: 137.3 KiB
    self._gradient_function = None
952 other: 1407.6 KiB
Total allocated size: 4364.1 KiB
Epoch 1/5
...
Top 5 lines
#1: <frozen abc>:123: 1846.9 KiB
#2: python3.11/linecache.py:137: 620.8 KiB
    lines = fp.readlines()
#3: framework/ops.py:245: 382.3 KiB
    return pywrap_tf_session.PyTensor.__new__(
#4: framework/ops.py:1211: 264.9 KiB
    self._gradient_function = None
#5: framework/ops.py:1161: 254.7 KiB
    self = Operation(c_op, SymbolicTensor)
947 other: 2160.5 KiB
Total allocated size: 5530.0 KiB
Epoch 1/5
...
Top 5 lines
#1: <frozen abc>:123: 1843.9 KiB
#2: python3.11/linecache.py:137: 620.8 KiB
    lines = fp.readlines()
#3: framework/ops.py:245: 568.5 KiB
    return pywrap_tf_session.PyTensor.__new__(
#4: framework/ops.py:1211: 394.1 KiB
    self._gradient_function = None
#5: framework/ops.py:1161: 378.9 KiB
    self = Operation(c_op, SymbolicTensor)
949 other: 2702.3 KiB
Total allocated size: 6508.5 KiB

sushreebarsa commented 5 months ago

@SuryanarayanaY I was able to replicate the issue reported here. Thank you!

SuryanarayanaY commented 5 months ago

This is being discussed on the Keras side in #19058.

alxhoff commented 5 months ago

@SuryanarayanaY I posted a similar report on Keras because I was not sure whether this is a TF problem or a Keras problem.

alxhoff commented 5 months ago

Any news @SuryanarayanaY? I just have a research paper waiting on this bugfix >.<