tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

Operations with variable tensor sizes cause GPU Memory leaks #604

Closed justadudewhohacks closed 5 years ago

justadudewhohacks commented 6 years ago

TensorFlow.js version

Browser version

Describe the problem or feature request

Running operations with variable input tensor sizes causes GPU memory leaks (not tracked by tf.memory stats, but can be tracked using chrome task manager for example):

// note: this loop uses `await`, so it must run inside an async function
for (let i = 0; i < iterations; i++) {
  const height = Math.floor(Math.random() * maxTensorSize)
  const width = Math.floor(Math.random() * maxTensorSize)

  console.log(height, width)

  const t1 = tf.ones([height, width])
  const t2 = tf.ones([height, width])

  // do something
  const sum = t1.add(t2)

  t1.dispose()
  t2.dispose()
  sum.dispose()

  await tf.nextFrame()

  console.log(tf.memory())
}

Code to reproduce the bug / link to feature request

https://github.com/justadudewhohacks/tfjs-tensor-size-memoryleak-issue

Lewuathe commented 6 years ago

@justadudewhohacks Thank you so much for the detailed report and sample application.

But in my environment, the memory leak was not observed in the Chrome task manager (screenshot attached).

Even though I ran the application several times, the memory footprint stayed around 100MB. I launched the sample application according to the README and checked the Chrome task manager.

justadudewhohacks commented 6 years ago

Hi @Lewuathe,

Thanks for reviewing this. I should have mentioned that you have to enable the GPU memory column in the task manager (see the attached screenshot):

I could reproduce this on my desktop machine and laptop (both AMD GPUs, latest version of Chrome), on an Intel GPU, as well as on my Android device.

After some time the browser throws an exception saying the WebGL context was lost. Mobile Chrome on Android crashes almost immediately.

Hope this helps.

justadudewhohacks commented 6 years ago

In case someone is facing the same issue: when training an image classifier or an object detector, you can mitigate it by resizing your images to a fixed input size before calling tf.fromPixels, instead of doing tensor operations for padding and resizing:

export function imageToSquare(img: HTMLImageElement | HTMLCanvasElement, inputSize: number): HTMLCanvasElement {

  const dims = img instanceof HTMLImageElement 
    ? { width: img.naturalWidth, height: img.naturalHeight }
    : img 
  const scale = inputSize / Math.max(dims.height, dims.width)
  const width = scale * dims.width
  const height = scale * dims.height

  const targetCanvas = document.createElement('canvas')
  targetCanvas.width = inputSize
  targetCanvas.height = inputSize
  targetCanvas.getContext('2d').drawImage(img, 0, 0, width, height)

  return targetCanvas
}
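
The aspect-ratio math in imageToSquare can be isolated into a small pure helper for testing. This is a sketch: `fitToSquare` is a name introduced here, and rounding is added on top of the original computation so the result is in whole pixels.

```javascript
// Fit a width x height image inside an inputSize x inputSize square,
// preserving aspect ratio (mirrors the scale computation in imageToSquare).
function fitToSquare(width, height, inputSize) {
  const scale = inputSize / Math.max(height, width);
  return {
    width: Math.round(scale * width),   // scaled draw width
    height: Math.round(scale * height), // scaled draw height
  };
}

// e.g. a 640x480 frame fitted into a 512x512 square becomes 512x384,
// leaving the rest of the canvas as padding.
```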

nsthorat commented 5 years ago

Ah yes, this is because we cache textures based on their physical shape, so you are basically getting a cache miss on purpose every single time. We've found that that's usually pretty rare. Resizing to a fixed input size will absolutely fix the problem :)
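
To make that cache-miss behaviour concrete, here is a toy model in plain JavaScript of a texture cache keyed by physical shape. This is an illustration only, not the actual tfjs WebGL backend.

```javascript
// Count how many distinct "textures" a shape-keyed cache would allocate
// for a sequence of tensor shapes.
function simulateCache(shapes) {
  const cache = new Set();
  for (const [h, w] of shapes) cache.add(`${h}x${w}`);
  return cache.size; // number of distinct cache entries
}

// Random shapes (as in the repro loop): almost every iteration misses
// the cache and allocates a fresh texture.
const randomShapes = Array.from({ length: 100 }, () => [
  1 + Math.floor(Math.random() * 512),
  1 + Math.floor(Math.random() * 512),
]);

// Fixed shape: every iteration after the first hits the same entry.
const fixedShapes = Array.from({ length: 100 }, () => [512, 512]);
```

With fixed shapes the cache holds exactly one entry; with random shapes it grows with nearly every iteration, which is the GPU memory growth seen in the task manager.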

Just curious, why was your canvas changing size all the time in practice?

justadudewhohacks commented 5 years ago

Thanks for the clarification, I guess that explains it. You are probably right; in most cases the input size of tensors should be fixed anyway.

I was facing the issue when training models on images which were different in size. I was resizing each image with tf.resizeBilinear, which was causing these memory leaks.

Also, this was an issue in face-api.js, since you first run your images through an object detector, which returns multiple bounding boxes of different sizes, and then extract sub-images from these regions for further classification (for instance face landmark detection or computing a face descriptor). This was also a performance issue resulting in flaky inference times, since I guess the graph was recompiled every time for different input shapes?

However, by now I am using the code snippet I posted above for resizing, which works pretty well. So this is not really an issue from my side anymore.

nsthorat commented 5 years ago

Ah yeah, if the output tensors are variably sized, we'll possibly have to recompile the shaders every time.

If you can provide a simple standalone HTML page that shows the issue we can look into making it faster (it's possible we can do things like upload the shape as a uniform to avoid recompilation).

cc @annxingyuan

rthadur commented 5 years ago

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!

RohanGautam commented 5 years ago

This is happening to me too! For me, I've pinpointed it down to the following line in my code: img = tf.browser.fromPixels(webcamElement); where webcamElement is a frame from the webcam. All the major webcam setup is from Google's example here.

I'm resizing img and passing it through a CNN. Even when I don't resize it with tf.js and change the shape of the webcam frame itself, the issue persists.

The memory leak is clearly the issue, as tf.memory() tells me the memory increases very rapidly.

I have tried:

Let me know if there is any other info you'd like me to provide! I'm kinda stumped about this issue at the moment.

RohanGautam commented 5 years ago

UPDATE: I'm currently able to get it working by calling tf.dispose() on every tensor after I'm done using it. Seems to work for now, but it's kinda janky. Hope it helps someone though!

nsthorat commented 5 years ago

@RohanGautam, do you think you could post the code you are having problems with?

RohanGautam commented 5 years ago

Sure! I've boiled it down to the following minimal code required to reproduce the error. index.html:

<html>

<head>
    <meta charset="UTF-8">
    <title>MemLeak</title>
    <!-- Load the latest version of TensorFlow.js -->
    <script src="https://unpkg.com/@tensorflow/tfjs"></script>
</head>

<body>
    <video autoplay playsinline muted id="webcam" width="250" height="250"></video>
    <!-- Load index.js after the content of the page -->
    <script src="index.js"></script>
</body>

</html>

index.js :

const webcamElement = document.getElementById('webcam');

async function app() {

    await setupWebcam();
    while (true) {
        //!! source of leak !!
        const img = tf.browser.fromPixels(webcamElement);
        // Doing stuff with the image//
        console.log(tf.memory())
        await tf.nextFrame();
    }
}

async function setupWebcam() {
    return new Promise((resolve, reject) => {
        const navigatorAny = navigator;
        navigator.getUserMedia = navigator.getUserMedia ||
            navigatorAny.webkitGetUserMedia || navigatorAny.mozGetUserMedia ||
            navigatorAny.msGetUserMedia;
        if (navigator.getUserMedia) {
            navigator.getUserMedia({ video: true },
                stream => {
                    webcamElement.srcObject = stream;
                    webcamElement.addEventListener('loadeddata', () => resolve(), false);
                },
                error => reject());
        } else {
            reject();
        }
    });
}

app();

nsthorat commented 5 years ago

Ah, so these are regular tf.Tensors, not tf.Variables. Check out the guide here for why this is not a bug: https://www.tensorflow.org/js/guide/tensors_operations#memory
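
The rule from that guide is that tensors allocated inside a tf.tidy callback are disposed when the callback returns, except the tensor it returns. Here is a toy model of that bookkeeping in plain JavaScript; it is a sketch for illustration only, not the real tf.tidy (which tracks tensors automatically, and also supports tf.keep and nested scopes, omitted here).

```javascript
// Simplified model of tf.tidy: everything created ("tracked") inside the
// callback is disposed on exit, except the returned value. `track` is an
// explicit stand-in introduced here for the engine's automatic tracking.
function tidy(fn) {
  const created = [];
  const track = (t) => { created.push(t); return t; };
  const result = fn(track);
  for (const t of created) {
    if (t !== result) t.disposed = true; // intermediates are freed
  }
  return result;
}

// Two mock "tensors": one intermediate, one returned.
const intermediate = { name: 'img', disposed: false };
const output = { name: 'batched', disposed: false };

const kept = tidy((track) => {
  track(intermediate);  // created inside the scope, not returned
  return track(output); // returned, so it survives the scope
});
// intermediate.disposed is now true; kept (=== output) is still alive
```

In real code the equivalent is wrapping the per-frame work: `const batched = tf.tidy(() => tf.browser.fromPixels(webcamElement).expandDims(0));`, which frees the intermediate fromPixels tensor automatically.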

dhasegan commented 4 years ago

@RohanGautam did you get to fix your problem? I am having the same issue and tf.dispose and tf.tidy do not work. My specs:

Browser:

RohanGautam commented 4 years ago

@dhasegan

@RohanGautam did you get to fix your problem? I am having the same issue and tf.dispose and tf.tidy does not work. My specs:

* "@tensorflow/tfjs": "1.7.4"

Browser:

* Chrome Version 81.0.4044.138 (Official Build) (64-bit)

Yeah I did! I was working on it a long while back, so I had to dig up the archives. It was fixed in this commit in my personal project.

Basically it involved disposing everything, including intermediate products of the computation.

const img = tf.browser.fromPixels(webcamElement);
const resizedImg = tf.image.resizeBilinear(img, [150, 150]);
const batchedImage = resizedImg.expandDims(0);
// -------- disposing intermediate products too -------------
tf.dispose(img);
tf.dispose(resizedImg);
tf.dispose(batchedImage);
console.log(tf.memory());

But you say dispose didn't work for you :/ I'd suggest console.log(tf.memory())-ing in your intermediate steps and narrowing down where it's happening.

dhasegan commented 4 years ago

tf.memory() is not increasing for me. My input has varying sizes as well, and for each new size a new shader is created and cached in the TFJS library: https://github.com/tensorflow/tfjs/issues/3061

There is no cache purge, so it slowly accumulates GPU memory (as seen in the Chrome Task Manager). You might hit a similar issue if you have different sizes for webcamElement.
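
For the webcam case specifically, one way to keep the fromPixels input shape constant is to pin the capture resolution via getUserMedia constraints. This is a sketch: the 640x480 values are an assumption, and `exact` constraints can be rejected if the camera cannot deliver that resolution (`ideal` is a softer request).

```javascript
// Request a fixed capture size so every frame handed to
// tf.browser.fromPixels has the same shape, hitting the same
// shader/texture cache entries instead of growing the cache.
const FIXED_CONSTRAINTS = {
  video: { width: { exact: 640 }, height: { exact: 480 } },
};

// In the browser:
// const stream = await navigator.mediaDevices.getUserMedia(FIXED_CONSTRAINTS);
// webcamElement.srcObject = stream;
```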