tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

optimizer.minimize + tf.maximum breaks tf.memory().numBytes counter #584

Closed. justadudewhohacks closed this issue 5 years ago.

justadudewhohacks commented 6 years ago

TensorFlow.js version

Browser version

Describe the problem or feature request

There seems to be a memory leak in optimizer.minimize, regardless of whether I use adam or sgd:

const loss = optimizer.minimize(() => {
  const outTensor = window.net.forwardInput(batchInput, inputSize)
  return tf.sum(outTensor)
}, true)

loss.dispose()
await tf.nextFrame()
console.log(tf.memory())

After some time, Chrome's memory usage grows to multiple GB. Logging tf.memory() furthermore reveals some strange decrementing of numBytes (the tracked memory of tensors in RAM, I guess?):

[Screenshot: minimize_memory_leak, tf.memory() log with numBytes decrementing across iterations]
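
For reference, a minimal way to watch these counters per iteration might look like the following sketch (the trainStep callback is a placeholder; tf.memory() returns numBytes and numTensors among other fields):

async function runStepWithMemoryLog(optimizer, trainStep) {
  const before = tf.memory()
  // returnCost = true, so minimize returns the loss tensor and it must be disposed
  const loss = optimizer.minimize(trainStep, true)
  loss.dispose()
  await tf.nextFrame()
  const after = tf.memory()
  // in a leak-free step both counters should return to their pre-step values
  console.log(`numTensors ${before.numTensors} -> ${after.numTensors}`)
  console.log(`numBytes ${before.numBytes} -> ${after.numBytes}`)
}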

Just to point out, the leak isn't occurring due to net.forwardInput or tf.sum, since the following code runs without any leaks:

const outTensor = window.net.forwardInput(batchInput, inputSize)
const sum = tf.sum(outTensor)
outTensor.dispose()
sum.dispose()
await tf.nextFrame()
console.log(tf.memory())

[Screenshot: forward, tf.memory() output for the forward-only code above]
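
An equivalent control wrapped in tf.tidy (my own variant, not from the original report) would rule out mistakes in the manual dispose calls, since tidy releases every intermediate tensor created inside the callback:

tf.tidy(() => {
  const outTensor = window.net.forwardInput(batchInput, inputSize)
  const sum = tf.sum(outTensor)
  sum.dataSync() // force execution; tidy disposes outTensor and sum on exit
})
await tf.nextFrame()
console.log(tf.memory())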

Edit:

Some more clarification: the net is a combination of separableConv2d and max pooling ops, with a single 1x1 convolution at the end. The output of net.forwardInput in the example is a 1x13x13x25 tensor.

It might be that the issue is due to backpropagation through separableConv2d.
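
For illustration, a forward pass in the spirit of the net described above could look like the sketch below (filter shapes, channel counts, and the input size are assumptions for the example, not the actual weights):

function forwardSketch(x) { // x e.g. a [1, 416, 416, 3] input tensor
  return tf.tidy(() => {
    // one separable convolution block: depthwise filter [h, w, inCh, multiplier],
    // pointwise filter [1, 1, inCh * multiplier, outCh]
    const depthwiseFilter = tf.randomNormal([3, 3, 3, 1])
    const pointwiseFilter = tf.randomNormal([1, 1, 3, 16])
    let out = tf.separableConv2d(x, depthwiseFilter, pointwiseFilter, 1, 'same')
    out = tf.maxPool(out, 2, 2, 'same')
    // ...more separableConv2d + maxPool blocks until the spatial size is 13x13...
    // final 1x1 convolution producing the 25 output channels
    const finalFilter = tf.randomNormal([1, 1, 16, 25])
    return tf.conv2d(out, finalFilter, 1, 'same')
  })
}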


I also ran the above example with a different net, consisting of conv2d and max pooling ops and producing a 1x136 output tensor. Running the exact same code, I get different results for tfjs-core 0.11.9 and tfjs-core 0.12.9:

tfjs-core 0.11.9: works fine, without any leaks

tfjs-core 0.12.9: crashes in the first iteration, causing Chrome's memory to quickly rise above 3 GB

dsmilkov commented 6 years ago

Thanks for investigating this. Is there a chance you could share your repo with us? In the meantime, I'll try to reproduce this based on your pointers.

justadudewhohacks commented 6 years ago

Not sure if it helps, but here is the tiny-yolov2-seperable-conv2d branch I am currently working on, which is where I am facing the issue.

Basically the first issue occurs when backpropagating through the tiny yolov2 implementation with separable convolutions. The code for training is under /tools/train/tinyYolov2.

The second issue occurs when backpropagating through the face landmark net (/tools/train/faceLandmarks).

As soon as I have time, I will try to set up a repo with some simpler example code to reproduce the issue, which should be easier to debug.

justadudewhohacks commented 6 years ago

Okay, after setting up an example repo to reproduce the issue, I figured out that the issue is not related to tf.separableConv2d; it's caused by tf.maximum.

Example repo is here.

justadudewhohacks commented 6 years ago

Okay, after spending some more time on this problem, I figured out that using tf.maximum only messes with the numBytes counter, as shown in the screenshot. Apparently it doesn't cause the memory leak I am facing.
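
A stripped-down sketch of the kind of loop that exhibits the counter anomaly might look like this (a toy variable and SGD stand in for the real net; this is my reduction, not the exact code from the example repo):

const weights = tf.variable(tf.randomNormal([4]))
const optimizer = tf.train.sgd(0.1)

for (let i = 0; i < 10; i++) {
  // the loss goes through tf.maximum, which is what throws off the numBytes counter
  const loss = optimizer.minimize(() => tf.sum(tf.maximum(weights, 0)), true)
  loss.dispose()
  console.log(tf.memory().numBytes, tf.memory().numTensors)
}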

I found what's causing the memory leak and opened a separate issue for that: #604

dsmilkov commented 5 years ago

Closing since this got fixed.