Open Vectorrent opened 5 months ago
Hi, @Vectorrent
Thank you for bringing this issue to our attention. I tried to replicate the same behaviour on my macOS machine, and I am getting the output below with the includeOptimizer: true flag; as you mentioned, the issue does not happen with includeOptimizer: false, so I observed the same thing. The workaround is to disable the includeOptimizer flag when saving the model. This avoids saving the optimizer state, preventing the leak; however, you'll then need to recreate the optimizer when loading the model. Alternatively, TensorFlow.js provides functions for manual memory management. You can try the following approach after each save (please refer to the official documentation for tf.tidy and tf.dispose):
```js
await model.save(`file://saved_model`, { includeOptimizer: true });

// Manually dispose of the optimizer
model.optimizer.dispose();

// Dispose of other unused tensors
tf.dispose(xs);
tf.dispose(ys);
```
Please let me know if I have missed anything here. Thank you for your cooperation and patience.
Thanks for the quick response. Sadly, tf.tidy() has no effect, and tf.dispose() crashes my training session (for obvious reasons). So neither of these is a "solution", and we should probably fix the underlying bug in the library. I might have some time to dig into the TFJS code and troubleshoot that at some point.
Until then, my solution is to 1) create a manual training loop, 2) save the model, 3) unload the model, 4) re-load the model, 5) resume training. Not a great solution, if you ask me :rofl:
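That loop can be sketched roughly as follows. The TFJS-specific calls (tf.loadLayersModel, model.fit, model.save, model.dispose) are replaced with stubs here so the control flow is runnable without a TFJS install; the paths and chunk counts are made up for illustration.

```javascript
// Sketch of the save/unload/reload workaround. Only the control flow is
// real; the model object is a stub standing in for a TFJS LayersModel.
async function loadModel(path) {
  // Real code: await tf.loadLayersModel(`file://${path}/model.json`)
  return {
    disposed: false,
    async save(url) { /* real code: await model.save(url, { includeOptimizer: true }) */ },
    dispose() { this.disposed = true; } // frees all of the model's tensors
  };
}

async function trainChunk(model) {
  // Real code: await model.fit(xs, ys, { epochs: 1 })
}

async function trainWithPeriodicReload(path, chunks) {
  let model = await loadModel(path);          // initial load
  const disposals = [];
  for (let i = 0; i < chunks; i++) {
    await trainChunk(model);                  // 1) manual training loop
    await model.save(`file://${path}`);       // 2) save the model
    model.dispose();                          // 3) unload it (reclaims the leaked tensor)
    disposals.push(model.disposed);
    model = await loadModel(path);            // 4) re-load, 5) resume training
  }
  model.dispose();
  return disposals;
}
```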
I cannot for the life of me figure out how to build TFJS locally on my computer, so I'm not really able to debug or test this properly. Regardless, I've been digging, and this is probably where we need to apply a fix: https://github.com/tensorflow/tfjs/blob/master/tfjs-layers/src/engine/training.ts#L2146
If I had to guess, maybe it's related to the use of io.concatenateArrayBuffers here? Apparently, it's deprecated, and we should be using tf.io.CompositeArrayBuffer.join() instead.
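For context, io.concatenateArrayBuffers just copies a list of ArrayBuffers into a single one. A plain-JS equivalent (a sketch of the behaviour, not the actual TFJS source) looks like this:

```javascript
// Plain-JS equivalent of what io.concatenateArrayBuffers does: allocate one
// buffer large enough for all inputs, then copy each input into it in order.
function concatenateArrayBuffers(buffers) {
  const totalBytes = buffers.reduce((n, b) => n + b.byteLength, 0);
  const out = new Uint8Array(totalBytes);
  let offset = 0;
  for (const b of buffers) {
    out.set(new Uint8Array(b), offset);
    offset += b.byteLength;
  }
  return out.buffer;
}
```

Note that this is a pure byte copy and never touches tensors itself, so if the leak really is around this call, it would have to come from how the weight data is gathered before it is concatenated.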
I wrapped the saving of the model in tf.engine().startScope() and tf.engine().endScope() to prevent the leaking tensor.
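For anyone unfamiliar with why that helps: tensors created while a scope is open are tracked by the engine and disposed when the scope ends, so a tensor leaked inside model.save() gets reclaimed at endScope(). A minimal mock of that idea (NOT the real TFJS engine, just the concept) looks like this:

```javascript
// Minimal mock of TFJS scope tracking: tensors tracked while a scope is
// open are disposed when the scope ends; tensors tracked outside any
// scope stay live. This mimics tf.engine().startScope()/endScope().
class MockEngine {
  constructor() { this.scopeStack = []; this.liveTensors = new Set(); }
  startScope() { this.scopeStack.push([]); }
  track(tensor) {
    this.liveTensors.add(tensor);
    const scope = this.scopeStack[this.scopeStack.length - 1];
    if (scope) scope.push(tensor); // only recorded if a scope is open
    return tensor;
  }
  endScope() {
    for (const t of this.scopeStack.pop() || []) this.liveTensors.delete(t);
  }
}

const engine = new MockEngine();
engine.startScope();                              // tf.engine().startScope()
engine.track({ name: 'optimizer-weight-copy' });  // tensor leaked by model.save()
engine.endScope();                                // tf.engine().endScope()
// engine.liveTensors.size is now 0: the leaked tensor was reclaimed
```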
System information
Describe the current behavior: When using tensorflow-node-gpu for training, I periodically save models to disk. However, my training has been crashing, and I've just learned why: when model.save() includes the optimizer, a single tensor is leaked. This leads to the slow accumulation of unnecessary tensors, and crashes my computer after some amount of time.
To be clear, this is before saving a model: [screenshot]
And this is after: [screenshot]
Describe the expected behavior: I would expect model-saving to dispose of all unused tensors after the operation is complete.
Standalone code to reproduce the issue: This bug is 100% reproducible in both tfjs-node and tfjs-node-gpu: [reproduction code omitted]
Other info / logs: If the includeOptimizer flag is disabled, then this does not occur.