@danwexler just want to make sure there are no memory leaks in your preprocessing code:
for (i=0; i < N; ++i) stack.push(tf.node.decodeImage(fs.readFileSync(filename), 3))
I assume you have disposed the tensors within the stack array? Can you show the tf.memory() output before and after the inference? Thanks.
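For example, logging around a single predict call would be enough (a minimal sketch; `model` and `batch` stand in for your real objects):
import * as tf from '@tensorflow/tfjs-node-gpu';

// Minimal sketch: log tensor and byte counts around one inference call.
// `model` and `batch` are placeholders for the real model and input Tensor4D.
async function checkMemory(model: tf.LayersModel, batch: tf.Tensor4D) {
  console.log('before predict:', tf.memory());
  const output = model.predict(batch) as tf.Tensor4D;
  await output.data();   // make sure the GPU work has actually completed
  output.dispose();
  batch.dispose();
  console.log('after predict:', tf.memory());
}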
Yes, apologies, I was just mocking the real function in the bug report. As I said, I print tf.memory() after each frame to ensure there are no additional tensors or memory allocated. Here's the full code for my pre-fetch function:
const stackTensors = (imagesExp: string, batchStartFrame: number, batchSize: number) => {
  const tensors: Array<tf.Tensor3D> = []
  for (let j = 0; j < batchSize; ++j) {
    const idx = batchStartFrame + j + 1
    const frame = sprintf(imagesExp, idx)                     // expand the filename pattern for this frame
    const tensor: tf.Tensor3D = loadImageAsTensor3D(frame)    // decode the frame file into a Tensor3D
    if (tensor) tensors.push(tensor)
  }
  const batch: tf.Tensor4D = <tf.Tensor4D> tf.stack(tensors)  // combine the frames into a single Tensor4D batch
  tensors.forEach(tensor => tensor.dispose())                 // release the per-frame tensors once stacked
  return batch
}
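And the calling loop is essentially this shape (a simplified sketch; the filename pattern and frame counts are illustrative, not my real values):
// Simplified sketch of the per-batch loop around stackTensors().
// The filename pattern and counts below are illustrative placeholders.
async function upscaleAll(model: tf.LayersModel, totalFrames: number, batchSize: number) {
  for (let start = 0; start < totalFrames; start += batchSize) {
    const batch = stackTensors('frame_%06d.png', start, batchSize);
    const output = model.predict(batch) as tf.Tensor4D;
    await output.data();      // pull the upscaled frames off the GPU
    batch.dispose();          // release the input batch
    output.dispose();         // release the output batch
    console.log(tf.memory()); // numTensors/numBytes stay flat from frame to frame
  }
}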
Here's a typical output from tf.memory():
2021-10-15T01:05:21.796333333Z Task starting:
2021-10-15T01:05:21.796398331Z {
2021-10-15T01:05:21.796486309Z "TensorFlowMemory": {
2021-10-15T01:05:21.796489621Z "unreliable": true,
2021-10-15T01:05:21.796492909Z "numTensors": 308,
2021-10-15T01:05:21.796496160Z "numDataBuffers": 308,
2021-10-15T01:05:21.796499492Z "numBytes": 16704400
2021-10-15T01:05:21.796502823Z }
2021-10-15T01:05:21.796506027Z }
2021-10-15T01:05:25.044486894Z Task completed:
2021-10-15T01:05:25.044580754Z {
2021-10-15T01:05:25.044670002Z "TensorFlowMemory": {
2021-10-15T01:05:25.044672942Z "unreliable": true,
2021-10-15T01:05:25.044675892Z "numTensors": 308,
2021-10-15T01:05:25.044678802Z "numDataBuffers": 308,
2021-10-15T01:05:25.044681744Z "numBytes": 16704400
2021-10-15T01:05:25.044684701Z }
2021-10-15T01:05:25.044687538Z }
The memory that stays allocated is the core upscale layers model itself, after warmup/predict.
@danwexler The other thing I want to confirm: are you using a TFJS model or a TF SavedModel for inference?
I'm using a pretrained super-resolution model loaded from a cached version of the Idealo ESRGAN. The model is currently loaded from Unpkg at this location: https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans using tf.loadLayersModel(). That version is provided by the author of the npm upscaler package.
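For reference, the load itself is just the standard layers-model call; roughly (the exact model.json filename under that URL is an assumption on my part):
import * as tf from '@tensorflow/tfjs-node-gpu';

// Sketch of the model load. The '/model.json' suffix is an assumption;
// adjust it to wherever the topology file actually lives under that package.
const MODEL_URL = 'https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans/model.json';

async function loadSuperresModel(): Promise<tf.LayersModel> {
  const model = await tf.loadLayersModel(MODEL_URL);
  // Optional warm-up on a small dummy input; tf.tidy releases the temporaries.
  tf.tidy(() => { model.predict(tf.zeros([1, 64, 64, 3])); });
  return model;
}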
IOW, this is not a TFJS-provided model from TFHub, and I do believe it is a TF saved model. Please correct me if I'm wrong as I did not do the original training. I feel very much like I need to understand more about the internals of how models work in order to understand this issue.
I believe these are the model files: gans.zip
Looking at this model file, it seems to be a Keras 2.4.0 model converted using the TFJS Converter v2.0.1.post1.
FYI, this is all part of an unannounced product in development that lets you run TFJS models both locally in the browser and at scale on a dedicated cluster of cloud VMs. So I run this code both in tfjs-node-gpu and in the browser with tfjs; however, the browser is typically used to adjust settings on a single frame rather than rendering the entire video. You can run the entire video processing locally too, it just runs much faster when split across multiple VMs and on bigger GPUs.
@danwexler Are you using CUDA 11.2? I believe TF 2.5.0+ requires at least 11.2. It seems this problem is fixed in the upcoming TF 2.7.0: https://github.com/tensorflow/tensorflow/issues/48545
Understood. Good info. Unfortunately, 11.2 is not available using the default Google Kubernetes Engine (GKE) nvidia-driver-installer Daemon Set.
I've upgraded to the tensorflow/tensorflow:nightly-gpu Docker base image, and upgraded my GKE backplane to the Rapid channel, since the backplane version determines the base NVIDIA driver and CUDA version. Unfortunately, it looks like that still installs only CUDA v11.0.
I believe there is a way to install a different driver than the one GKE installs based on the backplane version. Do you know of any documentation or instructions on how to upgrade the CUDA version on a GKE VM via the standard nvidia-driver-installer Daemon Set?
This is not a blocking issue for me during development. I'll be testing workarounds while I wait for the TF 2.7.0 release.
However, it would be great if there were a way to reuse existing allocations rather than re-allocating the same large tensors for data pre-fetch and model.predict(). That would definitely avoid fragmentation, with user control at the API level. Otherwise, it seems to me that the current allocator is just not optimized to detect existing free blocks to re-use, at least for larger blocks? Hopefully the cuda_malloc_async allocator is an improvement in this regard. Alternatively, I plan to look at tf.engine.reset() to clear out the entire allocator and re-load my model from scratch every N frames. Any other workarounds I should explore?
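In case it's useful, here is a rough sketch of the periodic-reload workaround I have in mind (RELOAD_EVERY, MODEL_URL, and runBatch are illustrative placeholders, not my real code):
import * as tf from '@tensorflow/tfjs-node-gpu';

// Rough sketch of the "dispose and re-load every N batches" workaround.
// RELOAD_EVERY, MODEL_URL, and runBatch are illustrative placeholders.
const RELOAD_EVERY = 50;
const MODEL_URL = 'https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans/model.json';

async function processAllBatches(
  numBatches: number,
  runBatch: (model: tf.LayersModel, batchIndex: number) => Promise<void>
) {
  let model = await tf.loadLayersModel(MODEL_URL);
  for (let i = 0; i < numBatches; ++i) {
    if (i > 0 && i % RELOAD_EVERY === 0) {
      model.dispose();                              // release the model's weight tensors
      model = await tf.loadLayersModel(MODEL_URL);  // re-upload fresh weights to the GPU
    }
    await runBatch(model, i);
  }
  model.dispose();
}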
@danwexler An engine reset would de-allocate all of the model's weight tensors; you would need to recreate and upload them to the GPU again, and I am not sure it would improve GPU memory fragmentation.
Using tfjs-node-gpu on a GKE cluster running on an n1-highmem-8 with an NVIDIA P4 or V100 GPU fails when the cuda_malloc_async allocator is set using TF_GPU_ALLOCATOR.
System information
Describe the current behavior
The app is a video filter that loads a super-resolution layers model and applies it to each frame in a video file, batching N frames together into a Tensor4D to scale up the resolution by 4x. I run tf.memory() after each frame to ensure that I am not leaking any tensors. After processing slightly more than 100x 1280x720 frames correctly, TF bails out and dumps the memory allocations, as well as displaying the message:

However, when I do set TF_GPU_ALLOCATOR=cuda_malloc_async, my normally correct startup process fails with:

Describe the expected behavior
My primary issue is being able to use model.predict() on several hundred video frames, grouped together into batches, without running out of memory. I have eliminated any tensor leaks according to tf.memory(), so I'm not sure what to try next. I have seen discussions mentioning tf.engine.startScope/endScope (a rough sketch of that pattern is included below), and I can also try dispose()ing my model every N frames and re-loading it, or even tf.engine.reset() every N frames, but these seem like band-aids for internal TFJS issues.

I do not explicitly allocate any TF variables within my code, so I do not expect tf.disposeVariables() to help. Is it possible that the model allocates variables internally that would benefit from running tf.disposeVariables() every frame?

I repeat the same allocation pattern for each video frame batch, but I cannot find any way of re-using the existing Tensors to avoid fragmentation.
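For reference, the startScope/endScope pattern I've seen mentioned looks roughly like this (a sketch; superresModel and batch are placeholder names, not code from my app):
import * as tf from '@tensorflow/tfjs-node-gpu';

// Sketch of the tf.engine() scope pattern referenced above. Tensors created
// between startScope() and endScope() are released at endScope(), so results
// must be read out (or tf.keep()-ed) before the scope ends.
async function runScopedBatch(superresModel: tf.LayersModel, batch: tf.Tensor4D): Promise<Float32Array> {
  tf.engine().startScope();
  const upscaled = superresModel.predict(batch) as tf.Tensor4D;
  const pixels = (await upscaled.data()) as Float32Array; // copy results off the GPU while still in scope
  tf.engine().endScope(); // disposes `upscaled` and any intermediates created since startScope()
  return pixels;
}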
Standalone code to reproduce the issue
Producing repro code is possible, but a significant effort. If there are no simple answers to this issue, then I will take the necessary time to mock up a repro.
Basically, I start by decoding frames into separate files using ffmpeg. Then the processing loop pre-fetches the next batch of N frames (N is typically 1-10) into a Tensor4D by loading the individual frames:
Once pre-fetched, processing is just:
superresModel.predict(t4d)
Once the output batch is finished, I extract the individual frames and save them back to new output files using:
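Roughly (a sketch with illustrative filenames, not the exact code):
import * as tf from '@tensorflow/tfjs-node-gpu';
import * as fs from 'fs';

// Sketch: split the output batch back into frames and write each out as a PNG.
// Filenames are illustrative; this assumes the model output is already 0-255.
async function writeFrames(output: tf.Tensor4D, batchStartFrame: number) {
  const frames = tf.unstack(output) as tf.Tensor3D[];
  for (let j = 0; j < frames.length; ++j) {
    const intFrame = tf.tidy(() => frames[j].clipByValue(0, 255).toInt() as tf.Tensor3D);
    const png = await tf.node.encodePng(intFrame);
    fs.writeFileSync(`out_${batchStartFrame + j + 1}.png`, png);
    intFrame.dispose();
    frames[j].dispose();
  }
}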
After all batches are finished, I just call ffmpeg again to re-encode the output frame files.

Other info / logs
err.log nvidia_smi.log