tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

[tensorflow/tfjs][tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error #5740

Closed danwexler closed 2 years ago

danwexler commented 3 years ago

Using tfjs-node-gpu on a GKE cluster running on an n1-highmem-8 with an NVIDIA P4 or V100 GPU fails when the cuda_malloc_async allocator is set using TF_GPU_ALLOCATOR.

System information

Describe the current behavior

The app is a video filter that applies a super-resolution layers model to each frame of a video file, batching N frames together into a Tensor4D to scale the resolution up by 4x. I run tf.memory() after each frame to ensure that I am not leaking any tensors. After correctly processing slightly more than 100 1280x720 frames, TF bails out, dumps the memory allocations, and displays the message:

If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.

However, when I do set TF_GPU_ALLOCATOR=cuda_malloc_async, my normally correct startup process fails with:

tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:72] Failed to get device attribute: CUDA error: invalid argument (CUDA_ERROR_INVALID_VALUE)
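For reference, a minimal sketch of how I select the allocator (illustrative only; the variable has to be in the environment before TensorFlow initializes the GPU devices, so setting it before loading the binding is the safe option):

// Illustrative sketch only: TF_GPU_ALLOCATOR must be set before the native
// TensorFlow runtime creates its GPU devices, so set it before loading the binding.
process.env.TF_GPU_ALLOCATOR = 'cuda_malloc_async'
const tf = require('@tensorflow/tfjs-node-gpu')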

Describe the expected behavior

My primary issue is being able to run model.predict() on several hundred video frames, grouped together into batches, without running out of memory. I have eliminated any tensor leaks according to tf.memory(), so I'm not sure what to try next. I have seen discussions mentioning tf.engine().startScope()/endScope(), and I could also try dispose()ing my model every N frames and re-loading it, or even tf.engine().reset() every N frames, but these seem like band-aids for internal TFJS issues.
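For reference, this is the kind of scoping I was considering (a sketch only, with a hypothetical saveBatch() helper; I have not verified it changes anything, since tf.memory() already reports no leaks):

// Sketch: scope each batch so any intermediates predict() creates are released together.
tf.engine().startScope()
const output = superresModel.predict(batch) as tf.Tensor4D
await saveBatch(output) // hypothetical helper that encodes and writes each frame
tf.engine().endScope()  // disposes every tensor created since startScope()
batch.dispose()         // the input batch was created outside the scope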

I do not explicitly allocate any TF variables within my code, so I do not expect tf.disposeVariables() to help. Is it possible that the model allocates variables internally that would benefit from running tf.disposeVariables() every frame?

I repeat the same allocation pattern for each video frame batch, but I cannot find any way of re-using the existing Tensors to avoid fragmentation.

Standalone code to reproduce the issue

Producing repro code is possible, but a significant effort. If there are no simple answers to this issue, then I will take the necessary time to mock up a repro.

Basically, I start by decoding frames into separate files using ffmpeg. Then the processing loop pre-fetches the next batch of N frames (N is typically 1-10) into a Tensor4D by loading the individual frames:

const stack = []
for (let i = 0; i < N; ++i) {
  stack.push(tf.node.decodeImage(fs.readFileSync(filename), 3)) // one Tensor3D per frame
}
const t4d = tf.stack(stack) // shape [N, height, width, 3]

Once pre-fetched, processing is just: superresModel.predict(t4d)
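More concretely, with disposal of the input batch it looks roughly like this (a simplified sketch, not the exact production code):

const outputBatch = superresModel.predict(t4d) as tf.Tensor4D // [N, 4*height, 4*width, 3]
t4d.dispose() // the input batch is no longer needed once predict() returns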

Once the output batch is finished, I extract the individual frames and save them back to new output files using:

const saveTensor3DAsImageFile = async (tensor, frameIdx, dstExpr) => {
  const executedImage = await tf.node.encodePng(tensor)
  tensor.dispose()
  const filename = sprintf(dstExpr, frameIdx) // image output path
  fs.writeFileSync(filename, executedImage)
}
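The per-batch driver that calls it is roughly (again a sketch; dstExpr is the sprintf-style output path template):

const frames = tf.unstack(outputBatch) as tf.Tensor3D[] // one upscaled Tensor3D per frame
outputBatch.dispose()
for (let j = 0; j < frames.length; ++j) {
  await saveTensor3DAsImageFile(frames[j], batchStartFrame + j, dstExpr) // dispose()s each frame
}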

After all batches are finished, I just call ffmpeg again to re-encode the output frame files.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

err.log nvidia_smi.log

pyu10055 commented 3 years ago

@danwexler Just want to make sure there are no memory leaks in your preprocessing code:

for (i=0; i < N; ++i) stack.push(tf.node.decodeImage(fs.readFileSync(filename), 3))

I assume you have disposed the tensors within the stack array? Can you show the tf.memory() output before and after the inference? Thanks.

danwexler commented 3 years ago

Yes, apologies, I was just mocking up the real function in the bug report. As I said, I print tf.memory() after each frame to ensure there are no additional tensors or memory allocated. Here's the full code for my pre-fetch function:

const stackTensors = (imagesExp: string, batchStartFrame: number, batchSize: number) => {
  const tensors: Array<tf.Tensor3D> = []
  for (let j = 0; j < batchSize; ++j) {
    const idx = batchStartFrame + j + 1
    const frame = sprintf(imagesExp, idx) // path of the frame at idx
    const tensor: tf.Tensor3D = loadImageAsTensor3D(frame)
    if (tensor) tensors.push(tensor)
  }
  const batch: tf.Tensor4D = <tf.Tensor4D> tf.stack(tensors) // [batchSize, height, width, 3]
  tensors.forEach(tensor => tensor.dispose()) // per-frame tensors are no longer needed
  return batch
}

danwexler commented 3 years ago

Here's a typical output from tf.memory():

2021-10-15T01:05:21.796333333Z Task starting:
2021-10-15T01:05:21.796398331Z {
2021-10-15T01:05:21.796486309Z   "TensorFlowMemory": {
2021-10-15T01:05:21.796489621Z     "unreliable": true,
2021-10-15T01:05:21.796492909Z     "numTensors": 308,
2021-10-15T01:05:21.796496160Z     "numDataBuffers": 308,
2021-10-15T01:05:21.796499492Z     "numBytes": 16704400
2021-10-15T01:05:21.796502823Z   }
2021-10-15T01:05:21.796506027Z }
2021-10-15T01:05:25.044486894Z Task completed:
2021-10-15T01:05:25.044580754Z {
2021-10-15T01:05:25.044670002Z   "TensorFlowMemory": {
2021-10-15T01:05:25.044672942Z     "unreliable": true,
2021-10-15T01:05:25.044675892Z     "numTensors": 308,
2021-10-15T01:05:25.044678802Z     "numDataBuffers": 308,
2021-10-15T01:05:25.044681744Z     "numBytes": 16704400
2021-10-15T01:05:25.044684701Z   }
2021-10-15T01:05:25.044687538Z }

The allocated memory is the core upscale layer model, after warmup/predict.

pyu10055 commented 3 years ago

@danwexler The other thing I want to confirm: are you using a TFJS model or a TF SavedModel for inference?

danwexler commented 3 years ago

I'm using a pretrained super-resolution model loaded from a cached version of the Idealo ESRGAN. The model is currently loaded from Unpkg at this location: https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans using tf.loadLayersModel(). That version is provided by the author of the npm upscaler package.
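For completeness, the load itself is just (a sketch; the model.json file name under that package path is an assumption, since that is the converter's default output name):

// Sketch: loading the converted layers model over HTTP in tfjs-node-gpu.
const MODEL_URL = 'https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans/model.json'
const superresModel = await tf.loadLayersModel(MODEL_URL)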

IOW, this is not a TFJS-provided model from TFHub, and I do believe it is a TF saved model. Please correct me if I'm wrong as I did not do the original training. I feel very much like I need to understand more about the internals of how models work in order to understand this issue.

I believe these are the model files: gans.zip

Looking at this model file, it seems to be a Keras 2.4.0 model converted using the TFJS Converter v2.0.1.post1

danwexler commented 3 years ago

FYI, this is all part of an unannounced product in development that lets you run TFJS models both locally in the browser and at scale on a dedicated cluster of cloud VMs. So I run this code both in tfjs-node-gpu and in the browser with tfjs; however, the browser is typically used to adjust settings on a single frame rather than to render the entire video. You can run the entire video processing locally too, it just runs much faster when split up across multiple VMs and on bigger GPUs.

pyu10055 commented 3 years ago

@danwexler Are you using CUDA 11.2? I believe TF 2.5.0+ requires at least 11.2. It seems this problem is fixed in the upcoming TF 2.7.0: https://github.com/tensorflow/tensorflow/issues/48545

danwexler commented 3 years ago

Understood. Good info. Unfortunately, 11.2 is not available using the default Google Kubernetes Engine (GKE) nvidia-driver-installer DaemonSet.

I've upgraded to the tensorflow/tensorflow:nightly-gpu Docker base image, and upgraded my GKE control plane to the Rapid channel, since the control plane version determines the base NVIDIA driver and CUDA version. Unfortunately, it looks like that still installs only CUDA v11.0.

I believe there is a way to install a different driver than the one GKE installs based on the control plane version. Do you know of any documentation or instructions on how to upgrade the CUDA version on a GKE VM via the standard nvidia-driver-installer DaemonSet?

This is not a blocking issue for me during development. I'll be testing workarounds while I wait for the TF 2.7.0 release.

However, it would be great if there were a way to reuse existing allocations rather than re-allocating the same large tensors for data pre-fetch and model.predict(). That would avoid fragmentation entirely, with user control at the API level. Otherwise, it seems to me that the current allocator is just not optimized to find and re-use existing free blocks, at least for larger blocks. Hopefully the cuda_malloc_async allocator is an improvement in this regard. Alternatively, I plan to look at tf.engine().reset() to clear out the entire allocator and re-load my model from scratch every N frames, as sketched below. Any other workarounds I should explore?
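The reset workaround I have in mind looks something like this (a sketch only; RESET_INTERVAL and loadSuperresModel() are placeholders, and whether this actually helps fragmentation is exactly what I'd be testing):

// Sketch: periodically tear the whole engine down and reload the model.
if (batchIdx > 0 && batchIdx % RESET_INTERVAL === 0) {
  superresModel.dispose()                   // release the model's weight tensors
  tf.engine().reset()                       // drop every remaining allocation in the backend
  superresModel = await loadSuperresModel() // re-create the model and upload weights again
}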

pyu10055 commented 3 years ago

@danwexler Engine reset would de-allocate all of the weight tensors for your model; you would need to recreate them and upload them to the GPU again, and I am not sure it would improve GPU memory fragmentation.
