patlevin / tfjs-to-tf

A TensorFlow.js Graph Model Converter
MIT License

Including tfjs_graph_converter disables GPU #35

Closed. bhelm closed this issue 2 years ago.

bhelm commented 2 years ago

Demo Script:

import os
import tensorflow as tf
print("cuda devices before include", os.getenv('CUDA_VISIBLE_DEVICES'))

import tfjs_graph_converter.api as tfjs_api
import tfjs_graph_converter.util as tfjs_util

print("cuda devices after include", os.getenv('CUDA_VISIBLE_DEVICES'))
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))

output:

cuda devices before include None
cuda devices after include -1
Num GPUs: 0

An os.unsetenv('CUDA_VISIBLE_DEVICES') call directly after the import works around the problem.
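
For reference, a minimal sketch of that workaround (assuming TensorFlow hasn't initialised its CUDA devices by the time the variable is cleared):

import os

import tfjs_graph_converter.api as tfjs_api  # this import sets CUDA_VISIBLE_DEVICES to "-1"

# Clear the variable again before TensorFlow initialises CUDA.
# os.unsetenv() updates the C-level environment but not os.environ,
# so clearing both is the safer bet.
os.unsetenv('CUDA_VISIBLE_DEVICES')
os.environ.pop('CUDA_VISIBLE_DEVICES', None)

import tensorflow as tf

print("Num GPUs:", len(tf.config.list_physical_devices('GPU')))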

My System: Debian 10, Python 3.8/3.9 with

tensorflow                   2.8.0
tensorflow-hub               0.12.0
tensorflow-io-gcs-filesystem 0.24.0
tensorflowjs                 3.13.0
tfjs-graph-converter         1.4.2

This is where it happens in the source, in tfjs_graph_converter/__init__.py:

# disable CUDA devices - we only want the CPU do work with data
os.environ['CUDA_VISIBLE_DEVICES'] = "-1"

Importing a dependency should not set CUDA_VISIBLE_DEVICES or any other environment variable, as this can cause side effects in other applications. In my case, this problem caused virtual-webcam not to use GPU acceleration even though it is available, leading to high load and poor latency without any hint as to why. It took me six hours to figure this out. It may be possible to scope the "disable GPU for conversion" behaviour to the conversion functions only, if that is really required, or to wrap it in a function so the developer using the library can decide; a sketch of the wrapping idea follows below.
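
For illustration, a minimal sketch of such a wrapper (cpu_only is a hypothetical helper, not part of the library, and it only helps if TensorFlow hasn't already initialised CUDA):

import os
from contextlib import contextmanager

@contextmanager
def cpu_only():
    # Hypothetical helper: hide CUDA devices for the duration of a
    # conversion, then restore the previous value instead of leaving
    # "-1" behind for the rest of the process.
    previous = os.environ.get('CUDA_VISIBLE_DEVICES')
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop('CUDA_VISIBLE_DEVICES', None)
        else:
            os.environ['CUDA_VISIBLE_DEVICES'] = previous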

Thank you.

patlevin commented 2 years ago

Importing a dependency should not set CUDA_VISIBLE_DEVICES or any other environment variable, as this can cause side effects in other applications.

Well, at the time this library was started, there was no other way to (reliably) control the device used by Tensorflow. Since configuration happened at initialisation time (i.e. once tf was first imported), changing environment variables was the only option.

Since os.environ only affects the current process, and this library was meant to be used for converting rather than as a loader, I didn't see a problem with this.

To my knowledge, using environment variables is still the only way to control the log level from Python, but I will take another look at limiting visible devices via API calls and whether that has changed in newer versions.
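
For context, the TF 2.x API route mentioned here would look roughly like this (it must run before any device is initialised, otherwise TensorFlow raises a RuntimeError):

import tensorflow as tf

# Hide all GPUs from this process; has to happen before the runtime
# initialises its devices.
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices('GPU'))  # -> []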

bhelm commented 2 years ago

I understand. Maybe using with tf.device("/cpu:0") would help, but I'm just guessing.
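
That suggestion would look roughly like this (as the next comment explains, it doesn't reach the converter's low-level code paths):

import tensorflow as tf

# Pin the ops created inside the block to the CPU; this only affects
# operations placed through the Python API.
with tf.device('/cpu:0'):
    x = tf.constant([1.0, 2.0]) * 2.0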

Thank you :)

patlevin commented 2 years ago

After about an hour of testing and another hour or so digging through both the Python and the C++ code base of Tensorflow, I can now say with confidence that there's no way to reliably disable NVIDIA GPUs using the TF Python API.

The converter doesn't use the TF2 interface, because it converts graph models, which aren't part of the TF2 compute model. It needs access to low-level APIs that aren't affected by the context manager used by tf.device or by TF1.x's ConfigProto.

The graph model and the internal optimizer ("grappler" in TF-lingo) don't know about the Python API and rely on the C++ platform manager instead. So while the Python API can be used to set the device for model training and inference, it doesn't affect any low-level processing like graph manipulation.

The environment variable (CUDA_VISIBLE_DEVICES) isn't even known by Tensorflow - it's an NVIDIA driver configuration.

The reason the converter needs to run on the CPU is memory and compute capability. The CUDA driver likes to lock up if models are converted on weaker (or just older) GPUs, and to error out if not enough memory is available. Running the converter on the CPU only (or on ROCm - I haven't tested on AMD or Intel GPUs using ROCm yet) ensures that converting doesn't randomly fail because the "wrong" NVIDIA GPU is installed on the system.

I will get back to this later tomorrow and test whether resetting the environment variable works for re-enabling CUDA after converting is finished.

patlevin commented 2 years ago

I've added the option to enable CUDA. This will be available in the next release.

Basically, the default will still be to run any script that imports the converter in CPU-only mode, but CUDA can optionally be enabled like so:

from typing import List
import sys
import tfjs_graph_converter as tfjs

def main(args: List[str]) -> None:
    if '--enable-cuda' in args:
        tfjs.api.enable_cuda()
    graph = tfjs.api.load_graph_model('models/some_tfjs_graph_model')
    model = tfjs.api.graph_to_function_v2(graph)
    inputs = ...
    # inference will run on CUDA-device if available
    result = model(inputs)
    # CUDA-capable GPU will be available for use with other libraries and tf functions, too

if __name__ == '__main__':
    main(sys.argv)

It doesn't really matter when and where enable_cuda() is called, as long as it happens before any Tensorflow or graph converter function is called.
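
So even this minimal ordering should work, assuming enable_cuda() behaves as described above:

import tfjs_graph_converter as tfjs

tfjs.api.enable_cuda()  # restore GPU visibility before Tensorflow is used

import tensorflow as tf

print("Num GPUs:", len(tf.config.list_physical_devices('GPU')))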

I'll document the change and package a release in a bit.

bhelm commented 2 years ago

OK, thank you for caring. I think that should do the trick. :+1: