Closed: bhelm closed this issue 2 years ago
Including a dependency should not set CUDA_VISIBLE_DEVICES or any other ENV, as this can cause side effects in other applications.
Well, at the time this library was started there was no other way to (reliably) control the device used by Tensorflow. Since configuration happened at initialisation time (i.e. once tf was first included), changing environment variables was the only option.
Since os.environ only affects the current process and this library wasn't meant to be used as a loader, but for converting, I didn't see a problem with this.
To my knowledge, using environment variables is still the only way to control the log level from Python, but I will take another look at limiting visible devices via API calls and whether that has changed with the newer versions.
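For context, the environment-variable approach looks roughly like this (a sketch only; I'm not claiming these are the exact values the converter sets):

```python
import os

# These must be in the environment before Tensorflow (and the CUDA driver)
# initialise, which is why the converter sets them at import time.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'   # hide all NVIDIA GPUs from this process
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'    # only show fatal messages from the C++ core

import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))   # -> [] since no GPU is visible
```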
I understand. Maybe using with tf.device("/cpu:0"), but I'm just guessing.
Thank you :)
After about an hour of testing and another hour or so digging through both the Python and the C++ code base of Tensorflow, I can now say with confidence that there's no way to reliably disable NVIDIA GPUs using the TF Python API.
The converter doesn't use the TF2 interface, because it converts graph models, which aren't used by the TF2 compute model. It needs to access low-level APIs that aren't affected by the context manager used by tf.device or by TF1.x's ConfigProto.
The graph model and the internal optimizer ("grappler" in TF-lingo) don't know about the Python API and rely on the C++ platform manager instead. So while the Python API can be used to set the device for model training and inference, it doesn't affect any low-level processing like graph manipulation.
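To make that concrete, these are the Python-level knobs I mean (illustrative only - per the above, none of them reaches the C++ platform manager that grappler talks to):

```python
import tensorflow as tf

# TF2: hide GPUs from eager execution / training / inference.
tf.config.set_visible_devices([], 'GPU')

# TF1.x/TF2: pin individual ops to the CPU via the context manager.
with tf.device('/cpu:0'):
    x = tf.constant([1.0, 2.0]) * 2.0

# TF1.x: the session-level equivalent.
config = tf.compat.v1.ConfigProto(device_count={'GPU': 0})
sess = tf.compat.v1.Session(config=config)
```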
The environment variable (CUDA_VISIBLE_DEVICES) isn't even known to Tensorflow - it's an NVIDIA driver configuration.
The reason the converter needs to run on the CPU is memory and compute capability. The CUDA driver likes to lock up if models are converted on weaker (or just older) GPUs and error out if not enough memory is available. Running the converter on the CPU only (or on ROCM - I haven't tested on AMD or Intel GPUs using ROCM yet) ensures that converting doesn't randomly fail because there's the "wrong" NVIDIA GPU installed on the system.
I will get back to this later tomorrow and test whether resetting the environment variable works for re-enabling CUDA after converting is finished.
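The idea would be something along these lines (just a sketch of the approach, not actual library code; the helper name is made up):

```python
import os

# Remember what the user had configured before the converter hid the GPUs.
_SAVED_CUDA_DEVICES = os.environ.get('CUDA_VISIBLE_DEVICES')  # None if it wasn't set


def restore_cuda_devices() -> None:
    """Hypothetical helper: undo the CPU-only override after converting."""
    if _SAVED_CUDA_DEVICES is None:
        os.environ.pop('CUDA_VISIBLE_DEVICES', None)
    else:
        os.environ['CUDA_VISIBLE_DEVICES'] = _SAVED_CUDA_DEVICES
```

Whether Tensorflow actually picks the restored value up depends on whether its CUDA runtime has already been initialised at that point - that's the part that needs testing.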
I've added the option to enable CUDA. This will be available in the next release.
Basically the default will still be to run any script that includes the converter in CPU-only mode, but optionally CUDA can be enabled like so:
```python
from typing import List
import sys

import tfjs_graph_converter as tfjs


def main(args: List[str]) -> None:
    if '--enable-cuda' in args:
        tfjs.api.enable_cuda()
    graph = tfjs.api.load_graph_model('models/some_tfjs_graph_model')
    model = tfjs.api.graph_to_function_v2(graph)
    inputs = ...
    # inference will run on CUDA-device if available
    result = model(inputs)
    # CUDA-capable GPU will be available for use with other libraries and tf functions, too


if __name__ == '__main__':
    main(sys.argv)
```
It doesn't really matter when and where enable_cuda() is called, as long as it happens before any Tensorflow or graph converter function is called.
I'll document the change and package a release in a bit.
Ok, thank you for caring, I think that should do the trick. :+1:
Demo Script:

Output:

An os.unsetenv('CUDA_VISIBLE_DEVICES') directly after the import works around the problem (see the sketch below).

My system: Debian 10, Python 3.8/3.9 with …

This is where it happens in the source: tfjs_graph_converter/__init__.py:

Including a dependency should not set CUDA_VISIBLE_DEVICES or any other ENV, as this can cause side effects in other applications. In my case, this problem caused virtual-webcam to not use GPU acceleration even though it is available, leading to high load and low latency without any hint as to why. It took me 6 hours to figure this out. It may be possible to scope the "disable GPU for conversion" behaviour to the conversion functions only, if it is really required, or to wrap it in a function so the developer using the library can decide.
Thank You.
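For reference, the workaround described in the post above boils down to this (a sketch; it simply drops the variable, so it assumes CUDA_VISIBLE_DEVICES wasn't set to a meaningful value before the import):

```python
import os

import tfjs_graph_converter as tfjs  # importing this sets CUDA_VISIBLE_DEVICES

# Remove the override again so that Tensorflow and other CUDA-based libraries in
# this process can still see the GPU. Deleting the key from os.environ also calls
# unsetenv(), so the change is visible to the CUDA driver as well.
os.environ.pop('CUDA_VISIBLE_DEVICES', None)
```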