juglab / n2v

This is the implementation of Noise2Void training.

Issue with XLA devices #119

Open · snoreis opened this issue 2 years ago

snoreis commented 2 years ago

Hi All,

I'm having some trouble running N2V. I have a computer with an NVIDIA RTX A5000 and Ubuntu 18.04.

I use conda to install N2V according to:

$ conda create -n 'n2v' python=3.7
$ source activate n2v
$ conda install tensorflow-gpu=2.4.1 keras=2.3.1
$ pip install jupyter
$ pip install n2v

and then run the jupyter notebook given here.
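
As a quick sanity check (generic TensorFlow, not something from the n2v notebook itself), it can help to confirm in the same environment that the install is GPU-enabled and that the card is visible:

import tensorflow as tf

# Check that this TensorFlow build was compiled with CUDA support
print("Built with CUDA:", tf.test.is_built_with_cuda())

# List the GPUs TensorFlow can see; the A5000 should show up here
print("Visible GPUs:", tf.config.list_physical_devices('GPU'))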

Everything runs smoothly until I get to the line:

model = N2V(config, model_name, basedir=basedir)

which takes about 5 minutes to execute, and I get the following output:

/home/sam/miniconda3/envs/n2v/lib/python3.7/site-packages/n2v/models/n2v_standard.py:416: UserWarning: output path for model already exists, files may be overwritten: /home/sam/models/BSD68_reproducability_5x5
  'output path for model already exists, files may be overwritten: %s' % str(self.logdir.resolve()))
2022-03-04 10:33:12.171788: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-03-04 10:33:12.172307: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-03-04 10:33:12.205486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.205618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: NVIDIA RTX A5000 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 64 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 715.34GiB/s
2022-03-04 10:33:12.205630: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:33:12.206557: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-03-04 10:33:12.206579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-03-04 10:33:12.207523: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-04 10:33:12.207667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-04 10:33:12.208436: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-03-04 10:33:12.208839: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-03-04 10:33:12.210569: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-03-04 10:33:12.210663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.210860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.210941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-03-04 10:33:12.211272: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-04 10:33:12.212089: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: NVIDIA RTX A5000 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 64 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 715.34GiB/s
2022-03-04 10:33:12.212194: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:33:12.212205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-03-04 10:33:12.212212: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-03-04 10:33:12.212218: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-04 10:33:12.212224: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-04 10:33:12.212230: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-03-04 10:33:12.212236: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-03-04 10:33:12.212243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-03-04 10:33:12.212276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212384: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-03-04 10:33:12.212481: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:38:35.925719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-04 10:38:35.925741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2022-03-04 10:38:35.925746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2022-03-04 10:38:35.925939: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926283: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set.
Original config value was 0.
2022-03-04 10:38:35.926306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21899 MB memory) -> physical GPU (device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:61:00.0, compute capability: 8.6)
2022-03-04 10:38:35.926510: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
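
(Side note: as far as I understand, the "Not creating XLA devices, tf_xla_enable_xla_devices not set" lines are informational rather than errors. If one actually wants the XLA devices, the flag named in the message can be set before TensorFlow is imported, roughly like this:)

import os

# Assumption: the flag only takes effect if it is set before TensorFlow is imported.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_xla_devices"

import tensorflow as tf  # the "Not creating XLA devices" messages should then go away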

If I then continue with:

history = model.train(X, X_val)

I get the following output, after which it just stops:

/home/sam/miniconda3/envs/n2v/lib/python3.7/site-packages/n2v/models/n2v_standard.py:194: UserWarning: small number of validation images (only 0.1% of all images)
  warnings.warn("small number of validation images (only %.1f%% of all images)" % (100 * frac_val))

8 blind-spots will be generated per training patch of size (64, 64).

Preparing validation data: 100%|██████████████████████████████████████| 4/4 [00:00<00:00, 533.12it/s]

Epoch 1/200

2022-03-04 10:40:59.811781: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-03-04 10:40:59.830334: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3892985000 Hz
2022-03-04 10:41:00.455851: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
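
(To narrow down where the time goes, a tiny GPU operation outside of n2v can show whether the very first kernel launch is what stalls, for example because kernels have to be compiled for this GPU at runtime. A rough sketch:)

import time
import tensorflow as tf

# Time a single small matmul on the GPU. If this first launch alone takes
# minutes, the stall is in the CUDA/driver layer rather than in n2v itself.
with tf.device('/GPU:0'):
    x = tf.random.normal((1024, 1024))
    start = time.time()
    y = tf.matmul(x, x)
    _ = y.numpy()  # force the op to actually execute
print("First GPU matmul took %.1f s" % (time.time() - start))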

Any thoughts??

Thanks! Sam

snoreis commented 2 years ago

OK, so I think I've fixed some of it. I changed the N2V installation procedure to:

conda create --name n2v python=3.7
conda activate n2v
conda install -c anaconda cudatoolkit
conda install cudatoolkit=11.0.*
conda install cudnn=8.0.*
conda install jupyter
pip install tensorflow-gpu==2.4.1
pip install n2v

the key difference being that I used tensorflow-gpu instead of just tensorflow, which makes sense. Maybe that can be added to the documentation?
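
A quick way to double-check that the pip-installed tensorflow-gpu build matches the cudatoolkit and cudnn pins above is to print the versions it was built against (the get_build_info call and its key names are an assumption for this TF version, hence the defensive lookup):

import tensorflow as tf

# Report which CUDA / cuDNN versions this TensorFlow build expects,
# to compare against the conda-installed cudatoolkit/cudnn.
info = getattr(tf.sysconfig, "get_build_info", lambda: {})()
print("CUDA version TF was built against :", info.get("cuda_version"))
print("cuDNN version TF was built against:", info.get("cudnn_version"))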

However, the string of warning messages after model = N2V(config, model_name, basedir=basedir) still appears, although the cell now executes quickly.

Then, when I run the training, I get a repeating message of:

2022-03-04 19:57:19.446524: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

Training eventually finishes, but the messages cause it to move very slowly. I know that because if I run a low number of epochs and steps and then execute the training cell again once it has finished, the error messages go away and everything goes according to plan.

Hope this helps anyone else in need! And if anyone has an idea what to do with the "Your CUDA software stack is old." message, that would be of great help!

Sam