juglab / n2v

This is the implementation of Noise2Void training.

Issue with XLA devices #119

Open · snoreis opened this issue 2 years ago

snoreis commented 2 years ago

Hi All,

I'm having some trouble running N2V. I have a computer with an NVIDIA RTX A5000 and Ubuntu 18.04.

I use conda to install N2V according to:

$ conda create -n 'n2v' python=3.7
$ source activate n2v
$ conda install tensorflow-gpu=2.4.1 keras=2.3.1
$ pip install jupyter
$ pip install n2v

and then run the jupyter notebook given here.
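
As a quick sanity check (generic TensorFlow, not something from the n2v notebook itself), it can help to confirm in the same environment that the install is GPU-enabled and that the card is visible:

import tensorflow as tf

# Check that this TensorFlow build was compiled with CUDA support
print("Built with CUDA:", tf.test.is_built_with_cuda())

# List the GPUs TensorFlow can see; the A5000 should show up here
print("Visible GPUs:", tf.config.list_physical_devices('GPU'))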

Everything runs smoothly until I get to the line:

model = N2V(config, model_name, basedir=basedir)

which takes about 5 minutes to execute, and I get the following output:

/home/sam/miniconda3/envs/n2v/lib/python3.7/site-packages/n2v/models/n2v_standard.py:416: UserWarning: output path for model already exists, files may be overwritten: /home/sam/models/BSD68_reproducability_5x5
  'output path for model already exists, files may be overwritten: %s' % str(self.logdir.resolve()))
2022-03-04 10:33:12.171788: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-03-04 10:33:12.172307: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-03-04 10:33:12.205486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.205618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: NVIDIA RTX A5000 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 64 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 715.34GiB/s
2022-03-04 10:33:12.205630: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:33:12.206557: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-03-04 10:33:12.206579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-03-04 10:33:12.207523: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-04 10:33:12.207667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-04 10:33:12.208436: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-03-04 10:33:12.208839: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-03-04 10:33:12.210569: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-03-04 10:33:12.210663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.210860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.210941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-03-04 10:33:12.211272: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-04 10:33:12.212089: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:61:00.0 name: NVIDIA RTX A5000 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 64 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 715.34GiB/s
2022-03-04 10:33:12.212194: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:33:12.212205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2022-03-04 10:33:12.212212: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2022-03-04 10:33:12.212218: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-03-04 10:33:12.212224: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-03-04 10:33:12.212230: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-03-04 10:33:12.212236: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2022-03-04 10:33:12.212243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2022-03-04 10:33:12.212276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212384: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:33:12.212461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-03-04 10:33:12.212481: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2022-03-04 10:38:35.925719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-03-04 10:38:35.925741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2022-03-04 10:38:35.925746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2022-03-04 10:38:35.925939: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-03-04 10:38:35.926283: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set.
Original config value was 0.
2022-03-04 10:38:35.926306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21899 MB memory) -> physical GPU (device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:61:00.0, compute capability: 8.6)
2022-03-04 10:38:35.926510: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
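
(Side note: as far as I understand, the "Not creating XLA devices, tf_xla_enable_xla_devices not set" lines are informational rather than errors. If one actually wants the XLA devices, the flag named in the message can be set before TensorFlow is imported, roughly like this:)

import os

# Assumption: the flag only takes effect if it is set before TensorFlow is imported.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_xla_devices"

import tensorflow as tf  # the "Not creating XLA devices" messages should then go away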

If I then continue with:

history = model.train(X, X_val)

I get the following output, after which it just stops:

/home/sam/miniconda3/envs/n2v/lib/python3.7/site-packages/n2v/models/n2v_standard.py:194: UserWarning: small number of validation images (only 0.1% of all images)
  warnings.warn("small number of validation images (only %.1f%% of all images)" % (100 * frac_val))

8 blind-spots will be generated per training patch of size (64, 64).

Preparing validation data: 100%|██████████████████████████████████████| 4/4 [00:00<00:00, 533.12it/s]

Epoch 1/200

2022-03-04 10:40:59.811781: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-03-04 10:40:59.830334: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3892985000 Hz
2022-03-04 10:41:00.455851: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
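
(To narrow down where the time goes, a tiny GPU operation outside of n2v can show whether the very first kernel launch is what stalls, for example because kernels have to be compiled for this GPU at runtime. A rough sketch:)

import time
import tensorflow as tf

# Time a single small matmul on the GPU. If this first launch alone takes
# minutes, the stall is in the CUDA/driver layer rather than in n2v itself.
with tf.device('/GPU:0'):
    x = tf.random.normal((1024, 1024))
    start = time.time()
    y = tf.matmul(x, x)
    _ = y.numpy()  # force the op to actually execute
print("First GPU matmul took %.1f s" % (time.time() - start))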

Any thoughts??

Thanks! Sam

snoreis commented 2 years ago

OK, so I think I've fixed some of it. I changed the N2V installation procedure to:

conda create --name n2v python=3.7
conda activate n2v
conda install -c anaconda cudatoolkit
conda install cudatoolkit=11.0.*
conda install cudnn=8.0.*
conda install jupyter
pip install tensorflow-gpu==2.4.1
pip install n2v

the key difference being that I used tensorflow-gpu instead of just tensorflow, which makes sense. Maybe that can be added to the documentation?
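
A quick way to double-check that the pip-installed tensorflow-gpu build matches the cudatoolkit and cudnn pins above is to print the versions it was built against (the get_build_info call and its key names are an assumption for this TF version, hence the defensive lookup):

import tensorflow as tf

# Report which CUDA / cuDNN versions this TensorFlow build expects,
# to compare against the conda-installed cudatoolkit/cudnn.
info = getattr(tf.sysconfig, "get_build_info", lambda: {})()
print("CUDA version TF was built against :", info.get("cuda_version"))
print("cuDNN version TF was built against:", info.get("cudnn_version"))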

However, the string of warning messages after model = N2V(config, model_name, basedir=basedir) still appears, although the cell now executes quickly.

Then, when I run the training, I get a repeating message of:

2022-03-04 19:57:19.446524: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

Training eventually finishes, but the messages cause it to move very slowly. I know that because if I run a low number of epochs and steps and then execute the training cell again once it has finished, the error messages go away and everything goes according to plan.

Hope this helps anyone else in need! And if anyone has an idea what to do with the "Your CUDA software stack is old." message, that would be of great help!

Sam