neuronets / kwyk

Knowing what you know - Bayesian brain parcellation
https://doi.org/10.3389/fninf.2019.00067
Apache License 2.0

What could have led to CUDNN_STATUS_INTERNAL_ERROR? #16

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

It used to work on my laptop, but no longer. I suspect it is due to some interaction with the GPU also being used as the actual graphics card, with Xorg consuming too much memory (although the ~1.3GB requested is less than the ~2GB reported free), or something along those lines.

nvidia-smi

```shell
$> nvidia-smi
Mon Nov 11 09:55:21 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro T2000        Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8     3W /  N/A |   2297MiB /  3911MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     21824      G   /usr/lib/xorg/Xorg                           141MiB |
|    0     25467      G   /usr/lib/xorg/Xorg                          1670MiB |
|    0     25596      G   /usr/bin/gnome-shell                         180MiB |
|    0     27333      G   ...uest-channel-token=14439694130078186709   232MiB |
|    0     28802      G   /usr/lib/xorg/Xorg                             6MiB |
|    0     28899      G   /usr/bin/gnome-shell                           5MiB |
+-----------------------------------------------------------------------------+
```
the actual run via singularity

```shell
$> singularity run -e -B /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.430.50 -B /usr/lib/x86_64-linux-gnu/libcuda.so.1 neuronets-kwyk--version-0.4-gpu.sing raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out
Bayesian dropout functions have been loaded.
Your version: v0.4
Latest version: 0.4
++ Conforming volume to 1mm^3 voxels and size 256x256x256.
/opt/kwyk/freesurfer/bin/mri_convert: line 2: /opt/kwyk/freesurfer/sources.sh: No such file or directory
mri_convert.bin --conform raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz /tmp/tmpwtickiw9.nii.gz
$Id: mri_convert.c,v 1.226 2016/02/26 16:15:24 mreuter Exp $
reading from raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz...
TR=10.00, TE=0.00, TI=0.00, flip angle=0.00
i_ras = (0, -1, 0)
j_ras = (0, 0, 1)
k_ras = (1, 0, 0)
changing data type from float to uchar (noscale = 0)...
MRIchangeType: Building histogram
Reslicing using trilinear interpolation
writing to /tmp/tmpwtickiw9.nii.gz...
++ Running forward pass of model.
2019-11-11 14:57:43.820728: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-11 14:57:43.916219: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-11 14:57:43.916394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro T2000 major: 7 minor: 5 memoryClockRate(GHz): 1.5
pciBusID: 0000:01:00.0
totalMemory: 3.82GiB freeMemory: 1.41GiB
2019-11-11 14:57:43.916409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-11 14:57:44.267550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-11 14:57:44.267570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-11-11 14:57:44.267575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-11-11 14:57:44.267684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1246 MB memory) -> physical GPU (device: 0, name: Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
Normalizer being used
-5.8382284e-08 1.0000015
 0/64 [..............................] - ETA: 0s2019-11-11 14:57:46.303925: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-11 14:57:46.314172: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node layer_1/conv3d/Conv3D}} = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/kwyk", line 11, in <module>
    load_entry_point('kwyk', 'console_scripts', 'kwyk')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/kwyk/kwyk/cli.py", line 92, in predict
    normalizer=zscore)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 348, in predict_from_filepath
    batch_size=batch_size)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 275, in predict_from_img
    batch_size=batch_size)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 186, in predict_from_array
    new_prediction = predictor(
        {'volume': features[j:j + batch_size]})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/predictor.py", line 77, in __call__
    return self._session.run(fetches=self.fetch_tensors, feed_dict=feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node layer_1/conv3d/Conv3D (defined at /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py:153) = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]

Caused by op 'layer_1/conv3d/Conv3D', defined at:
  File "/usr/local/bin/kwyk", line 11, in <module>
    load_entry_point('kwyk', 'console_scripts', 'kwyk')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/kwyk/kwyk/cli.py", line 83, in predict
    predictor = _get_predictor(savedmodel_path)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 406, in _get_predictor
    predictor = tf.contrib.predictor.from_saved_model(str(path))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/predictor_factories.py", line 153, in from_saved_model
    config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py", line 153, in __init__
    loader.load(self._session, tags.split(','), export_dir)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 197, in load
    return loader.load(sess, tags, import_scope, **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 350, in load
    **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 278, in load_graph
    meta_graph_def, import_scope=import_scope, **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1696, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3440, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3440, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3299, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node layer_1/conv3d/Conv3D (defined at /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py:153) = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]
```
satra commented 5 years ago

instead of this:

singularity run -e -B /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.430.50 \
-B /usr/lib/x86_64-linux-gnu/libcuda.so.1 neuronets-kwyk--version-0.4-gpu.sing \
raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out

can you try:

singularity run -e --nv neuronets-kwyk--version-0.4-gpu.sing \
raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out
yarikoptic commented 5 years ago

With --nv it used to halt; now (there is a bit more free memory) it proceeds, but ends in the same crash.

I found http://tuxvoid.blogspot.com/2017/08/tensorflow-could-not-create-cudnn.html (referenced from https://github.com/tensorflow/tensorflow/issues/14048), which suggests that configuring TensorFlow with allow_growth, i.e.

import tensorflow as tf  # TF 1.x API, as used inside the container

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
sess = tf.Session(config=config)

might help, but I could not figure out where in kwyk or nobrainer to tune that.
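
For reference, the traceback above shows the predictor being created in nobrainer/predict.py (`_get_predictor`, line 406) via `tf.contrib.predictor.from_saved_model`, which does take a `config` argument (the same `config=config` seen in `predictor_factories.py` in the traceback). A minimal, untested sketch of what such a local patch could look like, assuming nothing else in `_get_predictor` needs to change:

```python
# Untested sketch of a possible local patch to nobrainer/predict.py::_get_predictor,
# based only on the call visible in the traceback above.
import tensorflow as tf

def _get_predictor(path):
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # grow GPU memory use on demand instead of pre-allocating
    # from_saved_model forwards `config` to the tf.Session it creates for the predictor.
    return tf.contrib.predictor.from_saved_model(str(path), config=config)
```

Whether allow_growth alone frees enough memory on a GPU that Xorg is also using is a separate question, but it would at least rule the allocator in or out.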