Open aneeshjain opened 3 years ago
Enable debug logging in TF and watch nvidia-smi while the job runs. You might need to update CUDA, cuDNN, or both. Being stuck likely means the driver cannot reach the hardware (the GPU) and sits there waiting for access. https://www.tensorflow.org/api_docs/python/tf/data/experimental/enable_debug_mode
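To make "watch nvidia-smi" a bit more systematic, here is a minimal sketch that samples GPU utilization and memory from Python. It assumes `nvidia-smi` is on `PATH` and uses its `--query-gpu` CSV output. The helper names `parse_gpu_stats` and `sample_gpu_stats` are made up for illustration and are not part of TensorFlow or any NVIDIA library.

```python
import shutil
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def _to_int(field):
    """Convert one CSV field to int; return None for values such as '[N/A]'."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (_to_int(field) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def sample_gpu_stats():
    """Return current stats for every GPU, or None if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `sample_gpu_stats()` every few seconds from a second terminal while the training script sits at step 0 shows whether memory has been allocated and whether utilization ever moves off zero. That narrows things down but does not by itself prove where the process is stuck.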
@vitalyli I enabled the debug log by adding tf.data.experimental.enable_debug_mode() as suggested. However, I did not see anything different in the output that would help me diagnose the problem. Below is a snapshot of nvidia-smi at the moment the program gets stuck.
I managed to get the debugger working and got the following error along with multiple deprecation warnings:
```
    labels, mode, params, config)
  File "/home/aneesh/.local/lib/python3.7/site-packages/tensorflow_ranking/python/model.py", line 385, in _compute_logits_impl
    config)
  File "tf_ranking_libsvm.py", line 333, in _score_fn
    input_layer, training=is_training)
  File "/usr/local/lib/python3.7/dist-packages/keras/legacy_tf_layers/normalization.py", line 454, in batch_normalization
    return layer.apply(inputs, training=training)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_v1.py", line 1679, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/keras/legacy_tf_layers/base.py", line 567, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_v1.py", line 745, in __call__
    self._maybe_build(inputs)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_v1.py", line 2068, in _maybe_build
    self.build(input_shapes)
  File "/usr/local/lib/python3.7/dist-packages/keras/layers/normalization/batch_normalization.py", line 373, in build
    (tuple(input_shape), self.axis))
ValueError: Input has undefined `axis` dimension. Received input with shape (None, None). Axis value: [1]
INFO:tensorflow:Disabled dumping callback in thread MainThread (dump root: /tmp/my-tfdbg-dumps)
I1014 20:08:00.762151 139665619777344 dumping_callback.py:897] Disabled dumping callback in thread MainThread (dump root: /tmp/my-tfdbg-dumps)
```
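For anyone reading the `ValueError` above: batch normalization keeps one set of statistics per feature, so it needs a static size for the axis it normalizes over when the layer is built. A shape of `(None, None)` means both the batch and the feature dimension were unknown at that point. The sketch below is a simplified re-creation of that kind of check in plain Python. It is not the actual Keras code, and `check_normalized_axes` is a hypothetical name.

```python
def check_normalized_axes(input_shape, axes):
    """Return the static sizes of `axes`, raising if any of them is None."""
    for axis in axes:
        if input_shape[axis] is None:
            raise ValueError(
                "Input has undefined `axis` dimension. Received input with "
                "shape {}. Axis value: {}".format(tuple(input_shape), list(axes))
            )
    return [input_shape[axis] for axis in axes]


# A known feature dimension passes; 700 mirrors --num_features in the command.
print(check_normalized_axes((None, 700), [1]))

# An unknown feature dimension reproduces the message from the traceback.
try:
    check_normalized_axes((None, None), [1])
except ValueError as err:
    print(err)
```

If this is what happens in `_score_fn`, giving `input_layer` a static feature dimension before the batch normalization call would presumably avoid this particular error. That is an untested assumption, and the traceback does not show whether this error is related to the hang at step 0.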
Anything in that thread dump? It's likely stuck somewhere in the driver. To be clear, I have not experienced this problem with my GPUs, but it looks like something inside the driver. It may make sense to reach out to Nvidia.
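One way to get a thread dump from a Python process that appears hung is the standard-library `faulthandler` module. The sketch below is a suggestion and not something tried in this thread: it arms a watchdog that prints every thread's Python stack after a timeout. The wrapper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 300-second default are arbitrary choices for illustration.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=300.0, out=sys.stderr):
    """Dump every thread's Python stack to `out` each time `timeout_s` elapses.

    `out` must be a real file object with a file descriptor.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def disarm_hang_watchdog():
    """Cancel the pending dump, for example once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` near the top of the training script would print the stacks if step 0 never completes. The dump contains Python frames only, so if the process is blocked inside native CUDA or driver code it shows the Python call that entered that code and not the native frames themselves.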
I've been trying to run some experiments using the Yahoo! LTR dataset. When I ran tf_ranking_libsvm.py on a CPU it worked fine, but ever since I started running it on the GPU it starts and then gets stuck at the 0th step.
I've been using the following build versions:
tensorboard==2.6.0 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.0 tensorflow==2.6.0 tensorflow-addons==0.14.0 tensorflow-datasets==4.4.0 tensorflow-estimator==2.6.0 tensorflow-hub==0.12.0 tensorflow-metadata==1.2.0 tensorflow-model-optimization==0.7.0 tensorflow-ranking==0.4.2 tensorflow-serving-api==2.6.0
python3.6 tf_ranking_libsvm.py --test_path ltrc_yahoo/small.test.txt --train_path ltrc_yahoo/small.train.txt --vali_path ltrc_yahoo/small.valid.txt --output_dir ./gpu_test_outputs --num_features 700 --train_batch_size 8 --num_train_steps 5000 --list_size 100 --hidden_layer_dims 64,32,8
What could be a possible reason for this?