tensorflow / ranking

Learning to Rank in TensorFlow

Training gets stuck at first step when running tf_ranking_libsvm.py on a GPU #292

Open · aneeshjain opened this issue 3 years ago

aneeshjain commented 3 years ago

I've been trying to run some experiments using the Yahoo! LTR dataset. When I run tf_ranking_libsvm.py on a CPU it works fine, but ever since I started running it on the GPU it starts up and then gets stuck at step 0.

I've been using the following build versions, and ran the script with the command shown below:

tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.6.0
tensorflow-addons==0.14.0
tensorflow-datasets==4.4.0
tensorflow-estimator==2.6.0
tensorflow-hub==0.12.0
tensorflow-metadata==1.2.0
tensorflow-model-optimization==0.7.0
tensorflow-ranking==0.4.2
tensorflow-serving-api==2.6.0

python3.6 tf_ranking_libsvm.py --test_path ltrc_yahoo/small.test.txt --train_path ltrc_yahoo/small.train.txt --vali_path ltrc_yahoo/small.valid.txt --output_dir ./gpu_test_outputs --num_features 700 --train_batch_size 8 --num_train_steps 5000 --list_size 100 --hidden_layer_dims 64,32,8

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W1006 20:53:52.290486 140366646597440 deprecation.py:548] From /home/aneesh/.local/lib/python3.6/site-packages/tensorflow/python/training/adagrad.py:77: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
I1006 20:53:52.361888 140366646597440 estimator.py:1165] Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
I1006 20:53:52.363579 140366646597440 basic_session_run_hooks.py:546] Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
I1006 20:53:55.782585 140366646597440 monitored_session.py:247] Graph was finalized.
2021-10-06 20:53:55.783238: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-06 20:53:55.784444: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 20:53:55.797132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 20:53:55.798136: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 20:58:13.921535: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 20:58:13.923502: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 20:58:13.924767: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-06 20:58:13.926226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10674 MB memory:  -> device: 0, name: GRID P40-12Q, pci bus id: 0000:00:0a.0, compute capability: 6.1
INFO:tensorflow:Running local_init_op.
I1006 20:58:25.759820 140366646597440 session_manager.py:531] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1006 20:58:25.994846 140366646597440 session_manager.py:534] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
I1006 20:58:36.989099 140366646597440 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into ./gpu_test_outputs/model.ckpt.
I1006 20:58:37.009317 140366646597440 basic_session_run_hooks.py:618] Saving checkpoints for 0 into ./gpu_test_outputs/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
I1006 20:58:49.608218 140366646597440 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 1.0907192, step = 0
I1006 21:05:17.852562 140366646597440 basic_session_run_hooks.py:262] loss = 1.0907192, step = 0

What could be a possible reason for this?

vitalyli commented 3 years ago

Enable the debug log in TF and watch nvidia-smi. You might need to update CUDA, cuDNN, or both. Being stuck usually means the driver is unable to reach the hardware (the GPU) and sits there waiting for access. https://www.tensorflow.org/api_docs/python/tf/data/experimental/enable_debug_mode
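
For example, a minimal sketch of turning the extra logging on before the estimator is built (the environment variable and the device-placement call are my assumptions, not part of the original script):

```python
import os

# Verbose C++ runtime logging (0 = show all messages). Assumption: set this
# before TensorFlow is imported so the runtime picks it up.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import tensorflow as tf

# Log which device (CPU or GPU) each op gets placed on.
tf.debugging.set_log_device_placement(True)

# Run tf.data pipelines eagerly so failures surface with a usable stack trace.
tf.data.experimental.enable_debug_mode()
```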

aneeshjain commented 3 years ago

@vitalyli I did enable the debug log by adding tf.data.experimental.enable_debug_mode() as suggested. However, I did not see anything in the output that would help me diagnose the problem. Below is a snapshot of nvidia-smi at the moment the program gets stuck.

[screenshot: nvidia-smi output while the program is stuck, 2021-10-14]
aneeshjain commented 3 years ago

I managed to get the debugger working and got the following error, along with multiple deprecation warnings:


    labels, mode, params, config)
  File "/home/aneesh/.local/lib/python3.7/site-packages/tensorflow_ranking/python/model.py", line 385, in _compute_logits_impl
    config)
  File "tf_ranking_libsvm.py", line 333, in _score_fn
    input_layer, training=is_training)
  File "/usr/local/lib/python3.7/dist-packages/keras/legacy_tf_layers/normalization.py", line 454, in batch_normalization
    return layer.apply(inputs, training=training)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_v1.py", line 1679, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/keras/legacy_tf_layers/base.py", line 567, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_v1.py", line 745, in __call__
    self._maybe_build(inputs)
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer_v1.py", line 2068, in _maybe_build
    self.build(input_shapes)
  File "/usr/local/lib/python3.7/dist-packages/keras/layers/normalization/batch_normalization.py", line 373, in build
    (tuple(input_shape), self.axis))
ValueError: Input has undefined `axis` dimension. Received input with shape (None, None). Axis value: [1]
INFO:tensorflow:Disabled dumping callback in thread MainThread (dump root: /tmp/my-tfdbg-dumps)
I1014 20:08:00.762151 139665619777344 dumping_callback.py:897] Disabled dumping callback in thread MainThread (dump root: /tmp/my-tfdbg-dumps)
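
For context, the ValueError above comes from building a batch-normalization layer on a tensor whose feature dimension is not statically known at build time. A minimal sketch that reproduces it; the reshape workaround in the comments is an assumption on my part (based on the --num_features 700 flag above), not a confirmed fix:

```python
import tensorflow as tf

# Minimal reproduction (a sketch, not the original script): BatchNormalization
# needs its `axis` dimension to be statically known when the layer is built.
x = tf.keras.Input(shape=(None,))  # static shape is (None, None)
try:
    tf.keras.layers.BatchNormalization(axis=1)(x)
except ValueError as e:
    print(e)  # Input has undefined `axis` dimension. Received input with shape (None, None). ...

# Possible workaround (assumption, using the 700 LibSVM features from the
# command line above): give the score function's input a static feature
# dimension before the batch-norm call, e.g. inside _score_fn:
#   cur = tf.reshape(input_layer, [-1, 700])
#   cur = tf.compat.v1.layers.batch_normalization(cur, training=is_training)
```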
vitalyli commented 3 years ago

Anything in that thread dump? It's stuck somewhere, likely in the driver. Just to be clear, I have not experienced this problem with my GPUs, but it sounds like something inside the driver. It may make sense to reach out to NVIDIA.
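
One way to capture such a thread dump from the stuck process (a sketch on my part, not something from this thread) is to register a faulthandler signal handler early in tf_ranking_libsvm.py and then signal the process from another shell once it hangs:

```python
import faulthandler
import signal

# Dump every Python thread's stack to stderr when the process receives
# SIGUSR1, e.g. `kill -USR1 <pid>` from another terminal while the
# training job appears stuck.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

If every stack is sitting inside the session run call, the hang is below Python, which would be consistent with a runtime or driver problem as suggested above.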