Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED

JoeAWilde commented 4 years ago

Hi,

I am running into a problem when trying to train SLEAP on the GPU. The error appears after I hit the Run button in the training window. The first epoch begins, but the timer freezes, and when I check the Anaconda window, the problem seems to come from: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED

I am running on Windows 10.0.18363 with a GTX 1660 SUPER. Below is the full output from the anaconda window:

(sleap_env) C:\Users\Joe>sleap-label
2020-09-03 13:51:27.409473: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Resetting monitor window.
Polling: F:/OneDrive - University of Exeter/Pose estimation/SLEAP\models\single_test200903_135236.single_instance.21\viz\validation.*.png
Start training single_instance...
['sleap-train', 'C:\\Users\\Joe\\AppData\\Local\\Temp\\tmpj_wp00tn\\200903_135236_training_job.json', 'F:/OneDrive - University of Exeter/Pose estimation/SLEAP/test_project.slp', '--zmq', '--save_viz', '--tensorboard', '--video-paths', 'F:/OneDrive - University of Exeter/Pose estimation/SLEAP/mp4 wave.mp4']
2020-09-03 13:52:39.584900: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
INFO:sleap.nn.training:Training labels file: F:/OneDrive - University of Exeter/Pose estimation/SLEAP/test_project.slp
INFO:sleap.nn.training:Training profile: C:\Users\Joe\AppData\Local\Temp\tmpj_wp00tn\200903_135236_training_job.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
    "training_job_path": "C:\\Users\\Joe\\AppData\\Local\\Temp\\tmpj_wp00tn\\200903_135236_training_job.json",
    "labels_path": "F:/OneDrive - University of Exeter/Pose estimation/SLEAP/test_project.slp",
    "video_paths": "F:/OneDrive - University of Exeter/Pose estimation/SLEAP/mp4 wave.mp4",
    "val_labels": null,
    "test_labels": null,
    "tensorboard": true,
    "save_viz": true,
    "zmq": true,
    "run_name": "",
    "prefix": "",
    "suffix": ""
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
    "data": {
        "labels": {
            "training_labels": null,
            "validation_labels": null,
            "validation_fraction": 0.1,
            "test_labels": null,
            "search_path_hints": [],
            "skeletons": []
        },
        "preprocessing": {
            "ensure_rgb": false,
            "ensure_grayscale": false,
            "imagenet_mode": null,
            "input_scaling": 1.0,
            "pad_to_stride": null
        },
        "instance_cropping": {
            "center_on_part": null,
            "crop_size": null,
            "crop_size_detection_padding": 16
        }
    },
    "model": {
        "backbone": {
            "leap": null,
            "unet": {
                "stem_stride": null,
                "max_stride": 32,
                "output_stride": 4,
                "filters": 16,
                "filters_rate": 2.0,
                "middle_block": true,
                "up_interpolate": true,
                "stacks": 1
            },
            "hourglass": null,
            "resnet": null
        },
        "heads": {
            "single_instance": {
                "part_names": null,
                "sigma": 5.0,
                "output_stride": 4
            },
            "centroid": null,
            "centered_instance": null,
            "multi_instance": null
        }
    },
    "optimization": {
        "preload_data": true,
        "augmentation_config": {
            "rotate": true,
            "rotation_min_angle": -180.0,
            "rotation_max_angle": 180.0,
            "translate": false,
            "translate_min": -5,
            "translate_max": 5,
            "scale": false,
            "scale_min": 0.9,
            "scale_max": 1.1,
            "uniform_noise": false,
            "uniform_noise_min_val": 0.0,
            "uniform_noise_max_val": 10.0,
            "gaussian_noise": false,
            "gaussian_noise_mean": 5.0,
            "gaussian_noise_stddev": 1.0,
            "contrast": false,
            "contrast_min_gamma": 0.5,
            "contrast_max_gamma": 2.0,
            "brightness": false,
            "brightness_min_val": 0.0,
            "brightness_max_val": 10.0
        },
        "online_shuffling": true,
        "shuffle_buffer_size": 128,
        "prefetch": true,
        "batch_size": 4,
        "batches_per_epoch": null,
        "min_batches_per_epoch": 200,
        "val_batches_per_epoch": null,
        "min_val_batches_per_epoch": 10,
        "epochs": 75,
        "optimizer": "adam",
        "initial_learning_rate": 0.0001,
        "learning_rate_schedule": {
            "reduce_on_plateau": true,
            "reduction_factor": 0.5,
            "plateau_min_delta": 1e-06,
            "plateau_patience": 5,
            "plateau_cooldown": 3,
            "min_learning_rate": 1e-08
        },
        "hard_keypoint_mining": {
            "online_mining": false,
            "hard_to_easy_ratio": 2.0,
            "min_hard_keypoints": 2,
            "max_hard_keypoints": null,
            "loss_scale": 5.0
        },
        "early_stopping": {
            "stop_training_on_plateau": true,
            "plateau_min_delta": 1e-06,
            "plateau_patience": 10
        }
    },
    "outputs": {
        "save_outputs": true,
        "run_name": "200903_135236.single_instance.21",
        "run_name_prefix": "single_test",
        "run_name_suffix": "",
        "runs_folder": "F:/OneDrive - University of Exeter/Pose estimation/SLEAP\\models",
        "tags": [
            ""
        ],
        "save_visualizations": true,
        "log_to_csv": true,
        "checkpointing": {
            "initial_model": false,
            "best_model": true,
            "every_epoch": false,
            "latest_model": false,
            "final_model": false
        },
        "tensorboard": {
            "write_logs": true,
            "loss_frequency": "epoch",
            "architecture_graph": false,
            "profile_graph": false,
            "visualizations": true
        },
        "zmq": {
            "subscribe_to_controller": true,
            "controller_address": "tcp://127.0.0.1:9000",
            "controller_polling_timeout": 10,
            "publish_updates": true,
            "publish_address": "tcp://127.0.0.1:9001"
        }
    }
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
video search paths:  ['F:/OneDrive - University of Exeter/Pose estimation/SLEAP/mp4 wave.mp4']
[Video(backend=MediaVideo(filename='F:/OneDrive - University of Exeter/Pose estimation/SLEAP/mp4 wave.mp4', grayscale=False, bgr=True, dataset='', input_format=''))]
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2020-09-03 13:52:42.817292: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-09-03 13:52:42.869281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 SUPER computeCapability: 7.5
coreClock: 1.83GHz coreCount: 22 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
2020-09-03 13:52:42.878761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-09-03 13:52:42.898613: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-09-03 13:52:42.925128: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-09-03 13:52:42.933458: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-09-03 13:52:42.952424: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-09-03 13:52:42.963969: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-09-03 13:52:43.011865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-09-03 13:52:43.016517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-09-03 13:52:43.023238: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-09-03 13:52:43.032389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 SUPER computeCapability: 7.5
coreClock: 1.83GHz coreCount: 22 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
2020-09-03 13:52:43.042648: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-09-03 13:52:43.048854: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-09-03 13:52:43.054194: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-09-03 13:52:43.059211: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-09-03 13:52:43.064287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-09-03 13:52:43.070704: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-09-03 13:52:43.076904: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-09-03 13:52:43.081903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-09-03 13:52:45.659123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-03 13:52:45.665172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-09-03 13:52:45.668850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-09-03 13:52:45.674492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4628 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:sleap.nn.training:Loaded test example. [5.407s]
INFO:sleap.nn.training:  Input shape: (1088, 1920, 3)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training:  Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=5, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training:  Max stride: 32
INFO:sleap.nn.training:  Parameters: 7,816,598
INFO:sleap.nn.training:  Heads:
INFO:sleap.nn.training:  heads[0] = SingleInstanceConfmapsHead(part_names=['maj_tip', 'maj_elbow', 'maj_shoulder', 'min_shoulder', 'min_elbow', 'min_tip'], sigma=5.0, output_stride=4, loss_weight=1.0)
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 18
INFO:sleap.nn.training:Validation set: n = 3
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training:  Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training:  Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-06, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training:  ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training:  ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: F:/OneDrive - University of Exeter/Pose estimation/SLEAP\models\single_test200903_135236.single_instance.21
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [9.6s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [9.7s]
INFO:sleap.nn.training:Starting training loop...
Train for 200 steps, validate for 10 steps
Epoch 1/75
2020-09-03 13:53:04.505119: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-09-03 13:53:06.684305: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-09-03 13:53:06.771857: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-09-03 13:53:06.778982: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node model/stack0_enc0_conv0/Conv2D}}]]
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: lr
Traceback (most recent call last):
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\sleap\nn\monitor.py", line 401, in check_messages
    msg["logs"]["loss"],
KeyError: 'loss'
2020-09-03 13:53:07.330078: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-09-03 13:53:07.371307: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-09-03 13:53:07.382405: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at iterator_ops.cc:941 : Unknown: 2 root error(s) found.
  (0) Unknown: {{function_node __inference_Dataset_map_predict_1111}} Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node model/stack0_enc0_conv0/Conv2D}}]]
  (1) Unknown: {{function_node __inference_Dataset_map_predict_1111}} Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node model/stack0_enc0_conv0/Conv2D}}]]
         [[cond/output/_6/_160]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\eager\context.py", line 1897, in execution_mode
    yield
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py", line 659, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\ops\gen_dataset_ops.py", line 2479, in iterator_get_next_sync
    _ops.raise_from_not_ok_status(e, name)
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\framework\ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: {{function_node __inference_Dataset_map_predict_1111}} Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node model/stack0_enc0_conv0/Conv2D}}]]
  (1) Unknown: {{function_node __inference_Dataset_map_predict_1111}} Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node model/stack0_enc0_conv0/Conv2D}}]]
         [[cond/output/_6/_160]]
0 successful operations.
0 derived errors ignored. [Op:IteratorGetNextSync]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Joe\anaconda3\envs\sleap_env\Scripts\sleap-train-script.py", line 33, in <module>
    sys.exit(load_entry_point('sleap==1.0.8', 'console_scripts', 'sleap-train')())
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\sleap\nn\training.py", line 1371, in main
    trainer.train()
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\sleap\nn\training.py", line 812, in train
    verbose=2,
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 397, in fit
    prefix='val_')
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 771, in on_epoch
    self.callbacks.on_epoch_end(epoch, epoch_logs)
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 302, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\sleap\nn\callbacks.py", line 280, in on_epoch_end
    figure = self.plot_fn()
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\sleap\nn\training.py", line 1022, in <lambda>
    viz_fn=lambda: visualize_example(next(training_viz_ds_iter)),
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py", line 630, in __next__
    return self.next()
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py", line 674, in next
    return self._next_internal()
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py", line 665, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\eager\context.py", line 1900, in execution_mode
    executor_new.wait()
  File "C:\Users\Joe\anaconda3\envs\sleap_env\lib\site-packages\tensorflow_core\python\eager\executor.py", line 67, in wait
    pywrap_tensorflow.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: {{function_node __inference_Dataset_map_predict_1111}} Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node model/stack0_enc0_conv0/Conv2D}}]]
  (1) Unknown: {{function_node __inference_Dataset_map_predict_1111}} Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node model/stack0_enc0_conv0/Conv2D}}]]
         [[cond/output/_6/_160]]
0 successful operations.
0 derived errors ignored.
INFO:sleap.nn.callbacks:Closing the reporter controller/context.
INFO:sleap.nn.callbacks:Closing the training controller socket/context.

talmo commented 4 years ago

Hi @JoeAWilde,

Your GPU might be a little short on memory for your frame size. Try this:

Reboot just to make sure there's nothing hogging your GPU memory in the background.
Try reducing the batch size to 1 for training.
Try reducing the input scaling to 0.75 or 0.5 if your images have enough resolution.
Try reducing the max stride of the model to 16.
If none of that works, switch to the top-down approach. It still works in the single animal case and may save some memory if your animal is small relative to the size of the frame.
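
The settings above map onto fields that appear in the training job JSON printed in the log. As a rough sketch only (the helper names `apply_low_memory_overrides` and `first_conv_activation_mib` are hypothetical and not part of SLEAP), the following applies those three changes to a loaded profile and estimates the size of a single full-resolution activation tensor. The estimate assumes float32 values and a first convolution with 16 filters and "same" padding, so it illustrates the scaling and is not a measurement of total memory use.

```python
import copy


def apply_low_memory_overrides(job, batch_size=1, input_scaling=0.5, max_stride=16):
    """Return a copy of a training job dict with the memory-saving settings applied.

    Hypothetical helper: the key paths follow the JSON shown in the log above.
    """
    patched = copy.deepcopy(job)  # leave the caller's dict untouched
    patched["optimization"]["batch_size"] = batch_size
    patched["data"]["preprocessing"]["input_scaling"] = input_scaling
    patched["model"]["backbone"]["unet"]["max_stride"] = max_stride
    return patched


def first_conv_activation_mib(height, width, batch_size, input_scaling, filters=16):
    """Rough size in MiB of one float32 activation tensor after the first conv.

    Assumption: the output keeps the (scaled) input resolution and has
    `filters` channels; real training holds many such tensors at once.
    """
    h = int(height * input_scaling)
    w = int(width * input_scaling)
    return batch_size * h * w * filters * 4 / 2**20


# Minimal stand-in for the profile from the log (only the fields touched here).
job = {
    "data": {"preprocessing": {"input_scaling": 1.0}},
    "model": {"backbone": {"unet": {"max_stride": 32}}},
    "optimization": {"batch_size": 4},
}
patched = apply_low_memory_overrides(job)
print(patched["optimization"]["batch_size"])
print(first_conv_activation_mib(1088, 1920, batch_size=4, input_scaling=1.0))
print(first_conv_activation_mib(1088, 1920, batch_size=1, input_scaling=0.5))
```

With the 1088 x 1920 input reported in the log, that one tensor shrinks from 510 MiB at batch size 4 and scale 1.0 to about 32 MiB at batch size 1 and scale 0.5, which is why these two settings are usually the first things to try.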

JoeAWilde commented 4 years ago

Hi @talmo

Thanks for the response! I have tried your suggestions (reducing batch size to 1, reducing scaling to 0.5, reducing max stride to 16, switching to top-down approach) and I keep getting the same problem. I have also cropped the test video I am using so the animal is larger relative to frame size, and I have lowered the resolution of the video. All of these result in the same problem outlined above.

talmo commented 4 years ago

Hi @JoeAWilde,

I see. One more thing to try then -- try updating to SLEAP 1.0.9:

pip install --upgrade sleap==1.0.9

This should incorporate a change where the GPU memory is no longer pre-allocated by default, which I believe should prevent this issue on the 1660. Give it a shot and let me know if it's still not working.
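
For anyone who cannot upgrade, the same idea can be tried directly in TensorFlow. This is a minimal sketch of TensorFlow's own on-demand allocation switches, not SLEAP's code; whether SLEAP 1.0.9 uses exactly these calls is an assumption. The function name `enable_gpu_memory_growth` is made up for illustration, and it has to run before TensorFlow initializes the GPU.

```python
import os


def enable_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it up front.

    Returns the names of the GPUs that were configured (empty if none or if
    TensorFlow is not installed).
    """
    # Environment switch read by TensorFlow when it sets up the GPU.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    try:
        import tensorflow as tf
    except ImportError:
        return []  # only the environment variable is set in this case

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Raises RuntimeError if the GPU has already been initialized.
        tf.config.experimental.set_memory_growth(gpu, True)
    return [gpu.name for gpu in gpus]


configured = enable_gpu_memory_growth()
print(configured)
```

Without memory growth, TensorFlow reserves most of the card at startup (the log shows 4628 MB of the 6 GiB), which can leave too little free memory for cuDNN to create its handle.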

Cheers,

Talmo

JoeAWilde commented 4 years ago

@talmo

I upgraded to SLEAP 1.0.9 and it's now working! Thanks for the help.

Cheers, Joe.

talmolab / sleap

Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED #395