training doesn't proceed past epoch 1 on GPU workstation

panichem commented 2 years ago

Hey @talmo !

I'm working through the SLEAP tutorial on a PC with a decent GPU. The initial training step is taking a very long time, though: it doesn't get past epoch 1.

Here's the dump from the terminal. I can't really see anything weird. Any idea what I need to do differently?

(sleap) C:\Users\moorelab>sleap-label
Saving config: C:\Users\moorelab/.sleap/1.2.3/preferences.yaml
Restoring GUI state...

Software versions:
SLEAP: 1.2.3
TensorFlow: 2.6.3
Numpy: 1.19.5
Python: 3.7.12
OS: Windows-10-10.0.19041-SP0

Happy SLEAPing! :)
Resetting monitor window.
Polling: C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation\models\220516_200359.single_instance.n=26\viz\validation.*.png
Start training single_instance...
['sleap-train', 'C:\\Users\\moorelab\\AppData\\Local\\Temp\\tmpgdzdwt6i\\220516_200359_training_job.json', 'C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation/JoplinLick.v000.slp', '--zmq', '--save_viz']
INFO:sleap.nn.training:Versions:
SLEAP: 1.2.3
TensorFlow: 2.6.3
Numpy: 1.19.5
Python: 3.7.12
OS: Windows-10-10.0.19041-SP0
INFO:sleap.nn.training:Training labels file: C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation/JoplinLick.v000.slp
INFO:sleap.nn.training:Training profile: C:\Users\moorelab\AppData\Local\Temp\tmpgdzdwt6i\220516_200359_training_job.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
    "training_job_path": "C:\\Users\\moorelab\\AppData\\Local\\Temp\\tmpgdzdwt6i\\220516_200359_training_job.json",
    "labels_path": "C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation/JoplinLick.v000.slp",
    "video_paths": [
        ""
    ],
    "val_labels": null,
    "test_labels": null,
    "tensorboard": false,
    "save_viz": true,
    "zmq": true,
    "run_name": "",
    "prefix": "",
    "suffix": "",
    "cpu": false,
    "first_gpu": false,
    "last_gpu": false,
    "gpu": 0
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
    "data": {
        "labels": {
            "training_labels": null,
            "validation_labels": null,
            "validation_fraction": 0.1,
            "test_labels": null,
            "split_by_inds": false,
            "training_inds": null,
            "validation_inds": null,
            "test_inds": null,
            "search_path_hints": [],
            "skeletons": []
        },
        "preprocessing": {
            "ensure_rgb": false,
            "ensure_grayscale": false,
            "imagenet_mode": null,
            "input_scaling": 1.0,
            "pad_to_stride": null,
            "resize_and_pad_to_target": true,
            "target_height": null,
            "target_width": null
        },
        "instance_cropping": {
            "center_on_part": null,
            "crop_size": null,
            "crop_size_detection_padding": 16
        }
    },
    "model": {
        "backbone": {
            "leap": null,
            "unet": {
                "stem_stride": null,
                "max_stride": 16,
                "output_stride": 2,
                "filters": 16,
                "filters_rate": 2.0,
                "middle_block": true,
                "up_interpolate": true,
                "stacks": 1
            },
            "hourglass": null,
            "resnet": null,
            "pretrained_encoder": null
        },
        "heads": {
            "single_instance": {
                "part_names": null,
                "sigma": 2.5,
                "output_stride": 2,
                "loss_weight": 1.0,
                "offset_refinement": false
            },
            "centroid": null,
            "centered_instance": null,
            "multi_instance": null,
            "multi_class_bottomup": null,
            "multi_class_topdown": null
        }
    },
    "optimization": {
        "preload_data": true,
        "augmentation_config": {
            "rotate": true,
            "rotation_min_angle": -15.0,
            "rotation_max_angle": 15.0,
            "translate": false,
            "translate_min": -5,
            "translate_max": 5,
            "scale": false,
            "scale_min": 0.9,
            "scale_max": 1.1,
            "uniform_noise": false,
            "uniform_noise_min_val": 0.0,
            "uniform_noise_max_val": 10.0,
            "gaussian_noise": false,
            "gaussian_noise_mean": 5.0,
            "gaussian_noise_stddev": 1.0,
            "contrast": false,
            "contrast_min_gamma": 0.5,
            "contrast_max_gamma": 2.0,
            "brightness": false,
            "brightness_min_val": 0.0,
            "brightness_max_val": 10.0,
            "random_crop": false,
            "random_crop_height": 256,
            "random_crop_width": 256,
            "random_flip": false,
            "flip_horizontal": true
        },
        "online_shuffling": true,
        "shuffle_buffer_size": 128,
        "prefetch": true,
        "batch_size": 4,
        "batches_per_epoch": null,
        "min_batches_per_epoch": 200,
        "val_batches_per_epoch": null,
        "min_val_batches_per_epoch": 10,
        "epochs": 200,
        "optimizer": "adam",
        "initial_learning_rate": 0.0001,
        "learning_rate_schedule": {
            "reduce_on_plateau": true,
            "reduction_factor": 0.5,
            "plateau_min_delta": 1e-06,
            "plateau_patience": 5,
            "plateau_cooldown": 3,
            "min_learning_rate": 1e-08
        },
        "hard_keypoint_mining": {
            "online_mining": false,
            "hard_to_easy_ratio": 2.0,
            "min_hard_keypoints": 2,
            "max_hard_keypoints": null,
            "loss_scale": 5.0
        },
        "early_stopping": {
            "stop_training_on_plateau": true,
            "plateau_min_delta": 1e-08,
            "plateau_patience": 10
        }
    },
    "outputs": {
        "save_outputs": true,
        "run_name": "220516_200359.single_instance.n=26",
        "run_name_prefix": "",
        "run_name_suffix": "",
        "runs_folder": "C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation\\models",
        "tags": [
            ""
        ],
        "save_visualizations": true,
        "delete_viz_images": true,
        "zip_outputs": false,
        "log_to_csv": true,
        "checkpointing": {
            "initial_model": false,
            "best_model": true,
            "every_epoch": false,
            "latest_model": false,
            "final_model": false
        },
        "tensorboard": {
            "write_logs": false,
            "loss_frequency": "epoch",
            "architecture_graph": false,
            "profile_graph": false,
            "visualizations": true
        },
        "zmq": {
            "subscribe_to_controller": true,
            "controller_address": "tcp://127.0.0.1:9000",
            "controller_polling_timeout": 10,
            "publish_updates": true,
            "publish_address": "tcp://127.0.0.1:9001"
        }
    },
    "name": "",
    "description": "",
    "sleap_version": "1.2.3",
    "filename": "C:\\Users\\moorelab\\AppData\\Local\\Temp\\tmpgdzdwt6i\\220516_200359_training_job.json"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Using GPU 0 for acceleration.
INFO:sleap.nn.training:Disabled GPU memory pre-allocation.
INFO:sleap.nn.training:System:
GPUs: 1/1 available
  Device: /physical_device:GPU:0
         Available: True
        Initalized: False
     Memory growth: True
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation/JoplinLick.v000.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1
INFO:sleap.nn.training:  Splits: Training = 23 / Validation = 3.
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2022-05-16 20:04:03.657263: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-16 20:04:04.078266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5979 MB memory:  -> device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5
2022-05-16 20:04:04.576879: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
INFO:sleap.nn.training:Loaded test example. [2.101s]
INFO:sleap.nn.training:  Input shape: (1088, 1920, 3)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training:  Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training:  Max stride: 16
INFO:sleap.nn.training:  Parameters: 1,953,624
INFO:sleap.nn.training:  Heads:
INFO:sleap.nn.training:    [0] = SingleInstanceConfmapsHead(part_names=['mouth', 'eyes', 'chest', 'l_hand', 'r_hand', 'l_foot', 'r_foot', 'tongue'], sigma=2.5, output_stride=2, loss_weight=1.0)
INFO:sleap.nn.training:  Outputs:
INFO:sleap.nn.training:    [0] = KerasTensor(type_spec=TensorSpec(shape=(None, 544, 960, 8), dtype=tf.float32, name=None), name='SingleInstanceConfmapsHead/BiasAdd:0', description="created by layer 'SingleInstanceConfmapsHead'")
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 23
INFO:sleap.nn.training:Validation set: n = 3
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training:  Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training:  Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-08, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training:  ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training:  ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: C:/Users/moorelab/Dropbox/Experiments/MooreLab/poseEstimation\models\220516_200359.single_instance.n=26
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [2.7s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [5.7s]
INFO:sleap.nn.training:Starting training loop...
Epoch 1/200
2022-05-16 20:04:14.712217: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8201
Saving config: C:\Users\moorelab/.sleap/1.2.3/preferences.yaml

talmo commented 2 years ago

Hey @panichem!

Are you able to use the sample dataset from the tutorial?

If so, or if you just want to give it a quick try, does training a bottom-up multi-animal model work? (This works for single animals as well.)

Give those a spin and if neither works, do you mind sharing the video + .slp file with talmo@salk.edu?

Talmo

roomrys commented 2 years ago

Hi @panichem,

I also came across this when I created a model where the receptive field (RF) size was relatively small compared to the overall frame size. You could try lowering the input scaling to ~0.5 (which increases the RF size relative to the frame) and see how that affects the first epoch training time. Please let us know if either of these suggestions works.
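As a rough illustration of what halving the input scaling does to the tensor sizes, here is a minimal sketch that starts from the shapes in the log above (input 1088 x 1920, max stride 16, output stride 2, 8 parts, batch size 4). The helper names are hypothetical and this is not SLEAP's own code; it only assumes that frames are scaled and then padded up to a multiple of the max stride, and that confidence maps are stored as float32.

```python
import math


def padded_size(height, width, scale, max_stride):
    """Scale a frame, then pad each side up to a multiple of max_stride."""
    h = math.ceil(height * scale / max_stride) * max_stride
    w = math.ceil(width * scale / max_stride) * max_stride
    return h, w


def confmap_size(height, width, scale, max_stride=16, output_stride=2,
                 n_parts=8, batch_size=4, bytes_per_value=4):
    """Return the confidence-map (height, width) and megabytes per batch."""
    h, w = padded_size(height, width, scale, max_stride)
    out_h, out_w = h // output_stride, w // output_stride
    n_values = batch_size * out_h * out_w * n_parts
    return (out_h, out_w), n_values * bytes_per_value / 1e6


# Input shape taken from the log: (1088, 1920, 3).
full = confmap_size(1088, 1920, scale=1.0)
half = confmap_size(1088, 1920, scale=0.5)
print("scale 1.0:", full)
print("scale 0.5:", half)
print("ratio:", full[1] / half[1])
```

At scale 1.0 the output map is 544 x 960, matching the `(None, 544, 960, 8)` tensor in the log. At 0.5 it drops to 272 x 480, a quarter of the values per batch. The receptive field in network pixels is fixed by the architecture, so at 0.5 it also spans twice as much of the original frame in each dimension.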

Thanks, Liezl

panichem commented 2 years ago

@talmo @roomrys - I switched to a bottom-up model and changed the input scaling to 0.5, and now the first ~10 epochs are done in a few minutes. Thanks for your help!!
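For anyone applying the scaling change outside the GUI, the setting appears to correspond to the `input_scaling` field under `data.preprocessing` in the training job JSON printed in the log above. The sketch below is one way to override it with the standard library only. The function name and file names are hypothetical, and the demo runs on a stand-in profile that contains just the relevant keys rather than a real SLEAP profile.

```python
import json
import tempfile
from pathlib import Path


def set_input_scaling(profile_path, scale, out_path):
    """Copy a training profile JSON, overriding data.preprocessing.input_scaling."""
    cfg = json.loads(Path(profile_path).read_text())
    cfg["data"]["preprocessing"]["input_scaling"] = scale
    Path(out_path).write_text(json.dumps(cfg, indent=4))
    return cfg


# Demo on a stand-in profile holding only the keys seen in the log.
workdir = Path(tempfile.mkdtemp())
original = workdir / "training_job.json"  # hypothetical file name
original.write_text(
    json.dumps({"data": {"preprocessing": {"input_scaling": 1.0}}})
)

half_scale = workdir / "training_job_half_scale.json"  # hypothetical file name
updated = set_input_scaling(original, 0.5, half_scale)
print(updated["data"]["preprocessing"]["input_scaling"])
```

The log shows the GUI invoking `sleap-train` with the profile JSON followed by the `.slp` labels file, so an edited profile could presumably be passed the same way. This only changes the scaling; switching to a bottom-up model is a separate change to the `heads` section.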

roomrys commented 2 years ago

Marking this as a TODO since there is a workaround, but we still need to find the root cause (and prevent it from happening).

talmo commented 2 years ago

The fact that there weren't any errors and that training didn't even start makes me think it's a TensorFlow deadlock.

We've run into this in the past (see attempted fixes in https://github.com/talmolab/sleap/commit/613c20119e992a0a3309cf0e99a8648cc6818cb0 and https://github.com/talmolab/sleap/commit/492b67b6b0325fa0f46e6abcbf7fef5e580a5bde). I think it's related to how we use tf.py_function -- there's a thread about it over in https://github.com/tensorflow/tensorflow/issues/32454, but no solution.

In the past I've had a hard time reliably reproducing this -- it seems to be stochastic and maybe system-dependent -- so maybe let's just close this for now and revisit it if more people are having the same problem.

Also moving this to Discussions so folks see it when asking q's.

Thanks for the report @panichem!

talmolab / sleap

training doesn't proceed past epoch 1 on GPU workstation #751