Training error occurs with edited skeleton

Xiaoyu-Tong commented 3 years ago

I edited the skeleton: specifically, I added a new body part and saved the change. When I then tried to train the model for a second iteration (the first iteration magically worked), it gave me the error message attached below. It's worth mentioning that only the top-down model training fails; the centroid model training works well. I have tried it both on Colab and locally, but neither works. Based on the error message, I think the reason may be that the config was not updated with the new body-part set (as shown in the highlighted part of the error message) while the dataset was, which causes a mismatch in the number of body parts.

Thank you!

Error message: 2021-03-29 21:31:33.511276: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 INFO:sleap.nn.training:Versions: SLEAP: 1.1.3 TensorFlow: 2.3.1 Numpy: 1.18.5 Python: 3.7.10 OS: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic INFO:sleap.nn.training:Training labels file: 2BMv3.pkg.slp INFO:sleap.nn.training:Training profile: centered_instance.json INFO:sleap.nn.training: INFO:sleap.nn.training:Arguments: INFO:sleap.nn.training:{ "training_job_path": "centered_instance.json", "labels_path": "2BMv3.pkg.slp", "video_paths": "", "val_labels": null, "test_labels": null, "tensorboard": false, "save_viz": false, "zmq": false, "run_name": "2BMv3.topdown_confmaps", "prefix": "", "suffix": "" } INFO:sleap.nn.training: INFO:sleap.nn.training:Training job: INFO:sleap.nn.training:{ "data": { "labels": { "training_labels": null, "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "split_by_inds": false, "training_inds": null, "validation_inds": null, "test_inds": null, "search_path_hints": [], "skeletons": [] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 1.0, "pad_to_stride": null, "resize_and_pad_to_target": true, "target_height": null, "target_width": null }, "instance_cropping": { "center_on_part": null, "crop_size": null, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 32, "output_stride": 4, "filters": 24, "filters_rate": 1.5, "middle_block": true, "up_interpolate": true, "stacks": 1 }, "hourglass": null, "resnet": null, "pretrained_encoder": null }, "heads": { "single_instance": null, "centroid": null, "centered_instance": { "anchor_part": null, "part_names": null, "sigma": 5.0, "output_stride": 4, "offset_refinement": false }, "multi_instance": null } }, "optimization": { "preload_data": true, "augmentation_config": { "rotate": true, 
"rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": false, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, "brightness_max_val": 10.0, "random_crop": false, "random_crop_height": 256, "random_crop_width": 256, "random_flip": false, "flip_horizontal": true }, "online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 4, "batches_per_epoch": null, "min_batches_per_epoch": 200, "val_batches_per_epoch": null, "min_val_batches_per_epoch": 10, "epochs": 200, "optimizer": "adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, "plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": false, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-06, "plateau_patience": 10 } }, "outputs": { "save_outputs": true, "run_name": "2BMv3.topdown_confmaps", "run_name_prefix": "", "run_name_suffix": ".centered_instance", "runs_folder": "", "tags": [ "" ], "save_visualizations": true, "delete_viz_images": true, "zip_outputs": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": false, "controller_address": "tcp://127.0.0.1:9000", 
"controller_polling_timeout": 10, "publish_updates": false, "publish_address": "tcp://127.0.0.1:9001" } }, "name": "", "description": "", "sleap_version": "1.1.3", "filename": "centered_instance.json" } INFO:sleap.nn.training: INFO:sleap.nn.training:System: 2021-03-29 21:31:35.434239: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1 2021-03-29 21:31:35.449759: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:35.450374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0 coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s 2021-03-29 21:31:35.450450: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-29 21:31:35.644331: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-03-29 21:31:35.651993: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10 2021-03-29 21:31:35.657744: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10 2021-03-29 21:31:35.676976: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10 2021-03-29 21:31:35.693797: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10 2021-03-29 21:31:36.093697: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7 2021-03-29 21:31:36.093951: I 
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:36.094679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:36.095212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0 GPUs: 1/1 available Device: /physical_device:GPU:0 Available: True Initalized: False Memory growth: True INFO:sleap.nn.training: INFO:sleap.nn.training:Initializing trainer... INFO:sleap.nn.training:Loading training labels from: 2BMv3.pkg.slp INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1 INFO:sleap.nn.training: Splits: Training = 694 / Validation = 77. INFO:sleap.nn.training:Setting up for training... INFO:sleap.nn.training:Setting up pipeline builders... INFO:sleap.nn.training:Setting up model... INFO:sleap.nn.training:Building test pipeline... 2021-03-29 21:31:40.101930: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-03-29 21:31:40.115437: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz 2021-03-29 21:31:40.130599: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562079756fc0 initialized for platform Host (this does not guarantee that XLA will be used). 
Devices: 2021-03-29 21:31:40.130771: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2021-03-29 21:31:40.276944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:40.277748: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562079757180 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices: 2021-03-29 21:31:40.277782: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0 2021-03-29 21:31:40.278048: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:40.278759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0 coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s 2021-03-29 21:31:40.278851: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-29 21:31:40.278905: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-03-29 21:31:40.278935: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10 2021-03-29 21:31:40.278971: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10 2021-03-29 21:31:40.279000: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10 2021-03-29 21:31:40.279028: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10 2021-03-29 21:31:40.279055: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7 2021-03-29 21:31:40.279162: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:40.279772: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:40.280287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0 2021-03-29 21:31:40.280373: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-29 21:31:40.873468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-03-29 21:31:40.873531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0 2021-03-29 21:31:40.873544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N 2021-03-29 21:31:40.873810: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:40.874489: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-29 21:31:40.875047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14958 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute 
capability: 6.0)
INFO:sleap.nn.training:Loaded test example. [4.566s]
INFO:sleap.nn.training: Input shape: (256, 256, 3)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=24, filters_rate=1.5, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=5, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training: Max stride: 32
INFO:sleap.nn.training: Parameters: 1,645,619
INFO:sleap.nn.training: Heads:

INFO:sleap.nn.training: [0] = CenteredInstanceConfmapsHead(part_names=['Ear_left', 'Ear_right', 'Nose', 'Head', 'Neck', 'Center', 'Lateral_left', 'Lateral_right', 'Tail_base'], anchor_part=None, sigma=5.0, output_stride=4, loss_weight=1.0)

INFO:sleap.nn.training: Outputs:
INFO:sleap.nn.training: [0] = Tensor("CenteredInstanceConfmapsHead_0/BiasAdd:0", shape=(None, 64, 64, 9), dtype=float32)
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 694
INFO:sleap.nn.training:Validation set: n = 77
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-06, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.training:Created run path: 2BMv3.topdown_confmaps.centered_instance
INFO:sleap.nn.training:Setting up visualization...
Unable to use Qt backend for matplotlib. This probably means Qt is running headless.
INFO:sleap.nn.training:Finished trainer set up. [7.2s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [13.8s]
INFO:sleap.nn.training:Starting training loop...
Epoch 1/200
2021-03-29 21:32:01.235813: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 21:32:03.573000: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
Traceback (most recent call last):
  File "/usr/local/bin/sleap-train", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 1582, in main
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 892, in train
    verbose=2,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 807, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [4,9,64,64] vs. [4,10,64,64]

 [[node loss_fn/mean_squared_error/SquaredDifference (defined at /lib/python3.7/dist-packages/sleap/nn/training.py:281) ]] [Op:__inference_train_function_24869]

Errors may have originated from an input operation. Input Source operations connected to node loss_fn/mean_squared_error/SquaredDifference: IteratorGetNext (defined at /lib/python3.7/dist-packages/sleap/nn/training.py:892)

Function call stack: train_function
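The two shapes in the error can be read against the head definition logged above: the model head lists 9 entries in `part_names` and outputs 9 channels, while the other tensor in the loss has 10. Below is a minimal plain-Python sketch of that comparison. It does not call SLEAP or TensorFlow; `channel_count`, `explain_mismatch`, and the tenth part name `New_part` are hypothetical, since the thread does not name the added body part.

```python
from typing import List, Sequence


def channel_count(shape: Sequence[int]) -> int:
    """Return the channel dimension of a [batch, channels, height, width] shape."""
    return shape[1]


def explain_mismatch(head_parts: List[str], data_parts: List[str]) -> List[str]:
    """List body parts present in the data but absent from the model head."""
    known = set(head_parts)
    return [part for part in data_parts if part not in known]


# Part names copied from the CenteredInstanceConfmapsHead log line.
head_parts = ["Ear_left", "Ear_right", "Nose", "Head", "Neck", "Center",
              "Lateral_left", "Lateral_right", "Tail_base"]
# Hypothetical: the thread does not say what the added part is called.
data_parts = head_parts + ["New_part"]

head_shape = [4, 9, 64, 64]   # first shape in the error message
data_shape = [4, 10, 64, 64]  # second shape in the error message

# The head accounts for the 9-channel tensor, the edited data for the 10-channel one.
assert channel_count(head_shape) == len(head_parts)
assert channel_count(data_shape) == len(data_parts)

missing = explain_mismatch(head_parts, data_parts)
print("Parts the head does not know about:", missing)
```

Under these assumptions, the single extra channel corresponds to the body part added when the skeleton was edited.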

arie-matsliah commented 3 years ago

Was your training data labeled before updating the skeleton or after?

Xiaoyu-Tong commented 3 years ago

Was your training data labeled before updating the skeleton or after?

Hi @ariematsliah-princeton, the training data was labeled after updating the skeleton, or more precisely, edited afterwards. Namely, I labeled with the first skeleton, then updated the skeleton, and then manually updated ALL the labeled frames to add the new body part.

arie-matsliah commented 3 years ago

Thanks for clarifying. Since this is not a common workflow, it'll be easier to debug if you can share the project files (sleap@princeton.edu). We can take a look and perhaps suggest a workaround if it's not an easy fix.

Xiaoyu-Tong commented 3 years ago

OK, file sent. Thank you for your help in advance.

arie-matsliah commented 3 years ago

Hi @Xiaoyu-Tong

As noted earlier, SLEAP does not work well with multiple skeletons. As a workaround, we emailed the corrected project file back to you.

For the record, this is how it was done: https://gist.github.com/ariematsliah-princeton/f872657ca785c5147ded558c16fc2707
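The linked gist is the record of the actual fix. Purely as an illustration of the general idea (collapsing a project onto one skeleton so that every instance carries the same set of body parts), here is a plain-Python sketch. It does not use the SLEAP API, and `Instance`, `unify_skeleton`, and the part names are hypothetical stand-ins, not the gist's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Optional[Tuple[float, float]]


@dataclass
class Instance:
    """Hypothetical stand-in for a labeled instance: part name -> (x, y)."""
    skeleton: Tuple[str, ...]
    points: Dict[str, Point]


def unify_skeleton(instances: List[Instance], target: Tuple[str, ...]) -> List[Instance]:
    """Re-express every instance on `target`, matching points by part name.

    Parts missing from an instance become None (unlabeled); parts that are
    not in `target` are dropped.
    """
    return [
        Instance(skeleton=target,
                 points={name: inst.points.get(name) for name in target})
        for inst in instances
    ]


# Two skeletons coexisting in one project: the original and the edited one.
old = ("Nose", "Tail_base")
new = ("Nose", "Tail_base", "New_part")
instances = [
    Instance(old, {"Nose": (1.0, 2.0), "Tail_base": (5.0, 6.0)}),
    Instance(new, {"Nose": (1.5, 2.5), "Tail_base": (5.5, 6.5), "New_part": (3.0, 4.0)}),
]

unified = unify_skeleton(instances, new)
print({inst.skeleton for inst in unified})
```

After this step every instance has the same number of parts, which is the property the training loss needs for the shapes to agree.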

Thanks for your patience.

Xiaoyu-Tong commented 3 years ago

Hi @ariematsliah-princeton

Thank you! The corrected project can now be trained normally.

talmolab / sleap

Training error occurs with edited skeleton #536
