talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Size issue when training centroid models persists #528

Closed: Xiaoyu-Tong closed this issue 3 years ago

Xiaoyu-Tong commented 3 years ago

Hi @talmo ,

I have tried version 1.1.2, but unfortunately it is still not working. I tried training a model with the new version, and I also tried exporting a new training package from the new version and training from that, but neither helps.

Training the top-down confidence map model works smoothly. But when I tried to train the centroid model by running `!sleap-train baseline.centroid.json 2BMv2_ImplantBranch.pkg.slp --run_name "2BMv2_ImplantBranch.centroid"`, it gave me the error attached below. Isn't this error the same frame-size issue you mentioned before? And while you investigate it, is there any possible workaround? Thank you!
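For clarity, here is the command with its closing quote restored, as it would be entered in a single Colab cell (assuming the .pkg.slp file is in the working directory):

```
!sleap-train baseline.centroid.json 2BMv2_ImplantBranch.pkg.slp --run_name "2BMv2_ImplantBranch.centroid"
```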

2021-03-25 14:32:25.338509: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 INFO:sleap.nn.training:Versions: SLEAP: 1.1.2 TensorFlow: 2.3.1 Numpy: 1.18.5 Python: 3.7.10 OS: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic INFO:sleap.nn.training:Training labels file: 2BMv2_ImplantBranch.pkg.slp INFO:sleap.nn.training:Training profile: /usr/local/lib/python3.7/dist-packages/sleap/training_profiles/baseline.centroid.json INFO:sleap.nn.training: INFO:sleap.nn.training:Arguments: INFO:sleap.nn.training:{ "training_job_path": "baseline.centroid.json", "labels_path": "2BMv2_ImplantBranch.pkg.slp", "video_paths": "", "val_labels": null, "test_labels": null, "tensorboard": false, "save_viz": false, "zmq": false, "run_name": "2BMv2_ImplantBranch.centroid", "prefix": "", "suffix": "" } INFO:sleap.nn.training: INFO:sleap.nn.training:Training job: INFO:sleap.nn.training:{ "data": { "labels": { "training_labels": null, "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "split_by_inds": false, "training_inds": null, "validation_inds": null, "test_inds": null, "search_path_hints": [], "skeletons": [] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 0.5, "pad_to_stride": null, "resize_and_pad_to_target": true, "target_height": null, "target_width": null }, "instance_cropping": { "center_on_part": null, "crop_size": null, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 16, "output_stride": 2, "filters": 16, "filters_rate": 2.0, "middle_block": true, "up_interpolate": true, "stacks": 1 }, "hourglass": null, "resnet": null, "pretrained_encoder": null }, "heads": { "single_instance": null, "centroid": { "anchor_part": null, "sigma": 5.0, "output_stride": 2, "offset_refinement": false }, "centered_instance": null, "multi_instance": null } }, "optimization": { "preload_data": true, "augmentation_config": { "rotate": true, "rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": false, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, "brightness_max_val": 10.0, "random_crop": false, "random_crop_height": 256, "random_crop_width": 256, "random_flip": false, "flip_horizontal": true }, "online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 4, "batches_per_epoch": null, "min_batches_per_epoch": 200, "val_batches_per_epoch": null, "min_val_batches_per_epoch": 10, "epochs": 200, "optimizer": "adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, "plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": false, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-06, "plateau_patience": 10 } }, "outputs": { "save_outputs": true, "run_name": "2BMv2_ImplantBranch.centroid", "run_name_prefix": "", "run_name_suffix": null, "runs_folder": "models", 
"tags": [], "save_visualizations": true, "delete_viz_images": true, "zip_outputs": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": false, "controller_address": "tcp://127.0.0.1:9000", "controller_polling_timeout": 10, "publish_updates": false, "publish_address": "tcp://127.0.0.1:9001" } }, "name": "", "description": "", "sleap_version": "1.1.2", "filename": "/usr/local/lib/python3.7/dist-packages/sleap/training_profiles/baseline.centroid.json" } INFO:sleap.nn.training: INFO:sleap.nn.training:System: 2021-03-25 14:32:27.585333: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1 2021-03-25 14:32:27.604656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:27.605625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0 coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s 2021-03-25 14:32:27.605701: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-25 14:32:27.804232: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-03-25 14:32:27.811176: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10 2021-03-25 14:32:27.817650: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10 2021-03-25 14:32:27.841652: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10 2021-03-25 14:32:27.861168: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10 2021-03-25 14:32:28.284821: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7 2021-03-25 14:32:28.285120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:28.286169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:28.287043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0 GPUs: 1/1 available Device: /physical_device:GPU:0 Available: True Initalized: False Memory growth: True INFO:sleap.nn.training: INFO:sleap.nn.training:Initializing trainer... INFO:sleap.nn.training:Loading training labels from: 2BMv2_ImplantBranch.pkg.slp INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1 INFO:sleap.nn.training: Splits: Training = 581 / Validation = 64. INFO:sleap.nn.training:Setting up for training... INFO:sleap.nn.training:Setting up pipeline builders... 
INFO:sleap.nn.training:Setting up model... INFO:sleap.nn.training:Building test pipeline... 2021-03-25 14:32:32.894853: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-03-25 14:32:32.905229: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz 2021-03-25 14:32:32.905660: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5599631ccfc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2021-03-25 14:32:32.905704: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2021-03-25 14:32:33.030127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:33.031445: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5599631cda40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices: 2021-03-25 14:32:33.031496: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0 2021-03-25 14:32:33.031779: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:33.032702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0 coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s 2021-03-25 14:32:33.032812: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-25 14:32:33.032906: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-03-25 14:32:33.032966: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10 2021-03-25 14:32:33.033026: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10 2021-03-25 14:32:33.033116: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10 2021-03-25 14:32:33.033190: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10 2021-03-25 14:32:33.033272: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7 2021-03-25 14:32:33.033413: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:33.034362: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:33.035247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0 2021-03-25 14:32:33.035338: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-25 14:32:34.008489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-03-25 14:32:34.008562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0 2021-03-25 14:32:34.008589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N 2021-03-25 14:32:34.008906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:34.009931: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:34.010912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14958 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0) INFO:sleap.nn.training:Loaded test example. [4.671s] INFO:sleap.nn.training: Input shape: (256, 336, 3) INFO:sleap.nn.training:Created Keras model. INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False) INFO:sleap.nn.training: Max stride: 16 INFO:sleap.nn.training: Parameters: 1,953,393 INFO:sleap.nn.training: Heads: INFO:sleap.nn.training: [0] = CentroidConfmapsHead(anchor_part=None, sigma=5.0, output_stride=2, loss_weight=1.0) INFO:sleap.nn.training: Outputs: INFO:sleap.nn.training: [0] = Tensor("CentroidConfmapsHead_0/BiasAdd:0", shape=(None, 128, 168, 1), dtype=float32) INFO:sleap.nn.training:Setting up data pipelines... INFO:sleap.nn.training:Training set: n = 581 INFO:sleap.nn.training:Validation set: n = 64 INFO:sleap.nn.training:Setting up optimization... INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08) INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-06, plateau_patience=10) INFO:sleap.nn.training:Setting up outputs... INFO:sleap.nn.training:Created run path: models/2BMv2_ImplantBranch.centroid_2 INFO:sleap.nn.training:Setting up visualization... WARNING:tensorflow:Model was constructed with shape (None, 256, 336, 3) for input Tensor("input:0", shape=(None, 256, 336, 3), dtype=float32), but it was called on an input with incompatible shape (None, 492, 656, 3). Traceback (most recent call last): File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1812, in _create_c_op c_op = pywrap_tf_session.TF_FinishOperation(op_desc) tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 1 in both shapes must be equal, but are 123 and 124. Shapes are [?,123,164] and [?,124,164]. 
for '{{node functional_1/stack0_dec1_s8_to_s4_skip_concat/concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](functional_1/stack0_enc2_act1_relu/Relu, functional_1/stack0_dec1_s8_to_s4_interp_bilinear/resize/ResizeBilinear, functional_1/stack0_dec1_s8_to_s4_skip_concat/concat/axis)' with input shapes: [?,123,164,64], [?,124,164,128], [] and with computed input tensors: input[2] = <3>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/sleap-train", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 1582, in main
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 875, in train
    self.setup()
  File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 869, in setup
    self._setup_visualization()
  File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 1142, in _setup_visualization
    training_viz_ds_iter = iter(self.training_viz_pipeline.make_dataset())
  File "/usr/local/lib/python3.7/dist-packages/sleap/nn/data/pipelines.py", line 282, in make_dataset
    ds = transformer.transform_dataset(ds)
  File "/usr/local/lib/python3.7/dist-packages/sleap/nn/data/inference.py", line 40, in transform_dataset
    keras_model = tf.keras.Model(input_layers, self.keras_model(input_layers))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 926, in __call__
    input_list)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1117, in _functional_construction_call
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/functional.py", line 386, in call
    inputs, training=training, mask=mask)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/functional.py", line 508, in _run_internal_graph
    outputs = node.layer(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 926, in __call__
    input_list)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1117, in _functional_construction_call
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/layers/merge.py", line 183, in call
    return self._merge_function(inputs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/layers/merge.py", line 522, in _merge_function
    return K.concatenate(inputs, axis=self.axis)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/backend.py", line 2881, in concatenate
    return array_ops.concat([to_dense(x) for x in tensors], axis)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1654, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1222, in concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 593, in _create_op_internal
    compute_device)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3485, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1975, in __init__
    control_input_ops, op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1815, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimension 1 in both shapes must be equal, but are 123 and 124. Shapes are [?,123,164] and [?,124,164].
	for '{{node functional_1/stack0_dec1_s8_to_s4_skip_concat/concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](functional_1/stack0_enc2_act1_relu/Relu, functional_1/stack0_dec1_s8_to_s4_interp_bilinear/resize/ResizeBilinear, functional_1/stack0_dec1_s8_to_s4_skip_concat/concat/axis)' with input shapes: [?,123,164,64], [?,124,164,128], [] and with computed input tensors: input[2] = <3>.
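As a side note (my own sketch of the arithmetic, not part of the original log): the 123-vs-124 mismatch is what you would expect from pushing a 492 x 656 frame through a U-Net with a max stride of 16, since 492 is not a multiple of 16. Assuming "same"-padded pooling (ceiling division) on the way down and plain x2 upsampling on the way back up:

```python
import math

height = 492  # frame height from the shape warning above; 492 % 16 != 0

# Encoder: four stride-2 downsampling steps ("same" padding -> ceiling division).
sizes = [height]
for _ in range(4):
    sizes.append(math.ceil(sizes[-1] / 2))
print(sizes)  # [492, 246, 123, 62, 31]

# Decoder: each step doubles the size and concatenates with the encoder
# feature map at the same level (the skip connection).
print(31 * 2, "vs", sizes[3])  # 62 vs 62   -> first skip concat is fine
print(62 * 2, "vs", sizes[2])  # 124 vs 123 -> the mismatch reported in the error
```

Padding the frame up to the next multiple of the max stride (496 x 656 here) would make the skip shapes line up, which appears to be what the frame-size handling fix addresses.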

talmo commented 3 years ago

Hi @Xiaoyu-Tong,

Well, to call this persistent would be an understatement!

Among all the fixes that went into v1.1.2, somehow the one that was specifically targeted at this issue got reverted...

I've now put out a new patch release that fixes this regression, and you can get it here: https://github.com/murthylab/sleap/releases/tag/v1.1.3 (conda packages may take a little longer to finish building, but pip install should work)
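For example, upgrading an existing pip or Colab environment should be as simple as the following (the explicit version pin is just my suggestion; a plain upgrade would also pick up 1.1.3 once it is the latest release):

```
pip install --upgrade sleap==1.1.3
```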

And just to make sure, here's a Colab showing that you can indeed train with your exact dataset and training profile: https://colab.research.google.com/drive/1JPNw4whSs3AA1hDYbtP5BaE8frNWvTx4?usp=sharing

Please let us know if you're still having trouble -- you're joining the ranks of top bug hunters in SLEAP! Thanks again for your patience, and sorry about the inconvenience!

Cheers,

Talmo

Xiaoyu-Tong commented 3 years ago

Hi @talmo,

Thank you for working on this! About the Colab notebook you shared with me: what is "failing_job.json"? Is it the centroid one or the top-down one?

talmo commented 3 years ago

The centroid one (I copied the one you pasted above).

Also if you want to test it out explicitly: the issue was in the visualization code for centroids -- you can disable visualizations when training centroid models as a workaround, but it should be fixed now. Give it a go :)
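One way to disable them (a sketch based on the profile keys shown in the log above, not an exact recipe; I haven't verified which flag gates the failing code path) is to copy baseline.centroid.json, turn off the visualization outputs, and pass the edited copy to sleap-train:

```json
{
  "outputs": {
    "save_visualizations": false,
    "tensorboard": {
      "visualizations": false
    }
  }
}
```

(Only the relevant keys are shown; the rest of the profile stays as in the original file.)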

Xiaoyu-Tong commented 3 years ago

> The centroid one (I copied the one you pasted above).
>
> Also if you want to test it out explicitly: the issue was in the visualization code for centroids -- you can disable visualizations when training centroid models as a workaround, but it should be fixed now. Give it a go :)

Version 1.1.3 seems to work!! The training has now started. Very excited!

talmo commented 3 years ago

Awesome! I'll close this, but again, feel free to re-open it (it's the button that says "Re-open issue" or something below this post).