Closed Xiaoyu-Tong closed 3 years ago
Hi @Xiaoyu-Tong,
Well, to call this persistent would be an understatement!
Among all the fixes that went into v1.1.2, somehow the one that was specifically targeted at this issue got reverted...
I've now put out a new minor version that fixes this regression and you can get it here: https://github.com/murthylab/sleap/releases/tag/v1.1.3
(conda packages may take a little longer to finish building, but pip install should work)
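If it helps, here is a small sketch for confirming that the environment actually picked up the fixed release before retrying. The helper names (`parse_version`, `has_centroid_viz_fix`) are made up for illustration, and reading the installed version from `sleap.__version__` is an assumption about the package layout:

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from a version string such as '1.1.3'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_centroid_viz_fix(installed, fixed_in="1.1.3"):
    """True if the installed version is at least the release with the fix."""
    return parse_version(installed) >= parse_version(fixed_in)


# Usage in the training environment (assumes sleap exposes __version__):
#   import sleap
#   print(has_centroid_viz_fix(sleap.__version__))
```

If this reports False after upgrading, the runtime is probably still using a cached 1.1.2 install and may need a restart.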
And just to make sure, here's a Colab showing that you can indeed train with your exact dataset and training profile: https://colab.research.google.com/drive/1JPNw4whSs3AA1hDYbtP5BaE8frNWvTx4?usp=sharing
Please let us know if you're still having trouble -- you're making the ranks of top bug hunters in SLEAP! Thanks again for your patience and sorry about the inconvenience!
Cheers,
Talmo
Hi @Talmo,
Thank you for working on this! About the Colab notebook you shared with me: what is the "failing_job.json"? Is it the centroid one or the topdown one?
The centroid one (I copied the one you pasted above)
Also if you want to test it out explicitly: the issue was in the visualization code for centroids -- you can disable visualizations when training centroid models as a workaround, but it should be fixed now. Give it a go :)
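For anyone stuck on an affected version, the workaround could look roughly like the sketch below. It copies a training profile and switches off the two visualization flags that appear in the logged job config (`outputs.save_visualizations` and `outputs.tensorboard.visualizations`). Whether those keys are the ones that gate the failing step is an assumption, and the function names and the output filename are hypothetical:

```python
import copy
import json


def disable_visualizations(profile):
    """Return a copy of a training profile dict with visualizations turned off."""
    patched = copy.deepcopy(profile)
    outputs = patched.setdefault("outputs", {})
    # Key names taken from the job config printed in the training log;
    # that they control the failing visualization step is an assumption.
    outputs["save_visualizations"] = False
    outputs.setdefault("tensorboard", {})["visualizations"] = False
    return patched


def write_patched_profile(src_path, dst_path):
    """Load a profile JSON, disable visualizations, and save it under a new name."""
    with open(src_path) as f:
        profile = json.load(f)
    with open(dst_path, "w") as f:
        json.dump(disable_visualizations(profile), f, indent=2)


# Hypothetical usage:
#   write_patched_profile("baseline.centroid.json", "centroid_no_viz.json")
# and then pass centroid_no_viz.json to sleap-train instead of the original profile.
```

The original profile is left untouched, so it is easy to switch back once the fixed version is installed.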
1.1.3 version seems to work!! The training has now started. Very excited!
Awesome! I'll close but again, feel free to re-open (it's the button that says "Re-open issue" or something below this post)
Hi @talmo ,
I have tried version 1.1.2, but unfortunately it is still not working. I tried training a model with the new version, and I also tried exporting a new training package from the new version and training on that, but neither helps.
It works smoothly when I train the topdown confidence map model. But when I tried to train the centroid model by running !sleap-train baseline.centroid.json 2BMv2_ImplantBranch.pkg.slp --run_name "2BMv2_ImplantBranch.centroid", it gave me the error attached below. Is this not the same frame size issue you mentioned before? And while you investigate this error, is there any possible workaround? Thank you!
2021-03-25 14:32:25.338509: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 INFO:sleap.nn.training:Versions: SLEAP: 1.1.2 TensorFlow: 2.3.1 Numpy: 1.18.5 Python: 3.7.10 OS: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic INFO:sleap.nn.training:Training labels file: 2BMv2_ImplantBranch.pkg.slp INFO:sleap.nn.training:Training profile: /usr/local/lib/python3.7/dist-packages/sleap/training_profiles/baseline.centroid.json INFO:sleap.nn.training: INFO:sleap.nn.training:Arguments: INFO:sleap.nn.training:{ "training_job_path": "baseline.centroid.json", "labels_path": "2BMv2_ImplantBranch.pkg.slp", "video_paths": "", "val_labels": null, "test_labels": null, "tensorboard": false, "save_viz": false, "zmq": false, "run_name": "2BMv2_ImplantBranch.centroid", "prefix": "", "suffix": "" } INFO:sleap.nn.training: INFO:sleap.nn.training:Training job: INFO:sleap.nn.training:{ "data": { "labels": { "training_labels": null, "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "split_by_inds": false, "training_inds": null, "validation_inds": null, "test_inds": null, "search_path_hints": [], "skeletons": [] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 0.5, "pad_to_stride": null, "resize_and_pad_to_target": true, "target_height": null, "target_width": null }, "instance_cropping": { "center_on_part": null, "crop_size": null, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 16, "output_stride": 2, "filters": 16, "filters_rate": 2.0, "middle_block": true, "up_interpolate": true, "stacks": 1 }, "hourglass": null, "resnet": null, "pretrained_encoder": null }, "heads": { "single_instance": null, "centroid": { "anchor_part": null, "sigma": 5.0, "output_stride": 2, "offset_refinement": false }, "centered_instance": null, "multi_instance": null } }, "optimization": { 
"preload_data": true, "augmentation_config": { "rotate": true, "rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": false, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, "brightness_max_val": 10.0, "random_crop": false, "random_crop_height": 256, "random_crop_width": 256, "random_flip": false, "flip_horizontal": true }, "online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 4, "batches_per_epoch": null, "min_batches_per_epoch": 200, "val_batches_per_epoch": null, "min_val_batches_per_epoch": 10, "epochs": 200, "optimizer": "adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, "plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": false, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-06, "plateau_patience": 10 } }, "outputs": { "save_outputs": true, "run_name": "2BMv2_ImplantBranch.centroid", "run_name_prefix": "", "run_name_suffix": null, "runs_folder": "models", "tags": [], "save_visualizations": true, "delete_viz_images": true, "zip_outputs": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": false, 
"controller_address": "tcp://127.0.0.1:9000", "controller_polling_timeout": 10, "publish_updates": false, "publish_address": "tcp://127.0.0.1:9001" } }, "name": "", "description": "", "sleap_version": "1.1.2", "filename": "/usr/local/lib/python3.7/dist-packages/sleap/training_profiles/baseline.centroid.json" } INFO:sleap.nn.training: INFO:sleap.nn.training:System: 2021-03-25 14:32:27.585333: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1 2021-03-25 14:32:27.604656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:27.605625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0 coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s 2021-03-25 14:32:27.605701: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-25 14:32:27.804232: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-03-25 14:32:27.811176: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10 2021-03-25 14:32:27.817650: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10 2021-03-25 14:32:27.841652: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10 2021-03-25 14:32:27.861168: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10 2021-03-25 14:32:28.284821: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened 
dynamic library libcudnn.so.7 2021-03-25 14:32:28.285120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:28.286169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:28.287043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0 GPUs: 1/1 available Device: /physical_device:GPU:0 Available: True Initalized: False Memory growth: True INFO:sleap.nn.training: INFO:sleap.nn.training:Initializing trainer... INFO:sleap.nn.training:Loading training labels from: 2BMv2_ImplantBranch.pkg.slp INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1 INFO:sleap.nn.training: Splits: Training = 581 / Validation = 64. INFO:sleap.nn.training:Setting up for training... INFO:sleap.nn.training:Setting up pipeline builders... INFO:sleap.nn.training:Setting up model... INFO:sleap.nn.training:Building test pipeline... 2021-03-25 14:32:32.894853: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2021-03-25 14:32:32.905229: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2199995000 Hz 2021-03-25 14:32:32.905660: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5599631ccfc0 initialized for platform Host (this does not guarantee that XLA will be used). 
Devices: 2021-03-25 14:32:32.905704: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2021-03-25 14:32:33.030127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:33.031445: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5599631cda40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices: 2021-03-25 14:32:33.031496: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0 2021-03-25 14:32:33.031779: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:33.032702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0 coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s 2021-03-25 14:32:33.032812: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-25 14:32:33.032906: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10 2021-03-25 14:32:33.032966: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10 2021-03-25 14:32:33.033026: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10 2021-03-25 14:32:33.033116: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10 2021-03-25 14:32:33.033190: I 
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10 2021-03-25 14:32:33.033272: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7 2021-03-25 14:32:33.033413: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:33.034362: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:33.035247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0 2021-03-25 14:32:33.035338: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 2021-03-25 14:32:34.008489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-03-25 14:32:34.008562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0 2021-03-25 14:32:34.008589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N 2021-03-25 14:32:34.008906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:34.009931: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2021-03-25 14:32:34.010912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14958 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute 
capability: 6.0)
INFO:sleap.nn.training:Loaded test example. [4.671s]
INFO:sleap.nn.training: Input shape: (256, 336, 3)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training: Max stride: 16
INFO:sleap.nn.training: Parameters: 1,953,393
INFO:sleap.nn.training: Heads:
INFO:sleap.nn.training: [0] = CentroidConfmapsHead(anchor_part=None, sigma=5.0, output_stride=2, loss_weight=1.0)
INFO:sleap.nn.training: Outputs:
INFO:sleap.nn.training: [0] = Tensor("CentroidConfmapsHead_0/BiasAdd:0", shape=(None, 128, 168, 1), dtype=float32)
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 581
INFO:sleap.nn.training:Validation set: n = 64
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-06, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.training:Created run path: models/2BMv2_ImplantBranch.centroid_2
INFO:sleap.nn.training:Setting up visualization...
WARNING:tensorflow:Model was constructed with shape (None, 256, 336, 3) for input Tensor("input:0", shape=(None, 256, 336, 3), dtype=float32), but it was called on an input with incompatible shape (None, 492, 656, 3).
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1812, in _create_c_op
c_op = pywrap_tf_session.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 1 in both shapes must be equal, but are 123 and 124. Shapes are [?,123,164] and [?,124,164]. for '{{node functional_1/stack0_dec1_s8_to_s4_skip_concat/concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](functional_1/stack0_enc2_act1_relu/Relu, functional_1/stack0_dec1_s8_to_s4_interp_bilinear/resize/ResizeBilinear, functional_1/stack0_dec1_s8_to_s4_skip_concat/concat/axis)' with input shapes: [?,123,164,64], [?,124,164,128], [] and with computed input tensors: input[2] = <3>.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/sleap-train", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 1582, in main
trainer.train()
File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 875, in train
self.setup()
File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 869, in setup
self._setup_visualization()
File "/usr/local/lib/python3.7/dist-packages/sleap/nn/training.py", line 1142, in _setup_visualization
training_viz_ds_iter = iter(self.training_viz_pipeline.make_dataset())
File "/usr/local/lib/python3.7/dist-packages/sleap/nn/data/pipelines.py", line 282, in make_dataset
ds = transformer.transform_dataset(ds)
File "/usr/local/lib/python3.7/dist-packages/sleap/nn/data/inference.py", line 40, in transform_dataset
keras_model = tf.keras.Model(input_layers, self.keras_model(input_layers))
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 926, in __call__
input_list)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1117, in _functional_construction_call
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/functional.py", line 386, in call
inputs, training=training, mask=mask)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/functional.py", line 508, in _run_internal_graph
outputs = node.layer(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 926, in __call__
input_list)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1117, in _functional_construction_call
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/layers/merge.py", line 183, in call
return self._merge_function(inputs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/layers/merge.py", line 522, in _merge_function
return K.concatenate(inputs, axis=self.axis)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/backend.py", line 2881, in concatenate
return array_ops.concat([to_dense(x) for x in tensors], axis)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1654, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1222, in concat_v2
"ConcatV2", values=values, axis=axis, name=name)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py", line 593, in _create_op_internal
compute_device)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3485, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1975, in __init__
control_input_ops, op_def)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1815, in _create_c_op
raise ValueError(str(e))
ValueError: Dimension 1 in both shapes must be equal, but are 123 and 124. Shapes are [?,123,164] and [?,124,164]. for '{{node functional_1/stack0_dec1_s8_to_s4_skip_concat/concat}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32](functional_1/stack0_enc2_act1_relu/Relu, functional_1/stack0_dec1_s8_to_s4_interp_bilinear/resize/ResizeBilinear, functional_1/stack0_dec1_s8_to_s4_skip_concat/concat/axis)' with input shapes: [?,123,164,64], [?,124,164,128], [] and with computed input tensors: input[2] = <3>.
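The 123 vs 124 in this error can be reproduced with a bit of arithmetic. The sketch below is not SLEAP code; it assumes each encoder level halves the size rounding up, and each decoder level doubles it, which is consistent with the shapes in the message. It uses the values from the log: raw frames of 492 x 656, `input_scaling` 0.5, max stride 16, output stride 2. All function names are made up for illustration:

```python
def pad_to_stride(size, stride):
    """Round size up to the next multiple of stride."""
    return -(-size // stride) * stride


def encoder_sizes(size, max_stride):
    """Sizes at strides 1, 2, 4, ... max_stride, halving with round-up."""
    sizes = [size]
    stride = 1
    while stride < max_stride:
        size = (size + 1) // 2  # assumed: downsampling rounds up
        sizes.append(size)
        stride *= 2
    return sizes


def first_skip_mismatch(size, max_stride, output_stride=1):
    """Return (stride, encoder_size, decoder_size) at the first skip
    connection whose sizes disagree, or None if every level lines up."""
    enc = encoder_sizes(size, max_stride)
    dec = enc[-1]
    level = len(enc) - 2
    while level >= 0 and 2 ** level >= output_stride:
        dec *= 2  # decoder upsamples by 2 at each level
        if dec != enc[level]:
            return (2 ** level, enc[level], dec)
        level -= 1
    return None


# Scaled and padded input, as the model was built: 492 * 0.5 = 246, 656 * 0.5 = 328.
print(pad_to_stride(246, 16), pad_to_stride(328, 16))  # 256 336
# Raw, unscaled frame height, as the visualization pipeline fed it:
print(first_skip_mismatch(492, 16, output_stride=2))   # (4, 123, 124)
# The raw width happens to divide cleanly, so only dimension 1 fails:
print(first_skip_mismatch(656, 16, output_stride=2))   # None
```

So the model itself was built for the correctly scaled and padded 256 x 336 input, while the visualization step passed it full-resolution 492 x 656 frames, and 492 is not a multiple of 16.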