Hi @jverpeut,
For some reason, by the time we try to set aside a few frames for the validation split, the program thinks that `len(labels) == 0` (so `list(range(len(labels)))` is empty).

- The validation split is guaranteed to be at least 1, as we see in `test_size=1`.
- It looks like you are invoking the trainer through the GUI and labels are being loaded from `C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp`, so the command for running training should be correct.
- In the naming of the model, SLEAP thinks you have `n=31` labeled frames (Run Path: `C:/Users/jverpeut/Desktop\models\230419_140330.multi_class_topdown.n=31`).

We have seen this before when the user had no labeled frames in their project, but that doesn't seem to be the case for you, as there was a successful training of the centroid model before the multiclass top-down failed.
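For context, the failure comes from the train/validation split call shown in the traceback further down this thread; a minimal sketch of the same behavior (using scikit-learn's `train_test_split` directly, with a hypothetical empty `labels` list standing in for the loaded project):

```python
from sklearn.model_selection import train_test_split

labels = []  # hypothetical stand-in: a project with no usable labeled frames
n_val = 1    # at least one frame is always reserved for validation

# With zero samples, asking for a validation split of size 1 raises:
# ValueError: test_size=1 should be either positive and smaller than the
# number of samples 0 or a float in the (0, 1) range
idx_train, idx_val = train_test_split(list(range(len(labels))), test_size=n_val)
```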
Is the problem limited to just the multiclass models (i.e. have you been able to successfully train multiclass)? Also, do you have tracks assigned to each instance for training multiclass (this is a requirement since it is how the classes/tracks are learned).
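(If it helps to double-check the track assignments, here is a rough sketch of counting untracked instances with SLEAP's Python API; the file name is just a placeholder.)

```python
import sleap

labels = sleap.load_file("labels.v001.slp")  # placeholder path

# For the ID (multiclass) models, every labeled instance needs a track,
# since the tracks define the classes the model learns to distinguish.
untracked = [
    (lf.frame_idx, inst)
    for lf in labels.labeled_frames
    for inst in lf.instances
    if inst.track is None
]
print(f"{len(untracked)} instances have no track assigned")
```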
Thanks, Liezl
Liezl,
I believe we do have tracks assigned to each instance, but I am currently having more frames labeled to see if that solves the problem. I cannot seem to use other models based on the way my skeleton and nodes are constructed. Do you have examples of best ways to label nodes other than the ones provided as examples in the update? Those examples do not have enough joints labeled for my application.
Jess
Hi @jverpeut,
I don't think labeling more frames will help... The structure of the skeleton will determine if you can run a bottom-up model. You should be able to run just a normal top-down model (no multiclass). Do you mind giving this a try and letting me know if you have the same error?
Do you have examples of best ways to label nodes other than the ones provided as examples in the update? Those examples do not have enough joints labeled for my application.
When you say "best way to label nodes", do you mean the skeleton construction (connecting nodes via edges)?
If so, the nodes should be any body part you are interested in tracking. The edges connect "source" nodes to "destination" nodes. If you would like to use the bottom-up model, then you will need to construct your skeleton such that each destination node has only one source node (think of this like a tree where the trunk/source splits into branches/destinations).
Often, as the trunk of this tree, we choose a point that is easy to find throughout the video (such as a central node on the body, e.g. the torso). This is because SLEAP uses the source nodes to help find destination nodes in bottom-up via Part Affinity Fields (PAFs), and we want to choose our top source node as something that is easy to find.
It is better to create a short, stubby tree from a few source nodes that are easily found than a tall tree with many chained source nodes, as "losing" (being unable to locate) one node in the chain could domino into losing the rest of the chain.
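For illustration, a minimal sketch of a tree-structured skeleton built this way (this assumes SLEAP's Python `Skeleton` API with `add_node`/`add_edge`; the node names are hypothetical):

```python
from sleap.skeleton import Skeleton

# A short, "stubby" tree: one easy-to-find central source node (torso)
# with short chains branching off of it.
skeleton = Skeleton("mouse")
for node in ["torso", "head", "nose", "left_ear", "right_ear", "tail_base"]:
    skeleton.add_node(node)

# Each destination node has exactly one source node (a tree), which is
# what the bottom-up (PAF-based) models require.
skeleton.add_edge("torso", "head")
skeleton.add_edge("head", "nose")
skeleton.add_edge("head", "left_ear")
skeleton.add_edge("head", "right_ear")
skeleton.add_edge("torso", "tail_base")
```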
Thanks, Liezl
Liezl,
We went back and added 260 labels, but we still received an error. I have attached the output.
Jess
Hi @jverpeut,
Yeah... adding more labels does not seem like it will help in this case, but now we know for sure!
The output was not sent to the post on GitHub... Do you mind attaching the output directly to your comment on GitHub: https://github.com/talmolab/sleap/issues/1282#issuecomment-1570831728?
You should be able to run just a normal top-down model (no multiclass). Do you mind giving this a try and letting me know if you have the same error?
Thanks, Liezl
Still receiving an error with more labels:
Hi @jverpeut,
It looks like you might be on `1.3.0a0` in the latest logs -- I think the bug you're getting at the end should be fixed in 1.3.0 if you want to give that a go!
Talmo
Thank you. I will update the software and try again.
This time training was able to start, but failed at the centered-instance model:
```
Traceback (most recent call last):
File "C:\Users\verpeutlab\miniconda3\envs\sleap\Scripts\sleap-train-script.py", line 33, in <module>
sys.exit(load_entry_point('sleap==1.3.0', 'console_scripts', 'sleap-train')())
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\sleap\nn\training.py", line 2014, in main
trainer.train()
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\sleap\nn\training.py", line 943, in train
verbose=2,
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\keras\engine\training.py", line 1230, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\keras\callbacks.py", line 413, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\sleap\nn\callbacks.py", line 280, in on_epoch_end
figure = self.plot_fn()
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\sleap\nn\training.py", line 1348, in <lambda>
viz_fn=lambda: visualize_example(next(training_viz_ds_iter)),
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\sleap\nn\training.py", line 1328, in visualize_example
preds = find_peaks(tf.expand_dims(example["instance_image"], axis=0))
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\keras\engine\base_layer.py", line 1037, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 2071, in call
out = self.keras_model(crops)
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\keras\engine\base_layer.py", line 1020, in __call__
input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
File "C:\Users\verpeutlab\miniconda3\envs\sleap\lib\site-packages\keras\engine\input_spec.py", line 269, in assert_input_compatibility
', found shape=' + display_shape(x.shape))
ValueError: Input 0 is incompatible with layer model: expected shape=(None, 384, 384, 1), found shape=(1, 153, 153, 1)
INFO:sleap.nn.callbacks:Closing the reporter controller/context.
INFO:sleap.nn.callbacks:Closing the training controller socket/context.
```
Hi @jverpeut,
Can you retrain but keep the input scaling on the centered-instance model at 1.0? For background, this looks very similar to the error discussed here.
Can you try training the centered-instance model with an input scaling of 1 (I believe you currently have `"input_scaling": 1.75`)? There is currently an open issue #872 that appears when input scaling is anything other than 1 on the centered-instance model. Also, a heads up: if you were setting the input scaling to adjust the receptive field size, then we recommend decreasing the max output stride instead of increasing the input scaling past 1. An input scaling greater than 1 creates redundant pixels (no new features) that are passed into the network (and makes training take longer).
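For reference, the field in question lives under `data.preprocessing` in the training profile JSON (excerpted from the training-job config format shown elsewhere in this thread; other fields omitted):

```json
{
  "data": {
    "preprocessing": {
      "input_scaling": 1.0,
      "pad_to_stride": null,
      "resize_and_pad_to_target": true
    }
  }
}
```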
Thanks, Liezl
Liezl,
Changing the input scaling worked. I will close this ticket. Thank you
Jess
Hi @roomrys, I am also having this same error now when trying to train a "multi-animal top-down id" model. The training of the "centroid" model works fine, but the "centered instance" fails. The `input_scaling` param is set to 1.0 for both.
```
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: C:\Users\jai\ProjectAeon\sleap_playground\social_boys_multiclass_id_topdown\labels.v001.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1
Traceback (most recent call last):
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\Scripts\sleap-train-script.py", line 33, in <module>
sys.exit(load_entry_point('sleap==1.3.3', 'console_scripts', 'sleap-train')())
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 2013, in main
trainer = create_trainer_using_cli(args=args)
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 2005, in create_trainer_using_cli
video_search_paths=args.video_paths,
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 673, in from_config
with_track_only=is_id_model,
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 150, in from_config
with_track_only=with_track_only,
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 218, in from_labels
validation,
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\data\training.py", line 49, in split_labels_train_val
idx_train, idx_val = train_test_split(list(range(len(labels))), test_size=n_val)
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sklearn\model_selection\_split.py", line 2423, in train_test_split
n_samples, test_size, train_size, default_test_size=0.25
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sklearn\model_selection\_split.py", line 2046, in _validate_shuffle_split
"(0, 1) range".format(test_size, n_samples)
ValueError: test_size=1 should be either positive and smaller than the number of samples 0 or a float in the (0, 1) range
```
You said this in an earlier comment in this thread:
Is the problem limited to just the multiclass models (i.e. have you been able to successfully train multiclass)? Also, do you have tracks assigned to each instance for training multiclass (this is a requirement since it is how the classes/tracks are learned).
In my case, it indeed works fine with just a "multi-animal top-down" pipeline instead of a "multi-animal top-down id" pipeline. When you refer to "assigning tracks" here, what do you mean? I thought the tracker is only used when running inference, which requires a trained multi_class_topdown model, the training of which is failing for me.
FYI this error occurs for me in both v1.3.1 and v1.3.3
Oops, ok, I realized I had to assign the instances to tracks during labeling, which I've now done. However, now when trying to train the centered-instance (multi_class_topdown) model, I get the following error:
```
INFO:sleap.nn.training:Loaded test example. [2.827s]
INFO:sleap.nn.training: Input shape: (128, 128, 1)
Traceback (most recent call last):
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\Scripts\sleap-train-script.py", line 33, in <module>
sys.exit(load_entry_point('sleap==1.3.3', 'console_scripts', 'sleap-train')())
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 2014, in main
trainer.train()
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 924, in train
self.setup()
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 910, in setup
self._setup_model()
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\training.py", line 734, in _setup_model
self.model.make_model(input_shape)
File "C:\Users\jai\mambaforge\envs\sleap1.3.3\lib\site-packages\sleap\nn\model.py", line 356, in make_model
f"Could not find a feature activation for output at stride "
ValueError: Could not find a feature activation for output at stride 1.
```
Ooops, and I realized the fix for this last error was just making sure the `output_stride`s for the `"confmaps"` and `"class_vectors"` heads matched (I set them both to 2 now). So feel free to ignore these comments!
Bug description
ValueError: test_size=1 should be either positive and smaller than the number of samples 0 or a float in the (0, 1) range
Once this error occurs I have to close all of SLEAP. We are attempting to track 3 mice and have 22 labeled frames in total. My first thought is that there are not enough frames labeled, but I would like to understand the reason for this error in more detail.
Expected behaviour
Training would finish.
Actual behaviour
Error message
Your personal set up
SLEAP 1.3.0 on a personal computer with a GPU
Start-up
``` (base) C:\WINDOWS\system32>conda activate sleap1.3 (sleap1.3) C:\WINDOWS\system32>sleap-label Saving config: C:\Users\jverpeut/.sleap/1.3.0/preferences.yaml Restoring GUI state... Software versions: SLEAP: 1.3.0 TensorFlow: 2.6.3 Numpy: 1.19.5 Python: 3.7.12 OS: Windows-10-10.0.19041-SP0 Happy SLEAPing! :) ```Unrelated traceback (`DeleteSelectedInstance`)
``` Traceback (most recent call last): File "C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\gui\commands.py", line 553, in deleteSelectedInstance self.execute(DeleteSelectedInstance) File "C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\gui\commands.py", line 242, in execute command().execute(context=self, params=kwargs) File "C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\gui\commands.py", line 139, in execute self.do_with_signal(context, params) File "C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\gui\commands.py", line 163, in do_with_signal cls.do_action(context, params) File "C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\gui\commands.py", line 2474, in do_action context.labels.remove_instance(context.state["labeled_frame"], selected_inst) File "C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\io\dataset.py", line 1323, in remove_instance frame.instances.remove(instance) ValueError: list.remove(x): x not in list ```Successful top-down (centroid) training
``` Resetting monitor window. Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png Start training centroid... ['sleap-train', 'C:\\Users\\jverpeut\\AppData\\Local\\Temp\\tmpr37lzynp\\230419_132746_training_job.json', 'C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp', '--zmq', '--save_viz'] INFO:sleap.nn.training:Versions: SLEAP: 1.3.0 TensorFlow: 2.6.3 Numpy: 1.19.5 Python: 3.7.12 OS: Windows-10-10.0.19041-SP0 INFO:sleap.nn.training:Training labels file: C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp INFO:sleap.nn.training:Training profile: C:\Users\jverpeut\AppData\Local\Temp\tmpr37lzynp\230419_132746_training_job.json INFO:sleap.nn.training: INFO:sleap.nn.training:Arguments: INFO:sleap.nn.training:{ "training_job_path": "C:\\Users\\jverpeut\\AppData\\Local\\Temp\\tmpr37lzynp\\230419_132746_training_job.json", "labels_path": "C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp", "video_paths": [ "" ], "val_labels": null, "test_labels": null, "base_checkpoint": null, "tensorboard": false, "save_viz": true, "zmq": true, "run_name": "", "prefix": "", "suffix": "", "cpu": false, "first_gpu": false, "last_gpu": false, "gpu": "auto" } INFO:sleap.nn.training: INFO:sleap.nn.training:Training job: INFO:sleap.nn.training:{ "data": { "labels": { "training_labels": "C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp", "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "split_by_inds": false, "training_inds": [ 8, 11, 1, 0, 14, 10, 5, 4, 13, 7, 2, 19, 9, 3, 6, 15, 21, 16, 18, 12 ], "validation_inds": [ 17, 20 ], "test_inds": null, "search_path_hints": [ "", "" ], "skeletons": [] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 0.5, "pad_to_stride": 16, "resize_and_pad_to_target": true, "target_height": 1088, "target_width": 1456 }, "instance_cropping": { "center_on_part": null, "crop_size": null, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 16, "output_stride": 2, "filters": 16, "filters_rate": 2.0, "middle_block": true, "up_interpolate": true, "stacks": 1 }, "hourglass": null, "resnet": null, "pretrained_encoder": null }, "heads": { "single_instance": null, "centroid": { "anchor_part": null, "sigma": 2.5, "output_stride": 2, "loss_weight": 1.0, "offset_refinement": false }, "centered_instance": null, "multi_instance": null, "multi_class_bottomup": null, "multi_class_topdown": null }, "base_checkpoint": null }, "optimization": { "preload_data": true, "augmentation_config": { "rotate": true, "rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": false, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, "brightness_max_val": 10.0, "random_crop": false, "random_crop_height": 256, "random_crop_width": 256, "random_flip": false, "flip_horizontal": false }, "online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 4, "batches_per_epoch": 200, "min_batches_per_epoch": 200, "val_batches_per_epoch": 10, "min_val_batches_per_epoch": 10, "epochs": 200, "optimizer": 
"adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, "plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": false, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-08, "plateau_patience": 20 } }, "outputs": { "save_outputs": true, "run_name": "230419_132745.centroid.n=31", "run_name_prefix": "", "run_name_suffix": "", "runs_folder": "C:/Users/jverpeut/Desktop\\models", "tags": [ "" ], "save_visualizations": true, "delete_viz_images": true, "zip_outputs": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": true, "controller_address": "tcp://127.0.0.1:9000", "controller_polling_timeout": 10, "publish_updates": true, "publish_address": "tcp://127.0.0.1:9001" } }, "name": "", "description": "", "sleap_version": "1.3.0", "filename": "C:\\Users\\jverpeut\\AppData\\Local\\Temp\\tmpr37lzynp\\230419_132746_training_job.json" } INFO:sleap.nn.training: INFO:sleap.nn.training:Auto-selected GPU 0 with 2567 MiB of free memory. INFO:sleap.nn.training:Using GPU 0 for acceleration. INFO:sleap.nn.training:Disabled GPU memory pre-allocation. INFO:sleap.nn.training:System: GPUs: 1/1 available Device: /physical_device:GPU:0 Available: True Initalized: False Memory growth: True INFO:sleap.nn.training: INFO:sleap.nn.training:Initializing trainer... INFO:sleap.nn.training:Loading training labels from: C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1 INFO:sleap.nn.training: Splits: Training = 28 / Validation = 3. INFO:sleap.nn.training:Setting up for training... INFO:sleap.nn.training:Setting up pipeline builders... INFO:sleap.nn.training:Setting up model... INFO:sleap.nn.training:Building test pipeline... 2023-04-19 13:27:57.206226: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-04-19 13:27:57.937245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3619 MB memory: -> device: 0, name: Quadro P2200, pci bus id: 0000:b3:00.0, compute capability: 6.1 2023-04-19 13:27:58.814210: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2) INFO:sleap.nn.training:Loaded test example. [3.430s] INFO:sleap.nn.training: Input shape: (544, 736, 1) INFO:sleap.nn.training:Created Keras model. 
INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False) INFO:sleap.nn.training: Max stride: 16 INFO:sleap.nn.training: Parameters: 1,953,105 INFO:sleap.nn.training: Heads: INFO:sleap.nn.training: [0] = CentroidConfmapsHead(anchor_part=None, sigma=2.5, output_stride=2, loss_weight=1.0) INFO:sleap.nn.training: Outputs: INFO:sleap.nn.training: [0] = KerasTensor(type_spec=TensorSpec(shape=(None, 272, 368, 1), dtype=tf.float32, name=None), name='CentroidConfmapsHead/BiasAdd:0', description="created by layer 'CentroidConfmapsHead'") INFO:sleap.nn.training:Training from scratch INFO:sleap.nn.training:Setting up data pipelines... INFO:sleap.nn.training:Training set: n = 28 INFO:sleap.nn.training:Validation set: n = 3 INFO:sleap.nn.training:Setting up optimization... INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08) INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-08, plateau_patience=20) INFO:sleap.nn.training:Setting up outputs... INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: ) INFO:sleap.nn.training: ZMQ controller subcribed to: tcp://127.0.0.1:9000 INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set INFO:sleap.nn.training: ZMQ progress reporter publish on: tcp://127.0.0.1:9001 INFO:sleap.nn.training:Created run path: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31 INFO:sleap.nn.training:Setting up visualization... 
2023-04-19 13:28:03.893514: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -34 } dim { size: -35 } dim { size: -36 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "Quadro P2200" frequency: 1493 num_cores: 10 environment { key: "architecture" value: "6.1" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 1310720 shared_memory_size_per_multiprocessor: 98304 memory_size: 3795648512 bandwidth: 200200000 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: -37 } dim { size: -38 } dim { size: 1 } } } 2023-04-19 13:28:05.486650: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -34 } dim { size: -35 } dim { size: -36 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "Quadro P2200" frequency: 1493 num_cores: 10 environment { key: "architecture" value: "6.1" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 1310720 shared_memory_size_per_multiprocessor: 98304 memory_size: 3795648512 bandwidth: 200200000 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: -37 } dim { size: -38 } dim { size: 1 } } } INFO:sleap.nn.training:Finished trainer set up. [8.6s] INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation... INFO:sleap.nn.training:Finished creating training datasets. [5.4s] INFO:sleap.nn.training:Starting training loop... Epoch 1/200 2023-04-19 13:28:13.216876: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8201 2023-04-19 13:28:16.697417: W tensorflow/core/common_runtime/bfc_allocator.cc:338] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature. 200/200 - 70s - loss: 4.5381e-04 - val_loss: 4.5703e-04 2023-04-19 13:29:23.967433: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 
2023-04-19 13:29:24.357389: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:29:25.109251: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:29:29.028107: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.04GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:29:29.338813: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.04GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:29:33.900338: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:29:34.788627: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:29:34.805575: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:29:34.806467: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:29:34.809179: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023-04-19 13:31:34.100101: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once. 
Epoch 2/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 4.5045e-04 - val_loss: 4.5560e-04 Epoch 3/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 4.4688e-04 - val_loss: 4.4625e-04 Epoch 4/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 4.4839e-04 - val_loss: 4.5567e-04 Epoch 5/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 57s - loss: 4.4493e-04 - val_loss: 4.5874e-04 Epoch 6/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 4.4122e-04 - val_loss: 4.5623e-04 Epoch 7/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 4.3153e-04 - val_loss: 4.5450e-04 Epoch 8/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 4.2344e-04 - val_loss: 4.4870e-04 Epoch 00008: ReduceLROnPlateau reducing learning rate to 4.999999873689376e-05. Epoch 9/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 4.0219e-04 - val_loss: 4.5203e-04 Epoch 10/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 3.9436e-04 - val_loss: 4.5180e-04 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png Epoch 11/200 200/200 - 58s - loss: 3.7917e-04 - val_loss: 4.2054e-04 Epoch 12/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 3.6662e-04 - val_loss: 4.2748e-04 Epoch 13/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 3.4939e-04 - val_loss: 3.9647e-04 Epoch 14/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 3.3518e-04 - val_loss: 4.2778e-04 Epoch 15/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 3.1607e-04 - val_loss: 4.1903e-04 Epoch 16/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 3.0817e-04 - val_loss: 4.0571e-04 Epoch 17/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 2.8899e-04 - val_loss: 4.3344e-04 Epoch 18/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 2.7890e-04 - val_loss: 4.2888e-04 Epoch 00018: ReduceLROnPlateau reducing learning rate to 2.499999936844688e-05. 
Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png Epoch 19/200 200/200 - 58s - loss: 2.4180e-04 - val_loss: 4.3364e-04 Epoch 20/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 2.2873e-04 - val_loss: 4.0236e-04 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png Epoch 21/200 200/200 - 58s - loss: 2.1771e-04 - val_loss: 4.2498e-04 Epoch 22/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 2.1696e-04 - val_loss: 4.2185e-04 Epoch 23/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 2.0783e-04 - val_loss: 4.4158e-04 Epoch 24/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 1.9939e-04 - val_loss: 4.1644e-04 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png Epoch 25/200 200/200 - 58s - loss: 1.9314e-04 - val_loss: 4.4691e-04 Epoch 00025: ReduceLROnPlateau reducing learning rate to 1.249999968422344e-05. Epoch 26/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 58s - loss: 1.7297e-04 - val_loss: 4.3931e-04 Epoch 27/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 1.6557e-04 - val_loss: 4.3326e-04 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png Epoch 28/200 200/200 - 59s - loss: 1.6605e-04 - val_loss: 4.3827e-04 Epoch 29/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 1.6130e-04 - val_loss: 4.3297e-04 Epoch 30/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 1.5728e-04 - val_loss: 4.3038e-04 Epoch 31/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 1.5561e-04 - val_loss: 4.3719e-04 Epoch 32/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 1.5410e-04 - val_loss: 4.3233e-04 Epoch 00032: ReduceLROnPlateau reducing learning rate to 6.24999984211172e-06. Epoch 33/200 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png 200/200 - 59s - loss: 1.4313e-04 - val_loss: 4.4609e-04 Polling: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz\validation.*.png Epoch 00033: early stopping INFO:sleap.nn.training:Finished training loop. [35.1 min] INFO:sleap.nn.training:Deleting visualization directory: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\viz INFO:sleap.nn.training:Saving evaluation metrics to model folder... Predicting... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% ETA: -:--:-- ?2023-04-19 14:03:20.481461: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -69 } dim { size: -70 } dim { size: -71 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "Quadro P2200" frequency: 1493 num_cores: 10 environment { key: "architecture" value: "6.1" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 1310720 shared_memory_size_per_multiprocessor: 98304 memory_size: 3795648512 bandwidth: 200200000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -72 } dim { size: -73 } dim { size: 1 } } } 2023-04-19 14:03:20.483273: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 4 } dim { size: 1088 } dim { size: 1456 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2095 num_cores: 40 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.3.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 28835840 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -80 } dim { size: -81 } dim { size: 1 } } } Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% ETA: 0:00:00 22.9 FPS INFO:sleap.nn.evals:Saved predictions: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\labels_pr.train.slp INFO:sleap.nn.evals:Saved metrics: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\metrics.train.npz INFO:sleap.nn.evals:OKS mAP: 0.486954 Predicting... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% ETA: -:--:-- ?2023-04-19 14:03:24.584732: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -69 } dim { size: -70 } dim { size: -71 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "Quadro P2200" frequency: 1493 num_cores: 10 environment { key: "architecture" value: "6.1" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 1310720 shared_memory_size_per_multiprocessor: 98304 memory_size: 3795648512 bandwidth: 200200000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -72 } dim { size: -73 } dim { size: 1 } } } 2023-04-19 14:03:24.586885: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 3 } dim { size: 1088 } dim { size: 1456 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2095 num_cores: 40 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.3.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 28835840 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -80 } dim { size: -81 } dim { size: 1 } } } Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% ETA: 0:00:00 ? C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\nn\evals.py:506: RuntimeWarning: Mean of empty slice "dist.avg": np.nanmean(dists), C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\nn\evals.py:539: RuntimeWarning: Mean of empty slice. mPCK = mPCK_parts.mean() C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\numpy\core\_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\nn\evals.py:633: RuntimeWarning: Mean of empty slice. pair_pck = metrics["pck.pcks"].mean(axis=-1).mean(axis=-1) C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\numpy\core\_methods.py:163: RuntimeWarning: invalid value encountered in true_divide ret, rcount, out=ret, casting='unsafe', subok=False) C:\ProgramData\Anaconda3\envs\sleap1.3\lib\site-packages\sleap\nn\evals.py:635: RuntimeWarning: Mean of empty slice. metrics["oks.mOKS"] = pair_oks.mean() WARNING:sleap.nn.evals:Failed to compute metrics. INFO:sleap.nn.evals:Saved predictions: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31\labels_pr.val.slp INFO:sleap.nn.callbacks:Closing the reporter controller/context. INFO:sleap.nn.callbacks:Closing the training controller socket/context. 
Run Path: C:/Users/jverpeut/Desktop\models\230419_132745.centroid.n=31 Finished training centroid. ```Unsuccessful multiclass top-down (centered-instance)
``` Resetting monitor window. Polling: C:/Users/jverpeut/Desktop\models\230419_140330.multi_class_topdown.n=31\viz\validation.*.png Start training multi_class_topdown... ['sleap-train', 'C:\\Users\\jverpeut\\AppData\\Local\\Temp\\tmpomgh3qmu\\230419_140330_training_job.json', 'C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp', '--zmq', '--save_viz'] INFO:sleap.nn.training:Versions: SLEAP: 1.3.0 TensorFlow: 2.6.3 Numpy: 1.19.5 Python: 3.7.12 OS: Windows-10-10.0.19041-SP0 INFO:sleap.nn.training:Training labels file: C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp INFO:sleap.nn.training:Training profile: C:\Users\jverpeut\AppData\Local\Temp\tmpomgh3qmu\230419_140330_training_job.json INFO:sleap.nn.training: INFO:sleap.nn.training:Arguments: INFO:sleap.nn.training:{ "training_job_path": "C:\\Users\\jverpeut\\AppData\\Local\\Temp\\tmpomgh3qmu\\230419_140330_training_job.json", "labels_path": "C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp", "video_paths": [ "" ], "val_labels": null, "test_labels": null, "base_checkpoint": null, "tensorboard": false, "save_viz": true, "zmq": true, "run_name": "", "prefix": "", "suffix": "", "cpu": false, "first_gpu": false, "last_gpu": false, "gpu": "auto" } INFO:sleap.nn.training: INFO:sleap.nn.training:Training job: INFO:sleap.nn.training:{ "data": { "labels": { "training_labels": null, "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "split_by_inds": false, "training_inds": null, "validation_inds": null, "test_inds": null, "search_path_hints": [], "skeletons": [] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 1.0, "pad_to_stride": null, "resize_and_pad_to_target": true, "target_height": null, "target_width": null }, "instance_cropping": { "center_on_part": null, "crop_size": null, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 64, "output_stride": 2, "filters": 64, "filters_rate": 2.0, "middle_block": true, "up_interpolate": false, "stacks": 1 }, "hourglass": null, "resnet": null, "pretrained_encoder": null }, "heads": { "single_instance": null, "centroid": null, "centered_instance": null, "multi_instance": null, "multi_class_bottomup": null, "multi_class_topdown": { "confmaps": { "anchor_part": null, "part_names": null, "sigma": 2.5, "output_stride": 2, "loss_weight": 1.0, "offset_refinement": false }, "class_vectors": { "classes": null, "num_fc_layers": 3, "num_fc_units": 64, "global_pool": true, "output_stride": 1, "loss_weight": 1.0 } } }, "base_checkpoint": null }, "optimization": { "preload_data": true, "augmentation_config": { "rotate": true, "rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": false, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, "brightness_max_val": 10.0, "random_crop": false, "random_crop_height": 256, "random_crop_width": 256, "random_flip": true, "flip_horizontal": false }, "online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 8, "batches_per_epoch": null, "min_batches_per_epoch": 200, "val_batches_per_epoch": null, 
"min_val_batches_per_epoch": 10, "epochs": 100, "optimizer": "adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, "plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": false, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-06, "plateau_patience": 10 } }, "outputs": { "save_outputs": true, "run_name": "230419_140330.multi_class_topdown.n=31", "run_name_prefix": "", "run_name_suffix": "", "runs_folder": "C:/Users/jverpeut/Desktop\\models", "tags": [ "" ], "save_visualizations": true, "delete_viz_images": true, "zip_outputs": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": true, "controller_address": "tcp://127.0.0.1:9000", "controller_polling_timeout": 10, "publish_updates": true, "publish_address": "tcp://127.0.0.1:9001" } }, "name": "", "description": "", "sleap_version": "1.3.0", "filename": "C:\\Users\\jverpeut\\AppData\\Local\\Temp\\tmpomgh3qmu\\230419_140330_training_job.json" } INFO:sleap.nn.training: INFO:sleap.nn.training:Auto-selected GPU 0 with 2885 MiB of free memory. INFO:sleap.nn.training:Using GPU 0 for acceleration. INFO:sleap.nn.training:Disabled GPU memory pre-allocation. INFO:sleap.nn.training:System: GPUs: 1/1 available Device: /physical_device:GPU:0 Available: True Initalized: False Memory growth: True INFO:sleap.nn.training: INFO:sleap.nn.training:Initializing trainer... INFO:sleap.nn.training:Loading training labels from: C:/Users/jverpeut/Desktop/labels_2_21_DominanceOpenField.v001(1).slp INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1 Traceback (most recent call last): File "C:\ProgramData\Anaconda3\envs\sleap1.3\Scripts\sleap-train-script.py", line 33, in