Auto-crop defaults back to same size (causing OOM on cluster?)

jverpeut commented 9 months ago

In the GUI I am checking the auto-crop box for the centered-instance model, but after training that option doesn't seem to stick and crop_size defaults back to 448. Is there another value I should adjust for auto-crop to work? This threw a size-related error in 1.3.2; after updating to 1.3.3 that error is gone, but I still can't run inference through an entire video. Inference predicts on 20 frames in the GUI, but since I can't get 1.3.3 to work on my laptop, I have been using our university cluster. Could the inability to crop the video cause the TensorFlow error message I am seeing (copied below)? Any help would be appreciated.

As you will see, I am trying to track 3 animals in these videos and so far my models are doing very poorly.

Initial config:

```json { "data": { "labels": { "training_labels": "/data/jverpeut/Jaime/MultiVideolabels_2_21_2_24DominanceOpenField.v001_10.7.23.slp", "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "split_by_inds": false, "training_inds": [ 190, 74, 168, 7, 359, 208, 252, 126, 373, 315, 70, 285, 120, 317, 335, 224, 307, 182, 341, 219, 131, 184, 30, 28, 226, 250, 105, 367, 276, 358, 148, 62, 152, 134, 355, 18, 140, 272, 50, 55, 154, 342, 43, 350, 248, 204, 14, 221, 171, 6, 66, 313, 325, 218, 90, 124, 80, 115, 39, 400, 302, 56, 260, 174, 5, 277, 61, 281, 225, 324, 306, 153, 111, 139, 79, 270, 372, 59, 320, 125, 175, 390, 231, 181, 21, 177, 327, 356, 201, 114, 378, 45, 392, 381, 301, 158, 234, 300, 249, 198, 63, 178, 136, 104, 332, 279, 363, 52, 242, 15, 360, 85, 263, 192, 284, 123, 81, 12, 38, 253, 94, 274, 128, 366, 141, 308, 402, 167, 180, 211, 93, 187, 60, 143, 194, 289, 213, 193, 305, 173, 216, 261, 54, 112, 160, 142, 394, 233, 144, 240, 196, 386, 35, 293, 202, 32, 297, 244, 286, 27, 96, 265, 71, 362, 257, 87, 44, 395, 319, 357, 346, 290, 235, 384, 99, 29, 73, 9, 11, 246, 343, 388, 197, 75, 195, 58, 78, 113, 65, 163, 295, 88, 179, 214, 183, 110, 404, 34, 191, 3, 377, 259, 215, 53, 401, 238, 22, 31, 145, 209, 100, 4, 49, 304, 292, 130, 13, 273, 220, 294, 103, 287, 149, 10, 47, 146, 303, 370, 336, 205, 329, 26, 348, 222, 17, 311, 237, 312, 403, 118, 399, 368, 323, 0, 318, 77, 157, 33, 119, 46, 376, 275, 199, 387, 189, 161, 314, 57, 330, 339, 188, 169, 230, 255, 92, 51, 241, 135, 266, 375, 282, 2, 20, 95, 212, 321, 207, 326, 40, 268, 133, 228, 385, 127, 217, 129, 67, 232, 223, 108, 369, 389, 345, 229, 365, 41, 64, 334, 397, 162, 159, 82, 322, 298, 107, 254, 1, 374, 353, 200, 155, 102, 361, 310, 121, 97, 380, 383, 210, 116, 349, 379, 185, 76, 264, 42, 267, 351, 203, 291, 117, 36, 186, 364, 48, 328, 309, 68, 83, 166, 122, 239, 393, 288, 245, 371, 247, 109, 299, 69, 227, 101, 256, 296, 91, 382, 19, 37, 352, 156, 251, 176, 170, 84, 8 ], "validation_inds": 
[ 337, 23, 25, 138, 258, 262, 72, 150, 391, 137, 236, 271, 280, 164, 340, 278, 331, 347, 172, 151, 269, 206, 243, 333, 24, 98, 147, 132, 338, 283, 396, 89, 344, 106, 165, 16, 354, 316, 398, 86 ], "test_inds": null, "search_path_hints": [ "" ], "skeletons": [] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 1.0, "pad_to_stride": 1, "resize_and_pad_to_target": true, "target_height": 1088, "target_width": 1456 }, "instance_cropping": { "center_on_part": "tail base", "crop_size": null, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 32, "output_stride": 4, "filters": 24, "filters_rate": 2.0, "middle_block": false, "up_interpolate": false, "stacks": 1 }, "hourglass": null, "resnet": null, "pretrained_encoder": null }, "heads": { "single_instance": null, "centroid": null, "centered_instance": { "anchor_part": "tail base", "part_names": [ "nose", "R front paw", "L front paw", "centroid", "R rear paw", "L rear paw", "tail base", "tail mid", "tail tip" ], "sigma": 2.5, "output_stride": 4, "loss_weight": 1.0, "offset_refinement": false }, "multi_instance": null, "multi_class_bottomup": null, "multi_class_topdown": null }, "base_checkpoint": null }, "optimization": { "preload_data": true, "augmentation_config": { "rotate": true, "rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": false, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, "brightness_max_val": 10.0, "random_crop": false, "random_crop_height": 256, "random_crop_width": 256, "random_flip": false, "flip_horizontal": false }, 
"online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 4, "batches_per_epoch": 273, "min_batches_per_epoch": 200, "val_batches_per_epoch": 30, "min_val_batches_per_epoch": 10, "epochs": 200, "optimizer": "adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, "plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": false, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-08, "plateau_patience": 10 } }, "outputs": { "save_outputs": true, "run_name": "231008_155608.centered_instance.n=405", "run_name_prefix": "", "run_name_suffix": "", "runs_folder": "/data/jverpeut/Jaime/models", "tags": [ "" ], "save_visualizations": true, "delete_viz_images": true, "zip_outputs": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": true, "controller_address": "tcp://127.0.0.1:9000", "controller_polling_timeout": 10, "publish_updates": true, "publish_address": "tcp://127.0.0.1:9001" } }, "name": "", "description": "", "sleap_version": "1.3.3", "filename": "/data/jverpeut/Jaime/models/231008_155608.centered_instance.n=405/initial_config.json" } ```

Post-training config:

```json { "data": { "labels": { "training_labels": "/data/jverpeut/Jaime/MultiVideolabels_2_21_2_24DominanceOpenField.v001_10.7.23.slp", "validation_labels": null, "validation_fraction": 0.1, "test_labels": null, "split_by_inds": false, "training_inds": [ 394, 225, 135, 335, 143, 260, 403, 168, 385, 375, 347, 50, 33, 273, 267, 191, 316, 332, 206, 195, 275, 295, 182, 186, 5, 190, 216, 321, 370, 47, 92, 3, 229, 101, 265, 396, 2, 70, 130, 307, 60, 194, 112, 24, 313, 9, 317, 43, 376, 203, 278, 309, 146, 91, 390, 133, 280, 121, 30, 185, 379, 373, 44, 297, 152, 223, 31, 222, 20, 125, 384, 57, 179, 178, 318, 202, 41, 61, 37, 393, 171, 21, 183, 291, 256, 138, 377, 198, 7, 368, 234, 374, 141, 119, 288, 39, 241, 388, 62, 323, 0, 204, 252, 269, 214, 88, 286, 246, 284, 73, 305, 157, 211, 196, 48, 271, 175, 122, 77, 147, 4, 398, 45, 124, 174, 245, 289, 293, 156, 221, 110, 148, 300, 364, 32, 27, 218, 35, 97, 361, 343, 400, 36, 322, 310, 255, 285, 1, 244, 151, 358, 357, 140, 327, 184, 144, 118, 352, 228, 54, 257, 339, 344, 55, 53, 306, 219, 217, 235, 369, 329, 187, 99, 15, 16, 114, 360, 102, 25, 247, 116, 142, 176, 308, 304, 226, 397, 239, 401, 28, 106, 96, 111, 250, 233, 337, 85, 177, 136, 166, 74, 392, 324, 220, 341, 173, 137, 104, 383, 224, 87, 189, 279, 120, 160, 283, 10, 320, 161, 292, 353, 315, 290, 200, 325, 355, 312, 90, 180, 356, 294, 301, 303, 340, 299, 108, 154, 197, 150, 109, 34, 131, 240, 67, 89, 18, 215, 172, 105, 58, 362, 199, 94, 243, 391, 113, 330, 236, 167, 169, 389, 165, 287, 351, 93, 296, 262, 127, 372, 22, 56, 14, 63, 181, 68, 386, 52, 404, 268, 139, 59, 17, 402, 378, 78, 83, 49, 314, 72, 100, 231, 261, 363, 129, 270, 23, 95, 272, 282, 84, 64, 145, 248, 82, 207, 331, 249, 263, 380, 345, 342, 134, 232, 66, 238, 11, 338, 258, 188, 359, 212, 259, 208, 237, 79, 326, 348, 281, 126, 311, 123, 81, 395, 192, 170, 75, 103, 12, 76, 164, 367, 371, 80, 128, 46, 51, 115, 350, 354, 254, 209, 153, 349, 381, 71, 266, 19, 86, 264, 8, 205, 29, 201, 277, 274 ], 
"validation_inds": [ 242, 40, 399, 213, 302, 251, 162, 6, 107, 193, 382, 328, 230, 253, 155, 42, 333, 276, 163, 117, 227, 132, 346, 26, 158, 149, 210, 334, 65, 336, 298, 365, 366, 13, 38, 387, 159, 319, 98, 69 ], "test_inds": null, "search_path_hints": [ "", "" ], "skeletons": [ { "directed": true, "graph": { "name": "Skeleton-1", "num_edges_inserted": 19 }, "links": [ { "edge_insert_idx": 12, "key": 0, "source": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "nose", 1.0 ] } }, "target": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "centroid", 1.0 ] } }, "type": { "py/reduce": [ { "py/type": "sleap.skeleton.EdgeType" }, { "py/tuple": [ 1 ] } ] } }, { "edge_insert_idx": 16, "key": 0, "source": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "R front paw", 1.0 ] } }, "target": { "py/id": 2 }, "type": { "py/id": 3 } }, { "edge_insert_idx": 15, "key": 0, "source": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "L front paw", 1.0 ] } }, "target": { "py/id": 2 }, "type": { "py/id": 3 } }, { "edge_insert_idx": 5, "key": 0, "source": { "py/id": 2 }, "target": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "tail base", 1.0 ] } }, "type": { "py/id": 3 } }, { "edge_insert_idx": 17, "key": 0, "source": { "py/id": 2 }, "target": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "L rear paw", 1.0 ] } }, "type": { "py/id": 3 } }, { "edge_insert_idx": 18, "key": 0, "source": { "py/id": 2 }, "target": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "R rear paw", 1.0 ] } }, "type": { "py/id": 3 } }, { "edge_insert_idx": 6, "key": 0, "source": { "py/id": 6 }, "target": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "tail mid", 1.0 ] } }, "type": { "py/id": 3 } }, { "edge_insert_idx": 7, "key": 0, "source": { "py/object": "sleap.skeleton.Node", "py/state": { "py/tuple": [ "tail tip", 1.0 ] } }, "target": { "py/id": 9 }, "type": { "py/id": 
3 } } ], "multigraph": true, "nodes": [ { "id": { "py/id": 1 } }, { "id": { "py/id": 4 } }, { "id": { "py/id": 5 } }, { "id": { "py/id": 2 } }, { "id": { "py/id": 8 } }, { "id": { "py/id": 7 } }, { "id": { "py/id": 6 } }, { "id": { "py/id": 9 } }, { "id": { "py/id": 10 } } ] } ] }, "preprocessing": { "ensure_rgb": false, "ensure_grayscale": false, "imagenet_mode": null, "input_scaling": 1.0, "pad_to_stride": 1, "resize_and_pad_to_target": true, "target_height": 1088, "target_width": 1456 }, "instance_cropping": { "center_on_part": "tail base", "crop_size": 448, "crop_size_detection_padding": 16 } }, "model": { "backbone": { "leap": null, "unet": { "stem_stride": null, "max_stride": 32, "output_stride": 4, "filters": 24, "filters_rate": 2.0, "middle_block": false, "up_interpolate": false, "stacks": 1 }, "hourglass": null, "resnet": null, "pretrained_encoder": null }, "heads": { "single_instance": null, "centroid": null, "centered_instance": { "anchor_part": "tail base", "part_names": [ "nose", "R front paw", "L front paw", "centroid", "R rear paw", "L rear paw", "tail base", "tail mid", "tail tip" ], "sigma": 2.5, "output_stride": 4, "loss_weight": 1.0, "offset_refinement": false }, "multi_instance": null, "multi_class_bottomup": null, "multi_class_topdown": null }, "base_checkpoint": null }, "optimization": { "preload_data": true, "augmentation_config": { "rotate": true, "rotation_min_angle": -180.0, "rotation_max_angle": 180.0, "translate": false, "translate_min": -5, "translate_max": 5, "scale": false, "scale_min": 0.9, "scale_max": 1.1, "uniform_noise": false, "uniform_noise_min_val": 0.0, "uniform_noise_max_val": 10.0, "gaussian_noise": false, "gaussian_noise_mean": 5.0, "gaussian_noise_stddev": 1.0, "contrast": false, "contrast_min_gamma": 0.5, "contrast_max_gamma": 2.0, "brightness": false, "brightness_min_val": 0.0, "brightness_max_val": 10.0, "random_crop": false, "random_crop_height": 256, "random_crop_width": 256, "random_flip": false, "flip_horizontal": 
false }, "online_shuffling": true, "shuffle_buffer_size": 128, "prefetch": true, "batch_size": 4, "batches_per_epoch": 273, "min_batches_per_epoch": 200, "val_batches_per_epoch": 30, "min_val_batches_per_epoch": 10, "epochs": 200, "optimizer": "adam", "initial_learning_rate": 0.0001, "learning_rate_schedule": { "reduce_on_plateau": true, "reduction_factor": 0.5, "plateau_min_delta": 1e-06, "plateau_patience": 5, "plateau_cooldown": 3, "min_learning_rate": 1e-08 }, "hard_keypoint_mining": { "online_mining": false, "hard_to_easy_ratio": 2.0, "min_hard_keypoints": 2, "max_hard_keypoints": null, "loss_scale": 5.0 }, "early_stopping": { "stop_training_on_plateau": true, "plateau_min_delta": 1e-08, "plateau_patience": 10 } }, "outputs": { "save_outputs": true, "run_name": "231008_155608.centered_instance.n=405", "run_name_prefix": "", "run_name_suffix": "", "runs_folder": "/data/jverpeut/Jaime/models", "tags": [ "" ], "save_visualizations": true, "delete_viz_images": true, "zip_outputs": false, "log_to_csv": true, "checkpointing": { "initial_model": false, "best_model": true, "every_epoch": false, "latest_model": false, "final_model": false }, "tensorboard": { "write_logs": false, "loss_frequency": "epoch", "architecture_graph": false, "profile_graph": false, "visualizations": true }, "zmq": { "subscribe_to_controller": true, "controller_address": "tcp://127.0.0.1:9000", "controller_polling_timeout": 10, "publish_updates": true, "publish_address": "tcp://127.0.0.1:9001" } }, "name": "", "description": "", "sleap_version": "1.3.3", "filename": "/data/jverpeut/Jaime/models/231008_155608.centered_instance.n=405/training_config.json" } ```
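For cropping, the relevant difference between the two configs above is `data.instance_cropping.crop_size`, which is `null` in the initial config and `448` in the post-training config. A minimal sketch for checking that one field is shown here; the helper names `get_nested` and `crop_size_change` are hypothetical, and the two small dictionaries stand in for the full `initial_config.json` and `training_config.json` files.

```python
import json
from functools import reduce


def get_nested(config, dotted_key):
    """Walk a nested dict using a dotted key such as 'a.b.c'."""
    return reduce(lambda node, key: node[key], dotted_key.split("."), config)


def crop_size_change(initial_cfg, trained_cfg,
                     key="data.instance_cropping.crop_size"):
    """Return the value of `key` before and after training as a tuple."""
    return get_nested(initial_cfg, key), get_nested(trained_cfg, key)


# Stand-ins for the two config files; only the field of interest is kept.
initial = json.loads('{"data": {"instance_cropping": {"crop_size": null}}}')
trained = json.loads('{"data": {"instance_cropping": {"crop_size": 448}}}')

before, after = crop_size_change(initial, trained)
print(f"crop_size: {before} -> {after}")  # prints "crop_size: None -> 448"
```

To run this against the real files, replace the two `json.loads(...)` calls with `json.load(open(path))` for each config in the run folder.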

Error message:

```bash ===================================================================== This module is intended solely for building or source activating user python environments, i.e., mamba create -n myenv -c conda-forge or source activate myenv To list available environments, run: mamba info --envs See our docs: https://links.asu.edu/solpy Any other use is NOT TESTED. ===================================================================== ++ sed 's/'\''//g' ++ sed '5q;d' /data/jverpeut/Jaime/Filelist.txt + VIDEO_PATH=/data/jverpeut/Jaime/Videos/2_7_DominanceOpenField_C229RC229LC229RL.mp4 + sleap-track /data/jverpeut/Jaime/Videos/2_7_DominanceOpenField_C229RC229LC229RL.mp4 -m 231008_155608.centered_instance.n=405 -m 231008_145056.centroid.n=405 -n 3 --tracking.tracker flow --tracking.pre_cull_to_target 3 --tracking.post_connect_single_breaks 1 --tracking.similarity instance --tracking.target_instance_count 3 --tracking.match hungarian 2023-10-08 21:22:47.227609: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 
2023-10-08 21:22:50.642322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78978 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:41:00.0, compute capability: 8.0 2023-10-08 21:23:01.588937: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201 2023-10-08 21:30:44.903308: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:30:52.292461: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:32:52.914209: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:33:53.993039: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:35:15.329994: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:35:21.979541: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:43:32.434415: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:52:22.265157: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:52:29.608579: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:52:48.091805: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:53:31.049553: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:53:41.472349: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:53:42.969871: W 
tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:53:46.981291: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:54:07.510827: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:54:09.535666: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:54:15.877706: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:54:18.138306: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:54:40.423754: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:54:48.877853: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 21:58:55.050651: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:02:31.410325: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:02:31.515455: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:02:43.945052: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:02:47.321172: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:03:40.873920: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:07:26.805396: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:07:31.402690: W 
tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:08:14.323231: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:08:15.672661: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:08:21.117856: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:08:46.522644: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:08:56.899609: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:10:21.956768: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:10:22.082383: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:10:22.576732: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:10:28.644099: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:10:29.740268: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:10:42.112307: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:10:53.671171: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:12.347726: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:21.348641: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:33.285381: W 
tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:33.408270: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:33.928642: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:34.228205: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:35.309356: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:35.712713: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:36.139159: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:38.267027: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:46.721702: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:11:47.145198: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:20:30.367670: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:24:09.928652: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled 2023-10-08 22:24:19.760682: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled /var/spool/slurmd/job10250462/slurm_script: line 27: 4038485 Killed sleap-track "$VIDEO_PATH" -m 231008_155608.centered_instance.n=405 -m 231008_145056.centroid.n=405 -n 3 --tracking.tracker flow --tracking.pre_cull_to_target 3 --tracking.post_connect_single_breaks 1 --tracking.similarity instance 
--tracking.target_instance_count 3 --tracking.match hungarian slurmstepd: error: Detected 1 oom_kill event in StepId=10250462.batch. Some of the step tasks have been OOM Killed. ```

roomrys commented 9 months ago

Hi @jverpeut,

The auto-crop function finds the largest labeled instance and crops to that instance's size. It is entirely possible that 448 is the largest size that has been labeled, so auto-crop arrives at the same size again. One test would be to label another instance that is far too large and confirm that auto-crop picks it up, although I think this is unnecessary unless you are sure your largest labeled instance cannot be 448 pixels.
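As a rough illustration of that idea, the sketch below takes the longest side of any labeled instance's bounding box, adds the detection padding, and rounds up to a multiple of the model's maximum stride. This is a simplified stand-in under stated assumptions, not SLEAP's actual implementation: the function name `auto_crop_size` is hypothetical, the rounding to the stride is an assumption, and the defaults of 16 and 32 are taken from `crop_size_detection_padding` and `max_stride` in the configs above.

```python
import math


def auto_crop_size(instances, padding=16, max_stride=32):
    """Estimate a square crop size from labeled instances.

    `instances` is a list of instances, each a list of (x, y) points in
    pixels; points with NaN coordinates are treated as not labeled.
    """
    longest = 0.0
    for points in instances:
        visible = [(x, y) for x, y in points
                   if not (math.isnan(x) or math.isnan(y))]
        if not visible:
            continue
        xs = [x for x, _ in visible]
        ys = [y for _, y in visible]
        # Longest side of this instance's bounding box.
        longest = max(longest, max(xs) - min(xs), max(ys) - min(ys))
    # Assumption: pad, then round up to a multiple of the maximum stride.
    padded = longest + padding
    return int(math.ceil(padded / max_stride) * max_stride)


# Hypothetical instance spanning 420 px: 420 + 16 = 436, rounded up to 448.
print(auto_crop_size([[(0.0, 0.0), (420.0, 100.0)]]))  # prints 448
```

Under these assumptions, any largest instance between roughly 400 and 432 pixels long would give the same crop size of 448, which would explain why re-running auto-crop on the same labels returns the same value.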

The error you are getting indicates an out-of-memory (OOM) issue: your job on the cluster used more memory than was allocated to it. This is explained more in-depth here. Are you able to allocate more memory for your job? Also, do you know how much memory was allocated when you received this error?
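The OOM kill shows up only in the last line of the log, after a long run of TensorFlow warnings, so it can be easy to miss. A minimal sketch for pulling it out of a saved job log is shown here; the function name `find_oom_kills` is hypothetical, and the pattern is written against the single `slurmstepd` line in the log above (the plural "events" form is an assumption).

```python
import re

# Matches lines like:
# "slurmstepd: error: Detected 1 oom_kill event in StepId=10250462.batch."
OOM_PATTERN = re.compile(r"Detected (\d+) oom_kill events? in StepId=(\d+)\.(\w+)")


def find_oom_kills(log_text):
    """Return (event_count, job_id, step_name) for each OOM report found."""
    return [(int(count), job_id, step)
            for count, job_id, step in OOM_PATTERN.findall(log_text)]


log = ("slurmstepd: error: Detected 1 oom_kill event in "
       "StepId=10250462.batch. Some of the step tasks have been OOM Killed.")
print(find_oom_kills(log))  # prints [(1, '10250462', 'batch')]
```

With the job ID in hand, the scheduler's accounting can usually report how much memory was requested and how much was used at peak (on SLURM, for example, `sacct -j 10250462 --format=JobID,MaxRSS,ReqMem,State`), which helps in choosing a larger `--mem` request for the next run.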

Thanks, Liezl

jverpeut commented 9 months ago

Liezl, thank you, that clarifies that function for me. I measured it in a different program and it is 448 pixels.

I spoke with our IT staff about the memory needs and fixed the OOM error. Thank you.

Jess

roomrys commented 9 months ago

Ok neat, I'll close this and convert to a discussion since it might help others!

talmolab / sleap

Auto-crop defaults back to same size (causing OOM on cluster?) #1537