talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Excessive memory usage when training large SingleImageVideo project (400k+ frames) #1025

Open roomrys opened 1 year ago

roomrys commented 1 year ago

Bug description

I'm trying to train a SLEAP model with 300k training examples, and when it gets to "Building test pipeline", memory usage starts to grow. I thought that changing optimization.preload_data to false might fix it, but that didn't work.

Expected behaviour

SLEAP trains smoothly.

Actual behaviour

SLEAP freezes, or the make_base_pipeline call takes an extremely long time to finish.

The slow part is the call to LabelsReader.max_height_and_width: it asks every video for its shape, which triggers a separate SingleImageVideo._load_test_frame call for each one.
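For context, the shape probe amounts to something like the sketch below (simplified, not the actual SLEAP code; probe_max_dims and the imageio call are illustrative stand-ins for the real reader):

```python
import imageio.v3 as iio  # stand-in for SLEAP's per-image reader


def probe_max_dims(image_paths):
    """Illustrative sketch of what the shape probe costs.

    With one SingleImageVideo per labeled image, every .shape query forces
    a full image decode (via _load_test_frame), so 400k images means 400k
    reads before the pipeline is even built.
    """
    max_h, max_w = 0, 0
    for path in image_paths:
        frame = iio.imread(path)  # full decode just to learn the shape
        h, w = frame.shape[:2]
        max_h, max_w = max(max_h, h), max(max_w, w)
    return max_h, max_w
```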

Your personal set up

training config (single_instance_no_preload.json)

```
{
  "data": {
    "labels": {
      "training_labels": null,
      "validation_labels": null,
      "validation_fraction": 0.1,
      "test_labels": null,
      "split_by_inds": false,
      "training_inds": null,
      "validation_inds": null,
      "test_inds": null,
      "search_path_hints": [],
      "skeletons": []
    },
    "preprocessing": {
      "ensure_rgb": true,
      "ensure_grayscale": false,
      "imagenet_mode": null,
      "input_scaling": 1.0,
      "pad_to_stride": null,
      "resize_and_pad_to_target": true,
      "target_height": null,
      "target_width": null
    },
    "instance_cropping": {
      "center_on_part": null,
      "crop_size": null,
      "crop_size_detection_padding": 16
    }
  },
  "model": {
    "backbone": {
      "leap": null,
      "unet": {
        "stem_stride": null,
        "max_stride": 32,
        "output_stride": 4,
        "filters": 32,
        "filters_rate": 1.5,
        "middle_block": true,
        "up_interpolate": true,
        "stacks": 1
      },
      "hourglass": null,
      "resnet": null,
      "pretrained_encoder": null
    },
    "heads": {
      "single_instance": {
        "part_names": null,
        "sigma": 5.0,
        "output_stride": 4,
        "offset_refinement": false
      },
      "centroid": null,
      "centered_instance": null,
      "multi_instance": null
    }
  },
  "optimization": {
    "preload_data": false,
    "augmentation_config": {
      "rotate": true,
      "rotation_min_angle": -180.0,
      "rotation_max_angle": 180.0,
      "translate": false,
      "translate_min": -5,
      "translate_max": 5,
      "scale": true,
      "scale_min": 0.9,
      "scale_max": 1.1,
      "uniform_noise": true,
      "uniform_noise_min_val": 0.0,
      "uniform_noise_max_val": 10.0,
      "gaussian_noise": true,
      "gaussian_noise_mean": 5.0,
      "gaussian_noise_stddev": 1.0,
      "contrast": true,
      "contrast_min_gamma": 0.5,
      "contrast_max_gamma": 2.0,
      "brightness": true,
      "brightness_min_val": 0.0,
      "brightness_max_val": 10.0,
      "random_crop": false,
      "random_crop_height": 256,
      "random_crop_width": 256,
      "random_flip": false,
      "flip_horizontal": false
    },
    "online_shuffling": true,
    "shuffle_buffer_size": 128,
    "prefetch": true,
    "batch_size": 4,
    "batches_per_epoch": null,
    "min_batches_per_epoch": 200,
    "val_batches_per_epoch": null,
    "min_val_batches_per_epoch": 10,
    "epochs": 200,
    "optimizer": "adam",
    "initial_learning_rate": 0.0001,
    "learning_rate_schedule": {
      "reduce_on_plateau": true,
      "reduction_factor": 0.5,
      "plateau_min_delta": 1e-06,
      "plateau_patience": 5,
      "plateau_cooldown": 3,
      "min_learning_rate": 1e-08
    },
    "hard_keypoint_mining": {
      "online_mining": false,
      "hard_to_easy_ratio": 2.0,
      "min_hard_keypoints": 2,
      "max_hard_keypoints": null,
      "loss_scale": 5.0
    },
    "early_stopping": {
      "stop_training_on_plateau": true,
      "plateau_min_delta": 1e-06,
      "plateau_patience": 10
    }
  },
  "outputs": {
    "save_outputs": true,
    "run_name": "221027_161513",
    "run_name_prefix": "",
    "run_name_suffix": ".single_instance",
    "runs_folder": "",
    "tags": [
      ""
    ],
    "save_visualizations": true,
    "delete_viz_images": true,
    "zip_outputs": false,
    "log_to_csv": true,
    "checkpointing": {
      "initial_model": false,
      "best_model": true,
      "every_epoch": false,
      "latest_model": false,
      "final_model": false
    },
    "tensorboard": {
      "write_logs": false,
      "loss_frequency": "epoch",
      "architecture_graph": false,
      "profile_graph": false,
      "visualizations": true
    },
    "zmq": {
      "subscribe_to_controller": false,
      "controller_address": "tcp://127.0.0.1:9000",
      "controller_polling_timeout": 10,
      "publish_updates": false,
      "publish_address": "tcp://127.0.0.1:9001"
    }
  },
  "name": "",
  "description": "",
  "sleap_version": "1.1.5",
  "filename": "single_instance.json"
}
```

How to reproduce

  1. Convert the training data from COCO to SLP:
    labels = sleap.io.format.read(..., as_format='coco')
    labels.save_file(...)
  2. Run training using the single_instance_no_preload.json config provided above:
    sleap-train single_instance_no_preload.json train-all.slp
  3. No error is raised, but SLEAP freezes and memory usage keeps increasing (see the quick check sketched below this list).
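Before launching training, a quick sanity check can confirm that the converted project really contains one SingleImageVideo per labeled image. This is a sketch assuming the sleap.load_file / Labels.videos API from SLEAP 1.x, with the train-all.slp path taken from step 1:

```python
import sleap

# Load the converted project from step 1 (path assumed from the repro above).
labels = sleap.load_file("train-all.slp")

# With the current COCO importer, every labeled image gets its own
# SingleImageVideo, so the video count roughly equals the frame count.
print(f"{len(labels)} labeled frames across {len(labels.videos)} videos")
```
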
talmo commented 1 year ago

Potential workarounds:

Currently following up with Dan Butler about this.

roomrys commented 1 year ago

The culprit ended up being that SingleImageVideo caches frames to speed up switching between frames in the GUI (helpful for high-resolution images). However, with 400k labeled frames being cached not once but twice (once in SingleImageVideo.test_frame_ and again in SingleImageVideo.__data), memory grows excessively during training.

The culprits: https://github.com/talmolab/sleap/blob/5093f6992e6214c0d528b7240331b99d0a89a62f/sleap/io/video.py#L847-L861 https://github.com/talmolab/sleap/blob/5093f6992e6214c0d528b7240331b99d0a89a62f/sleap/io/video.py#L967-L980
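In miniature, the behavior looks roughly like the sketch below (simplified, not the actual video.py code; the attribute names mirror the ones cited above):

```python
import numpy as np


class SingleImageVideoSketch:
    """Simplified sketch of the double-caching behavior described above."""

    def __init__(self, filename: str):
        self.filename = filename
        self.test_frame_ = None  # first cache: used for shape queries
        self.__data = {}         # second cache: used for GUI frame access

    def _load_image(self) -> np.ndarray:
        # Stand-in for the real on-disk image decode.
        return np.zeros((1080, 1920, 3), dtype=np.uint8)

    @property
    def test_frame(self) -> np.ndarray:
        if self.test_frame_ is None:
            self.test_frame_ = self._load_image()  # copy #1 stays in RAM
        return self.test_frame_

    @property
    def shape(self):
        # Pipeline building (LabelsReader.max_height_and_width) hits this,
        # so every video loads and keeps an image before training starts.
        return self.test_frame.shape

    def get_frame(self, idx: int = 0) -> np.ndarray:
        if idx not in self.__data:
            self.__data[idx] = self._load_image()  # copy #2 stays in RAM
        return self.__data[idx]
```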

Why were there 400k SingleImageVideos in the first place?

Our current implementation for importing COCO datasets creates one SingleImageVideo per image (to handle mismatched image sizes in training, #1024). As a follow-up PR, we could modify this to create one SingleImageVideo per image size, but the caching problem would remain if many images have different sizes.
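The grouping idea amounts to something like this (a sketch only; group_images_by_size is a hypothetical helper, not part of SLEAP):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

from PIL import Image


def group_images_by_size(image_paths: List[str]) -> Dict[Tuple[int, int], List[str]]:
    """Bucket image files by (height, width) so each bucket could back one
    video-like container instead of one SingleImageVideo per image."""
    groups = defaultdict(list)
    for path in image_paths:
        with Image.open(path) as im:  # reads the header, not the full pixels
            width, height = im.size
        groups[(height, width)].append(path)
    return dict(groups)
```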

Proposed Solution

Remove default caching for SingleImageVideo and instead allow users to pass an argument (through the GUI) to enable caching. Disable caching during training.
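In code terms, the opt-in could look roughly like the reworked sketch below (illustrative only; the actual keyword name and GUI plumbing are assumptions, not the real SLEAP API):

```python
import numpy as np


class SingleImageVideoSketch:
    """Sketch of opt-in caching: off by default, enabled by the GUI."""

    def __init__(self, filename: str, caching: bool = False):
        self.filename = filename
        self.caching = caching     # training would construct with caching=False
        self._cached_frame = None

    def _load_image(self) -> np.ndarray:
        # Stand-in for the real on-disk image decode.
        return np.zeros((1080, 1920, 3), dtype=np.uint8)

    def get_frame(self, idx: int = 0) -> np.ndarray:
        # idx is effectively always 0 for a single-image video.
        if self._cached_frame is not None:
            return self._cached_frame
        frame = self._load_image()
        if self.caching:  # only the GUI path keeps the decoded image around
            self._cached_frame = frame
        return frame
```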

thejanzimmermann commented 1 year ago

Has this been resolved @roomrys? I just ran into the same issue :)

talmo commented 1 year ago

Quick update: #1243 partially fixes this.

By disabling SingleImageVideo caching, we can open projects with 10^5+ images pretty quickly now.

It doesn't solve some related issues with annotating those projects or with downstream training, which will still try to cache the images and perform other unnecessary serialization/deserialization steps.

#1242 has some fixes for this, but not all of them, and we'll work on integrating those while we work on the downstream pieces.