@Wulin-Tan this error occurs after model training, when you're running inference on new videos. There's no reason that increasing the number of epochs should consume more memory. Are you running this from the command line, or from a jupyter notebook?
Also, how many videos do you have in the directory `/root/autodl-tmp/DLC_LP/videos`, and how big/long are they?
Hi @themattinthehatt, I ran train_hydra.py. My train_hydra.py looks like this:
"""Example model training script."""
import os
import hydra
import lightning.pytorch as pl
from omegaconf import DictConfig
from lightning_pose.utils import pretty_print_cfg, pretty_print_str
from lightning_pose.utils.io import (
check_video_paths,
return_absolute_data_paths,
return_absolute_path,
)
from lightning_pose.utils.predictions import predict_dataset
from lightning_pose.utils.scripts import (
calculate_train_batches,
compute_metrics,
export_predictions_and_labeled_video,
get_callbacks,
get_data_module,
get_dataset,
get_imgaug_transform,
get_loss_factories,
get_model,
)
@hydra.main(config_path="configs", config_name="config_mirror-mouse-example")
def train(cfg: DictConfig):
"""Main fitting function, accessed from command line."""
print("Our Hydra config file:")
pretty_print_cfg(cfg)
# path handling for toy data
data_dir, video_dir = return_absolute_data_paths(data_cfg=cfg.data)
# ----------------------------------------------------------------------------------
# Set up data/model objects
# ----------------------------------------------------------------------------------
# imgaug transform
imgaug_transform = get_imgaug_transform(cfg=cfg)
# dataset
dataset = get_dataset(cfg=cfg, data_dir=data_dir, imgaug_transform=imgaug_transform)
# datamodule; breaks up dataset into train/val/test
data_module = get_data_module(cfg=cfg, dataset=dataset, video_dir=video_dir)
# build loss factory which orchestrates different losses
loss_factories = get_loss_factories(cfg=cfg, data_module=data_module)
# model
model = get_model(cfg=cfg, data_module=data_module, loss_factories=loss_factories)
# ----------------------------------------------------------------------------------
# Set up and run training
# ----------------------------------------------------------------------------------
# logger
logger = pl.loggers.TensorBoardLogger("tb_logs", name=cfg.model.model_name)
# early stopping, learning rate monitoring, model checkpointing, backbone unfreezing
callbacks = get_callbacks(cfg, early_stopping=False)
# calculate number of batches for both labeled and unlabeled data per epoch
limit_train_batches = calculate_train_batches(cfg, dataset)
# set up trainer
trainer = pl.Trainer( # TODO: be careful with devices when scaling to multiple gpus
accelerator="gpu", # TODO: control from outside
devices=1, # TODO: control from outside
max_epochs=cfg.training.max_epochs,
min_epochs=cfg.training.min_epochs,
check_val_every_n_epoch=cfg.training.check_val_every_n_epoch,
log_every_n_steps=cfg.training.log_every_n_steps,
callbacks=callbacks,
logger=logger,
limit_train_batches=limit_train_batches,
accumulate_grad_batches=cfg.training.get("accumulate_grad_batches", 1),
profiler=cfg.training.get("profiler", None),
)
# train model!
trainer.fit(model=model, datamodule=data_module)
# ----------------------------------------------------------------------------------
# Post-training analysis
# ----------------------------------------------------------------------------------
hydra_output_directory = os.getcwd()
print("Hydra output directory: {}".format(hydra_output_directory))
# get best ckpt
best_ckpt = os.path.abspath(trainer.checkpoint_callback.best_model_path)
# check if best_ckpt is a file
if not os.path.isfile(best_ckpt):
raise FileNotFoundError("Cannot find checkpoint. Have you trained for too few epochs?")
# make unaugmented data_loader if necessary
if cfg.training.imgaug != "default":
cfg_pred = cfg.copy()
cfg_pred.training.imgaug = "default"
imgaug_transform_pred = get_imgaug_transform(cfg=cfg_pred)
dataset_pred = get_dataset(
cfg=cfg_pred, data_dir=data_dir, imgaug_transform=imgaug_transform_pred
)
data_module_pred = get_data_module(cfg=cfg_pred, dataset=dataset_pred, video_dir=video_dir)
data_module_pred.setup()
else:
data_module_pred = data_module
# ----------------------------------------------------------------------------------
# predict on all labeled frames (train/val/test)
# ----------------------------------------------------------------------------------
pretty_print_str("Predicting train/val/test images...")
# compute and save frame-wise predictions
preds_file = os.path.join(hydra_output_directory, "predictions.csv")
predict_dataset(
cfg=cfg,
trainer=trainer,
model=model,
data_module=data_module_pred,
ckpt_file=best_ckpt,
preds_file=preds_file,
)
# compute and save various metrics
try:
compute_metrics(cfg=cfg, preds_file=preds_file, data_module=data_module_pred)
except Exception as e:
print(f"Error computing metrics\n{e}")
# ----------------------------------------------------------------------------------
# predict folder of videos
# ----------------------------------------------------------------------------------
if cfg.eval.predict_vids_after_training:
pretty_print_str("Predicting videos...")
if cfg.eval.test_videos_directory is None:
filenames = []
else:
filenames = check_video_paths(
return_absolute_path(cfg.eval.test_videos_directory)
)
vidstr = "video" if (len(filenames) == 1) else "videos"
pretty_print_str(
f"Found {len(filenames)} {vidstr} to predict on (in cfg.eval.test_videos_directory)"
)
for video_file in filenames:
assert os.path.isfile(video_file)
pretty_print_str(f"Predicting video: {video_file}...")
# get save name for prediction csv file
video_pred_dir = os.path.join(hydra_output_directory, "video_preds")
video_pred_name = os.path.splitext(os.path.basename(video_file))[0]
prediction_csv_file = os.path.join(video_pred_dir, video_pred_name + ".csv")
# get save name labeled video csv
if cfg.eval.save_vids_after_training:
labeled_vid_dir = os.path.join(video_pred_dir, "labeled_videos")
labeled_mp4_file = os.path.join(
labeled_vid_dir, video_pred_name + "_labeled.mp4"
)
else:
labeled_mp4_file = None
# predict on video
export_predictions_and_labeled_video(
video_file=video_file,
cfg=cfg,
ckpt_file=best_ckpt,
prediction_csv_file=prediction_csv_file,
labeled_mp4_file=labeled_mp4_file,
trainer=trainer,
model=model,
data_module=data_module_pred,
save_heatmaps=cfg.eval.get(
"predict_vids_after_training_save_heatmaps", False
),
)
# compute and save various metrics
try:
compute_metrics(
cfg=cfg,
preds_file=prediction_csv_file,
data_module=data_module_pred,
)
except Exception as e:
print(f"Error predicting on video {video_file}:\n{e}")
continue
# ----------------------------------------------------------------------------------
# predict on OOD frames
# ----------------------------------------------------------------------------------
# update config file to point to OOD data
csv_file_ood = os.path.join(cfg.data.data_dir, cfg.data.csv_file).replace(
".csv", "_new.csv"
)
if os.path.exists(csv_file_ood):
cfg_ood = cfg.copy()
cfg_ood.data.csv_file = csv_file_ood
cfg_ood.training.imgaug = "default"
cfg_ood.training.train_prob = 1
cfg_ood.training.val_prob = 0
cfg_ood.training.train_frames = 1
# build dataset/datamodule
imgaug_transform_ood = get_imgaug_transform(cfg=cfg_ood)
dataset_ood = get_dataset(
cfg=cfg_ood, data_dir=data_dir, imgaug_transform=imgaug_transform_ood
)
data_module_ood = get_data_module(cfg=cfg_ood, dataset=dataset_ood, video_dir=video_dir)
data_module_ood.setup()
pretty_print_str("Predicting OOD images...")
# compute and save frame-wise predictions
preds_file_ood = os.path.join(hydra_output_directory, "predictions_new.csv")
predict_dataset(
cfg=cfg_ood,
trainer=trainer,
model=model,
data_module=data_module_ood,
ckpt_file=best_ckpt,
preds_file=preds_file_ood,
)
# compute and save various metrics
try:
compute_metrics(
cfg=cfg_ood, preds_file=preds_file_ood, data_module=data_module_ood
)
except Exception as e:
print(f"Error computing metrics\n{e}")
It seemed that the training went fine, but the step after training is the problem.
My videos are 2 mp4 files, 30 fps × 10 min, about 18,000 frames per video.
Hi @themattinthehatt, I found another version of train_hydra.py:
"""Example model training script."""
import hydra
from omegaconf import DictConfig
from lightning_pose.train import train
@hydra.main(config_path="configs", config_name="config_mirror-mouse-example")
def train_model(cfg: DictConfig):
"""Main fitting function, accessed from command line.
To train a model on the example dataset provided with the Lightning Pose package with this
script, run the following command from inside the lightning-pose directory
(make sure you have activated your conda environment):
python scripts/train_hydra.py
```
Note there are no arguments - this tells the script to default to the example data.
To train a model on your own dataset, overwrite the default config_path and config_name args:
```
python scripts/train_hydra.py --config-path=<PATH/TO/YOUR/CONFIGS/DIR> --config-name=<CONFIG_NAME.yaml> # noqa
```
For more information on training models, see the docs at
https://lightning-pose.readthedocs.io/en/latest/source/user_guide/training.html
"""
train(cfg)
if name == "main": train_model()
Which one is the right one?
They are both basically the same; we just refactored the original script to be a function, so no need to worry about that. But if you want, you can pull the latest updates by running `git pull` from inside the lightning-pose repo.
Those videos aren't huge, so I'm not exactly sure why there should be memory issues. You can run inference on those videos (or any other videos) separately after model training using this script here. Can you try that and let me know if it works for you? Note that you'll need to set `cfg.eval.hydra_paths` in the config file to point to the model you want to run inference with, and set `cfg.eval.test_videos_directory` in the config file to point to the video directory.
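For reference, a minimal sketch of the relevant `eval` section of the config (the values below are placeholders, not real paths):

```yaml
eval:
  # hydra output folder of the trained model you want to run inference with
  hydra_paths: ["YYYY-MM-DD/HH-MM-SS/"]
  # directory containing the videos to predict on
  test_videos_directory: /path/to/videos
```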
Hi @themattinthehatt, I think I found the answer. In the config file, `cfg.eval.predict_vids_after_training=True` is set by default, which means that when training finishes, prediction moves on automatically.
However, the problem is that training and prediction consume different amounts of GPU memory:
1. Usually training only keeps the active part of the data in memory, while its derivatives (parameters/files, etc.) are offloaded and saved to disk. In my case, training consumed only about 18 GB of GPU memory, which is why a 24 GB GPU was enough.
2. Prediction, however, loads the whole model into the GPU, which can consume more GPU memory than training. In my case, model loading before prediction took about 29 GB, which is why it gave the error that about 5 GB could not be allocated (29 - 24 = 5). Now I have switched to a 32 GB GPU, and everything went well.
3. So the conclusion is: if a GPU with enough memory is available, we can do training and prediction on the same GPU; if we are renting a GPU server and want to save some money, we can do training on a GPU with less memory but do prediction on a bigger one.
@Wulin-Tan glad you were able to find a workaround. The model itself should be the exact same size during training and inference. Training will also generally lead to a larger memory footprint (given the same batch size) since gradients will also be created.
I believe I found the source of your issue; in your config file, this part here:
```yaml
dali:
  context:
    train:
      batch_size: 16
    predict:
      sequence_length: 96
```
The field `dali.context.train.batch_size` means you're using a batch size of 16 during training, but `dali.context.predict.sequence_length` means you're using a batch size of 96 during inference. If you lower this number (maybe start at 32 or 48) then you should be able to run inference without a larger GPU (and also automatically after training). I'd recommend testing this by training a model for 10 epochs or so and doing the automatic inference afterwards; if that works without memory issues then you can delete the model and change the epochs back to their default.
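For example, a minimal sketch of that change (32 is just a starting point to test, not a recommended value):

```yaml
dali:
  context:
    train:
      batch_size: 16
    predict:
      sequence_length: 32  # lowered from 96 to reduce GPU memory during inference
```

For the quick test, you can also temporarily set `training.max_epochs` to something small (e.g. 10) in the same config.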
Hi @themattinthehatt, I tried it and it worked perfectly! Exactly what you said! Since there are so many important parameters in the config file, can you give more details about each parameter's setting / recommendation / your lab's experience in the tutorial?
And by the way, since there are lots of parameters that affect GPU memory, I think the best approach is what you mentioned before: set up / adjust those parameters (especially resizing, batch size, sequence length), then run 5-10 epochs to check whether that group of parameters works.
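As a rough sketch, the parameters I mean sit in these parts of the config (field names as I understand them from the example config file; the values are only illustrative):

```yaml
data:
  image_resize_dims:     # frames are resized to this before entering the network
    height: 256
    width: 256
training:
  train_batch_size: 16   # batch size for labeled frames during training
dali:
  context:
    train:
      batch_size: 16             # batch size for unlabeled video frames during training
    predict:
      sequence_length: 32        # frames per batch during video inference
```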
@Wulin-Tan glad it worked!
Regarding the documentation, are there any parameters not covered in the docs here, or parameters that are mentioned but you think could be explained more clearly?
Hi LP team: with the same config as below, if I run 100 epochs it works, but when I tried to run 300 epochs, it gave the error below. The GPU is an RTX 3090 with 24 GB. Any suggestions? Thank you.
Here is my config file:
And I got the error: