paninski-lab / lightning-pose

Accelerated pose estimation and tracking using semi-supervised convolutional networks.
https://lightning-pose.readthedocs.io
MIT License

training batch size and dali loading batch size mismatch #161

Closed: Wulin-Tan closed this issue 5 months ago

Wulin-Tan commented 6 months ago

Hi, lightning pose team: I have about 2,000 labeled images from each video (video_1 and video_2, about 4,000 images in total). Now I want to train the temporal model. The program ran fine for the first 21 epochs, but at the 22nd epoch it threw the error below and terminated. Any suggestions for fixing this?

Error (full log attached as Error.txt):
Epoch 22:  80%|▊| 343/428 [09:06<02:15,  0.63it/s, v_num=0, total_unsupervised_importance=0.220, train_supervised_loss=0.00342, train_heatmap_mse_loss=0.0137, train_pca_singleview_loss=4.670, train_temporal_los../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
    self._optimizer_step(batch_idx, closure)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
    call._call_lightning_module_hook(
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/core/module.py", line 1303, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/core/optimizer.py", line 152, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/strategy.py", line 239, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/optim/adam.py", line 121, in step
    loss = closure()
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 318, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/strategy.py", line 391, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_pose/models/base.py", line 416, in training_step
    loss_unsuper = self.evaluate_unlabeled(
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_pose/models/base.py", line 372, in evaluate_unlabeled
    data_dict = self.get_loss_inputs_unlabeled(batch_dict=batch_dict)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_pose/models/heatmap_tracker_mhcrnn.py", line 272, in get_loss_inputs_unlabeled
    pred_keypoints_sf, confidence_sf = self.run_subpixelmaxima(pred_heatmaps_sf)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_pose/models/heatmap_tracker.py", line 140, in run_subpixelmaxima
    heatmaps = upsample(heatmaps)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_pose/models/heatmap_tracker.py", line 50, in upsample
    inputs_up = filter2d(inputs_up, kernel, border_type="constant")
  File "/root/miniconda3/lib/python3.8/site-packages/kornia/filters/filter.py", line 109, in filter2d
    tmp_kernel = kernel[:, None, ...].to(device=input.device, dtype=input.dtype)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_hydra.py", line 87, in train
    trainer.fit(model=model, datamodule=data_module)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 1010, in _teardown
    self.strategy.teardown()
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/strategy.py", line 533, in teardown
    _optimizers_to_device(self.optimizers, torch.device("cpu"))
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/utilities/optimizer.py", line 28, in _optimizers_to_device
    _optimizer_to_device(opt, device)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/utilities/optimizer.py", line 34, in _optimizer_to_device
    optimizer.state[p] = apply_to_collection(v, Tensor, move_data_to_device, device, allow_frozen=True)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 52, in apply_to_collection
    return _apply_to_collection_slow(
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 104, in _apply_to_collection_slow
    v = _apply_to_collection_slow(
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 96, in _apply_to_collection_slow
    return function(data, *args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/utilities/apply_func.py", line 103, in move_data_to_device
    return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning_utilities/core/apply_func.py", line 64, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/lightning/fabric/utilities/apply_func.py", line 97, in batch_to
    data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/nvidia/dali/pipeline.py", line 1708, in _shutdown_pipelines
    p._shutdown()
  File "/root/miniconda3/lib/python3.8/site-packages/nvidia/dali/pipeline.py", line 358, in _shutdown
    self._pipe.Shutdown()
RuntimeError: CUDA runtime API error cudaErrorAssert (710):
device-side assert triggered
Exception ignored in: 
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/nvidia/dali/pipeline.py", line 354, in __del__
    self._shutdown()
  File "/root/miniconda3/lib/python3.8/site-packages/nvidia/dali/pipeline.py", line 358, in _shutdown
    self._pipe.Shutdown()
RuntimeError: CUDA runtime API error cudaErrorAssert (710):
device-side assert triggered
Epoch 22:  80%|▊| 343/428 [09:07<02:15,  0.63it/s, v_num=0, total_unsupervised_importance=0.220, train_supervised_loss=0.00342, train_heatmap_mse_loss=0.0137, train_pca_singleview_loss=4.670, train_temporal_los
terminate called after throwing an instance of 'dali::CUDAError'
  what():  CUDA runtime API error cudaErrorAssert (710):
device-side assert triggered
Aborted (core dumped)
config file

data:
  image_orig_dims:
    height: 2160
    width: 2160
  image_resize_dims:
    height: 512
    width: 512
  data_dir: /root/autodl-tmp/DLC_LP
  video_dir: /root/autodl-tmp/DLC_LP/videos
  csv_file: CollectedData.csv
  downsample_factor: 2
  num_keypoints: 6
  keypoint_names:
  - snout
  - forepaw_L
  - forefaw_R
  - hindpaw_L
  - hindpaw_R
  - base
  mirrored_column_matches: null
  columns_for_singleview_pca: null
training:
  imgaug: dlc
  train_batch_size: 8
  val_batch_size: 32
  test_batch_size: 32
  train_prob: 0.95
  val_prob: 0.05
  train_frames: 1
  num_gpus: 1
  num_workers: 16
  early_stop_patience: 3
  unfreezing_epoch: 20
  min_epochs: 100
  max_epochs: 300
  log_every_n_steps: 10
  check_val_every_n_epoch: 5
  gpu_id: 0
  rng_seed_data_pt: 0
  rng_seed_model_pt: 0
  lr_scheduler: multisteplr
  lr_scheduler_params:
    multisteplr:
      milestones:
      - 150
      - 200
      - 250
      gamma: 0.5
model:
  losses_to_use:
  - pca_singleview
  - temporal
  backbone: resnet50_animal_ap10k
  model_type: heatmap_mhcrnn
  heatmap_loss_type: mse
  model_name: DLC_LP
dali:
  general:
    seed: 123456
  base:
    train:
      sequence_length: 32
    predict:
      sequence_length: 96
  context:
    train:
      batch_size: 16
    predict:
      sequence_length: 96
losses:
  pca_multiview:
    log_weight: 5.0
    components_to_keep: 3
    epsilon: null
  pca_singleview:
    log_weight: 5.0
    components_to_keep: 0.99
    epsilon: null
  temporal:
    log_weight: 5.0
    epsilon: 20.0
    prob_threshold: 0.05
eval:
  hydra_paths:
  - 2024-04-11/14-57-36/
  predict_vids_after_training: true
  save_vids_after_training: false
  fiftyone:
    dataset_name: test
    model_display_names:
    - test_model
    launch_app_from_script: false
    remote: true
    address: 127.0.0.1
    port: 5151
  test_videos_directory: /root/autodl-tmp/DLC_LP/videos
  saved_vid_preds_dir: null
  confidence_thresh_for_vid: 0.9
  video_file_to_plot: null
  pred_csv_files_to_plot:
  - ' '
callbacks:
  anneal_weight:
    attr_name: total_unsupervised_importance
    init_val: 0.0
    increase_factor: 0.01
    final_val: 1.0
    freeze_until_epoch: 0
hydra:
  run:
    dir: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}
    subdir: ${hydra.job.num}

train_hydra.py

import os

import hydra
import lightning.pytorch as pl
from omegaconf import DictConfig

from lightning_pose.utils import pretty_print_cfg, pretty_print_str
from lightning_pose.utils.io import (
    check_video_paths,
    return_absolute_data_paths,
    return_absolute_path,
)
from lightning_pose.utils.predictions import predict_dataset
from lightning_pose.utils.scripts import (
    calculate_train_batches,
    compute_metrics,
    export_predictions_and_labeled_video,
    get_callbacks,
    get_data_module,
    get_dataset,
    get_imgaug_transform,
    get_loss_factories,
    get_model,
)

@hydra.main(config_path="configs", config_name="config_mirror-mouse-example")
def train(cfg: DictConfig):
    """Main fitting function, accessed from command line."""

    print("Our Hydra config file:")
    pretty_print_cfg(cfg)

    # path handling for toy data
    data_dir, video_dir = return_absolute_data_paths(data_cfg=cfg.data)

    # ----------------------------------------------------------------------------------
    # Set up data/model objects
    # ----------------------------------------------------------------------------------

    # imgaug transform
    imgaug_transform = get_imgaug_transform(cfg=cfg)

    # dataset
    dataset = get_dataset(cfg=cfg, data_dir=data_dir, imgaug_transform=imgaug_transform)

    # datamodule; breaks up dataset into train/val/test
    data_module = get_data_module(cfg=cfg, dataset=dataset, video_dir=video_dir)

    # build loss factory which orchestrates different losses
    loss_factories = get_loss_factories(cfg=cfg, data_module=data_module)

    # model
    model = get_model(cfg=cfg, data_module=data_module, loss_factories=loss_factories)

    # ----------------------------------------------------------------------------------
    # Set up and run training
    # ----------------------------------------------------------------------------------

    # logger
    logger = pl.loggers.TensorBoardLogger("tb_logs", name=cfg.model.model_name)

    # early stopping, learning rate monitoring, model checkpointing, backbone unfreezing
    callbacks = get_callbacks(cfg, early_stopping=False)

    # calculate number of batches for both labeled and unlabeled data per epoch
    limit_train_batches = calculate_train_batches(cfg, dataset)

    # set up trainer
    trainer = pl.Trainer(  # TODO: be careful with devices when scaling to multiple gpus
        accelerator="gpu",  # TODO: control from outside
        devices=1,  # TODO: control from outside
        max_epochs=cfg.training.max_epochs,
        min_epochs=cfg.training.min_epochs,
        check_val_every_n_epoch=cfg.training.check_val_every_n_epoch,
        log_every_n_steps=cfg.training.log_every_n_steps,
        callbacks=callbacks,
        logger=logger,
        limit_train_batches=limit_train_batches,
        accumulate_grad_batches=cfg.training.get("accumulate_grad_batches", 1),
        profiler=cfg.training.get("profiler", None),
    )

    # train model!
    trainer.fit(model=model, datamodule=data_module)

    # ----------------------------------------------------------------------------------
    # Post-training analysis
    # ----------------------------------------------------------------------------------
    hydra_output_directory = os.getcwd()
    print("Hydra output directory: {}".format(hydra_output_directory))
    # get best ckpt
    best_ckpt = os.path.abspath(trainer.checkpoint_callback.best_model_path)
    # check if best_ckpt is a file
    if not os.path.isfile(best_ckpt):
        raise FileNotFoundError("Cannot find checkpoint. Have you trained for too few epochs?")

    # make unaugmented data_loader if necessary
    if cfg.training.imgaug != "default":
        cfg_pred = cfg.copy()
        cfg_pred.training.imgaug = "default"
        imgaug_transform_pred = get_imgaug_transform(cfg=cfg_pred)
        dataset_pred = get_dataset(
            cfg=cfg_pred, data_dir=data_dir, imgaug_transform=imgaug_transform_pred
        )
        data_module_pred = get_data_module(cfg=cfg_pred, dataset=dataset_pred, video_dir=video_dir)
        data_module_pred.setup()
    else:
        data_module_pred = data_module

    # ----------------------------------------------------------------------------------
    # predict on all labeled frames (train/val/test)
    # ----------------------------------------------------------------------------------
    pretty_print_str("Predicting train/val/test images...")
    # compute and save frame-wise predictions
    preds_file = os.path.join(hydra_output_directory, "predictions.csv")
    predict_dataset(
        cfg=cfg,
        trainer=trainer,
        model=model,
        data_module=data_module_pred,
        ckpt_file=best_ckpt,
        preds_file=preds_file,
    )
    # compute and save various metrics
    try:
        compute_metrics(cfg=cfg, preds_file=preds_file, data_module=data_module_pred)
    except Exception as e:
        print(f"Error computing metrics\n{e}")

    # ----------------------------------------------------------------------------------
    # predict folder of videos
    # ----------------------------------------------------------------------------------
    if cfg.eval.predict_vids_after_training:
        pretty_print_str("Predicting videos...")
        if cfg.eval.test_videos_directory is None:
            filenames = []
        else:
            filenames = check_video_paths(
                return_absolute_path(cfg.eval.test_videos_directory)
            )
            vidstr = "video" if (len(filenames) == 1) else "videos"
            pretty_print_str(
                f"Found {len(filenames)} {vidstr} to predict on (in cfg.eval.test_videos_directory)"
            )

        for video_file in filenames:
            assert os.path.isfile(video_file)
            pretty_print_str(f"Predicting video: {video_file}...")
            # get save name for prediction csv file
            video_pred_dir = os.path.join(hydra_output_directory, "video_preds")
            video_pred_name = os.path.splitext(os.path.basename(video_file))[0]
            prediction_csv_file = os.path.join(video_pred_dir, video_pred_name + ".csv")
            # get save name labeled video csv
            if cfg.eval.save_vids_after_training:
                labeled_vid_dir = os.path.join(video_pred_dir, "labeled_videos")
                labeled_mp4_file = os.path.join(
                    labeled_vid_dir, video_pred_name + "_labeled.mp4"
                )
            else:
                labeled_mp4_file = None
            # predict on video
            export_predictions_and_labeled_video(
                video_file=video_file,
                cfg=cfg,
                ckpt_file=best_ckpt,
                prediction_csv_file=prediction_csv_file,
                labeled_mp4_file=labeled_mp4_file,
                trainer=trainer,
                model=model,
                data_module=data_module_pred,
                save_heatmaps=cfg.eval.get(
                    "predict_vids_after_training_save_heatmaps", False
                ),
            )
            # compute and save various metrics
            try:
                compute_metrics(
                    cfg=cfg,
                    preds_file=prediction_csv_file,
                    data_module=data_module_pred,
                )
            except Exception as e:
                print(f"Error predicting on video {video_file}:\n{e}")
                continue

    # ----------------------------------------------------------------------------------
    # predict on OOD frames
    # ----------------------------------------------------------------------------------
    # update config file to point to OOD data
    csv_file_ood = os.path.join(cfg.data.data_dir, cfg.data.csv_file).replace(
        ".csv", "_new.csv"
    )
    if os.path.exists(csv_file_ood):
        cfg_ood = cfg.copy()
        cfg_ood.data.csv_file = csv_file_ood
        cfg_ood.training.imgaug = "default"
        cfg_ood.training.train_prob = 1
        cfg_ood.training.val_prob = 0
        cfg_ood.training.train_frames = 1
        # build dataset/datamodule
        imgaug_transform_ood = get_imgaug_transform(cfg=cfg_ood)
        dataset_ood = get_dataset(
            cfg=cfg_ood, data_dir=data_dir, imgaug_transform=imgaug_transform_ood
        )
        data_module_ood = get_data_module(cfg=cfg_ood, dataset=dataset_ood, video_dir=video_dir)
        data_module_ood.setup()
        pretty_print_str("Predicting OOD images...")
        # compute and save frame-wise predictions
        preds_file_ood = os.path.join(hydra_output_directory, "predictions_new.csv")
        predict_dataset(
            cfg=cfg_ood,
            trainer=trainer,
            model=model,
            data_module=data_module_ood,
            ckpt_file=best_ckpt,
            preds_file=preds_file_ood,
        )
        # compute and save various metrics
        try:
            compute_metrics(
                cfg=cfg_ood, preds_file=preds_file_ood, data_module=data_module_ood
            )
        except Exception as e:
            print(f"Error computing metrics\n{e}")

if __name__ == "__main__":
    train()
Hardware: GPU: L20 (48GB) x 1; CPU: 20 vCPU Intel(R) Xeon(R) Platinum 8457C

So I ran the following code to check:

import os
import torch
import torch.nn.functional as F

# Set environment variables for debugging
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['HYDRA_FULL_ERROR'] = '1'

# Example upsample function with debugging
def upsample(inputs):
    assert inputs.ndim == 4, "Expected 4D tensor for upsampling"
    print(f"Upsampling input shape: {inputs.shape}")
    return F.interpolate(inputs, scale_factor=2, mode='bilinear', align_corners=True)

# Example filter2d function with debugging
def filter2d(input, kernel):
    assert kernel.ndim == 4, "Expected 4D kernel for filtering"
    assert input.ndim == 4, "Expected 4D input for filtering"
    print(f"Filtering input shape: {input.shape}, kernel shape: {kernel.shape}")
    return F.conv2d(input, kernel, padding=1, stride=1)

# Your training code with enhanced debugging
try:
    # Initialize model, optimizer, data loaders, etc.
    for epoch in range(num_epochs):
        for batch_idx, batch in enumerate(train_loader):
            try:
                # Debugging data batch
                print(f"Batch {batch_idx} shape: {batch['images'].shape}")
                # Forward pass
                outputs = model(batch)
                # Compute loss
                loss = criterion(outputs, batch)
                # Backward pass and optimization
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if batch_idx % 100 == 0:
                    print(f"Epoch [{epoch}/{num_epochs}], Step [{batch_idx}/{len(train_loader)}], Loss: {loss.item()}")
            except Exception as batch_error:
                print(f"Error in batch {batch_idx}: {batch_error}")
                import traceback
                traceback.print_exc()
except Exception as e:
    print(f"Error during training: {e}")
    import traceback
    traceback.print_exc()

and got the result:

Error during training: name 'num_epochs' is not defined
Traceback (most recent call last):
  File "/tmp/ipykernel_97824/1395678165.py", line 25, in <module>
    for epoch in range(num_epochs):
NameError: name 'num_epochs' is not defined
themattinthehatt commented 6 months ago

I haven't seen this error before. Were you by any chance able to see how much GPU memory was in use? One thing I spotted in your config file: you have

model:
  losses_to_use:
  - temporal
  - pca_singleview

though in the data portion of your config file, you have

data:
  columns_for_singleview_pca: null

which might be causing some issues. You should either remove the pca loss from the list, or list out the keypoint indices you want to use in columns_for_singleview_pca.
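For the second option, a minimal sketch of what that could look like with the six keypoints in this config (the indices below are illustrative and simply include all six; which subset makes sense depends on which keypoints co-vary rigidly enough for the PCA prior):

data:
  columns_for_singleview_pca: [0, 1, 2, 3, 4, 5]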

To test this, you can also try setting

training:
  train_frames: 100

which will train a model but only with 100 frames. That way you can debug a bit faster, without training the model on all the labeled frames.

Finally, I would also suggest trying

data:
  image_resize_dims:
    height: 384
    width: 384

to see if that provides adequate results; that might speed up training quite a bit too.

I'll note that we're currently working on some updates for situations like yours, where the original image size is very large but the animal only occupies a small portion of the frame. We'll have a two-step pipeline: an object detection network that finds the animal and crops around it, then a second pose estimation network that operates on the crop. This is still at least a month out, but I wanted to let you know it's in the pipeline.

Wulin-Tan commented 6 months ago

Hi @themattinthehatt, thank you for your advice. I figured out the problem. The error seemed like a 'traffic jam' on the GPU. I checked the config file and suspected a mismatch between train_batch_size in the training section (which I call the training batch size) and context/train/batch_size in the dali section (which I call the dali loading batch size). I set them to the same number, still within GPU memory, and the problem was solved. So do these two numbers have to be the same?

Wulin-Tan commented 6 months ago

Hearing that you are working on this update is super cool! I think this two-step idea is important. Sometimes we just need the animal's rough location/area, so we can stop at step 1. If we want more detail about the body parts, we can move on to step 2. And once step 1 has localized the animal, the step 2 prediction can be narrowed down to a smaller, specific area, which would help efficiency a lot. I also hope this feature can support multiple views / multiple animals.

themattinthehatt commented 6 months ago

What did you change the numbers to? No, they do not need to be the same, but it's possible some memory issue was cleared up if you made one or both smaller.

Wulin-Tan commented 6 months ago

@themattinthehatt As I showed in the config at the beginning of this issue: with a training batch size of 8 and a dali loading batch size of 16, training hit the error reported here at epoch 22. With a training batch size of 8 and a dali loading batch size of 8, training ran all the way to epoch 300, as set in the config.
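For reference, the working combination reported above corresponds to these two settings (values taken from this comment; as noted earlier in the thread, the two numbers do not have to match, this is simply the configuration that ran to completion here):

training:
  train_batch_size: 8
dali:
  context:
    train:
      batch_size: 8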

themattinthehatt commented 6 months ago

Cool, good to hear. It seems like maybe there was an issue with the memory. I bet if you changed the dali batch size to 10 or so it would also be fine (but that's not necessary).