JamilHanouneh commented 5 months ago

I attempted to train a model on a hand-tracking dataset. I experimented with all available backbones, loss functions, and other hyperparameters, but the results were consistently poor. The output videos showed no detectable hand movements, even though my dataset is intact and free from issues. The model failed to produce meaningful results. Could you identify potential issues or suggest improvements to achieve better performance? As a side note, I had better results when I used the version of lightning utilities 0.10.1

themattinthehatt commented 5 months ago

I @JamilHanouneh, can you send some info for me?

your config.yaml file used to train the models
how many labeled frames you have

themattinthehatt commented 5 months ago

also, can you clarify what you mean by "lightning utilities 0.10.1"? what package does this reference? I know of lightning and lightning-pose but I'm not sure what the "utilities" references.

JamilHanouneh commented 5 months ago

for the labeled frames there are 13735 and for config this is:

data:

dimensions of training images

image_orig_dims: height: 540 width: 720

resize dimensions to streamline model creation

image_resize_dims: height: 384 width: 512

ABSOLUTE path to data directory

data_dir: /home/nsquared6/Desktop/users/Jamil/Done

ABSOLUTE path to unlabeled videos' directory

video_dir: /home/nsquared6/Desktop/users/Jamil/Done/videos

location of labels; this should be relative to `data_dir`

csv_file: CollectedData.csv

downsample heatmaps - 2 | 3 was 2

downsample_factor: 2

total number of keypoints

num_keypoints: 21

keypoint names

keypoint_names:

Thumb3
Wrist
Thumb0
Thumb1
Thumb2
Index0
Index1
Index2
Index3
Middle0
Middle1
Middle2
Middle3
Ring0
Ring1
Ring2
Ring3
Pinkie0
Pinkie1
Pinkie2
Pinkie3 mirrored_column_matches: null # I think I don't need to use it because my data are not mirrored
list of indices of keypoints used for pca singleview loss (use order of labels file)

columns_for_singleview_pca: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

training:

select from one of several predefined image/video augmentation pipelines

default- resizing only

dlc- imgaug pipeline implemented in DLC 2.0 package

dlc-top-down- dlc augmentations plus vertical and horizontal flips

imgaug: dlc

batch size of labeled data during training

train_batch_size: 16 # 16

batch size of labeled data during validation

val_batch_size: 32 # 32

batch size of labeled data during test

test_batch_size: 32 # 32

fraction of labeled data used for training

train_prob: 0.95

fraction of labeled data used for validation (remaining used for test)

val_prob: 0.05

<=1 - fraction of total train frames (determined by `train_prob`) used for training

>1 - number of total train frames used for training

train_frames: 1

number of gpus to train a single model

num_gpus: 1

number of cpu workers for data loaders

num_workers: 4

epochs over which to assess validation metrics for early stopping

early_stop_patience: 3

epoch at which backbone network weights begin updating

unfreezing_epoch: 20

max training epochs; training may exit before due to early stopping

min_epochs: 300 max_epochs: 300

frequency to log training metrics (one step is one batch)

log_every_n_steps: 10

frequency to log validation metrics

check_val_every_n_epoch: 5

select gpu for training

gpu_id: 0

rng seed for labeled batches

rng_seed_data_pt: 0

rng seed for weight initialization

rng_seed_model_pt: 0

learning rate scheduler

multisteplr | [todo - reducelronplateau]

lr_scheduler: multisteplr lr_scheduler_params: multisteplr: milestones: [150, 200, 250] gamma: 0.5

model:

list of unsupervised losses:

- "pca_singleview"

- "pca_multiview"

- "temporal"

- "unimodal_mse"

- "unimodal_kl"

- null: if null, assume a model is supervised.

losses_to_use: [pca_singleview, temporal]

backbone network:

resnet18 | resnet34 | resnet50 | resnet101 | resnet152 | resnet50_contrastive

resnet50_animalpose_apose | resnet50_animal_ap10k

resnet50_human_jhmdb | resnet50_human_res_rle | resnet50_human_top_res

efficientnet_b0 | efficientnet_b1 | efficientnet_b2

backbone: resnet50_human_top_res

prediction mode: regression | heatmap | heatmap_mhcrnn (context)

model_type: heatmap

which heatmap loss to use: mse | kl | js

heatmap_loss_type: mse

directory name for model saving

model_name: test # human_pose_experiment_1

dali: general: seed: 123456 # Keep the same for reproducibility, or change if you like

base: train: sequence_length: 32 # Good starting point, tweak later if needed predict: sequence_length: 96 # Adapt based on how you'll use the model for predictions

context: train: batch_size: 8 # Start smaller, adjust based on your GPU memory 8 predict: sequence_length: 96 # Adapt based on how you'll use the model for predictions

losses:

loss = projection onto the discarded eigenvectors

pca_multiview:

weight in front of PCA loss

log_weight: 5.0
# predictions should lie within the low-d subspace spanned by these components
components_to_keep: 3
# absolute error (in pixels) below which pca loss is zeroed out; if null, an empirical
# epsilon is computed using the labeled data
epsilon: null

loss = projection onto the discarded eigenvectors

pca_singleview:

weight in front of PCA loss

log_weight: 5.0
# predictions should lie within the low-d subspace spanned by components that describe this fraction of variance
components_to_keep: 0.99
# absolute error (in pixels) below which pca loss is zeroed out; if null, an empirical
# epsilon is computed using the labeled data
epsilon: null

loss = norm of distance between successive timepoints

temporal:

weight in front of temporal loss

log_weight: 5.0
# for epsilon insensitive rectification
# (in pixels; diffs below this are not penalized)
epsilon: 20.0
# nan removal value.
# (in prob; heatmaps with max prob values are removed)
prob_threshold: 0.05

eval:

paths to the hydra config files in the output folder, OR absolute paths to such folders.

hydra_paths: [" "]

predict?

predict_vids_after_training: true

save labeled .mp4?

save_vids_after_training: true fiftyone:

will be the name of the dataset (Mongo DB) created by FiftyOne. for video dataset, we will append dataset_name + "_video"

dataset_name: test
# if you want to manually provide a different model name to be displayed in FiftyOne
model_display_names: ["test_model"]
# whether to launch the app from the script (True), or from ipython (and have finer control over the outputs)
launch_app_from_script: false

remote: true # for LAI, must be False
address: 127.0.0.1 # ip to launch the app on.
port: 5151 # port to launch the app on.

str with an absolute path to a directory containing videos for prediction.

set to null to skip automatic video prediction from train_hydra.py script

test_videos_directory: /home/nsquared6/Desktop/users/Jamil/Done/videos

str with an absolute path to directory in which you want to save .csv with predictions

saved_vid_preds_dir: /home/nsquared6/Desktop/users/Jamil/Done/saved_vid_preds_dir

confidence threshold for plotting a vid

confidence_thresh_for_vid: 0.90

str with absolute path to the video file you want plotted with keypoints

video_file_to_plot: null

a list of strings, each points to a .csv file with predictions of a given model to the same video. will be combined with video_file_to_plot to make a visualization

pred_csv_files_to_plot: [" "]

callbacks: anneal_weight: attr_name: total_unsupervised_importance # This is the attribute that the callback is modifying. In this case, it’s total_unsupervised_importance. init_val: 0.0 # This is the initial value of the attribute. Here, it’s set to 0.0. increase_factor: 0.01 # This is the factor by which the attribute’s value is increased at each step. Here, it’s 0.01. final_val: 1.0 # This is the final value that the attribute should reach. Here, it’s 1.0. freeze_until_epoch: 0 # This parameter specifies the number of epochs for which the attribute’s value should remain at its initial value. Here, it’s 0, which means the attribute’s value starts increasing from the very first epoch.

hydra: run: dir: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S} sweep: dir: multirun/${now:%Y-%m-%d}/${now:%H-%M-%S} subdir: ${hydra.job.num}

themattinthehatt commented 5 months ago

oh wow you have a lot of labeled frames! yes very surprising you're not seeing good results. is it possible to share one or more screenshots of your images? I'm curious how much variability you have across the frames.

A couple suggestions:

backbone: resnet50_human_top_res is for full human pose estimation, I'm guessing it doesn't perform so well on close-ups of hands. you might just try resnet50, which is pretrained on imagenet
image_resize_dims: try 384x384 or 256x256 - sometimes if the resizing is too big, the heatmaps are too big, and the model doesn't learn as well. i've never seen it fail completely, but i'm curious if you try smaller values what you see. with smaller resize dims you can also increase your training batch size, which will let the model train faster

are you using tensorboard to monitor training? would be useful to see the loss curves for your previous models as well as after making these changes, that's a good way to see if there is obvious lack of learning.

another comment on training: we typically train with <=1k frames, so the epoch numbers are tuned to that scale a bit. since you have so many frames one epoch is a lot of data. so you could also try changing the following (this is just based on intuition):

training
  unfreeze_epoch: 1
  min_epochs: 100
  max_epochs: 100
  check_val_every_n_epoch: 1
  lr_scheduler_params:
    multisteplr:
    milestones: [40, 60, 80]
    gamma: 0.5

I think it's best to get something workable with a fully-supervised model first, then we can think about the unsupervised losses.

JamilHanouneh commented 5 months ago

Thanks, it works, and I could get better results, but I have two questions:

Can you explain why you changed these parameters (milestones, check_val_every_n_epoch, unfreeze_epoch)? I want to get even better results. Do you have any further suggestions?

themattinthehatt commented 5 months ago

Glad to hear! Did you end up changing the backbone and the resizing dims? I would be surprised if the trraining params alone led to much better results.

unfreeze_epoch: the network has two components - the backbone (usually a resnet-50) and the head (which maps the backbone features to actual per-keypoint heatmaps). The head is always randomly initialized. The backbone is usually initialized with pretrained weights. When training we freeze the backbone weights for some number of epochs, to let weights of the head learn something meaningful first. Then we unfreeze the backbone and let all the weights of the model train. It's an open question of what the right epoch to unfreeze the backbone is, but in your case you have so many labeled images that by the time you get through one epoch the head weights are probably already in a good enough place to unfreeze the backbone.

check_val_every_n_epoch: this won't affect model training, this is just how often the validation data is run through the model for logging. Since I suggested you decrease the overall number of epochs (due to the large size of the dataset) I figured you might as well log the validation loss every epoch instead of every 5.

milestones: we use the Adam optimizer for learning, which dynamically updates the step size for each weight during training, but there is still an overall learning rate that needs to be set. There are many schemes for setting the learning rate, or changing it over time, and the one that we use is just to periodically halve the learning rate. So with milestones=[40, 60, 80] you are halving the overall learning rate at each of those milestones, which allows the network to settle down in local minima easier. If the learning rate is too big you keep jumping over local minima; if it's too small you might get stuck in a bad local minimum and never escape.

As far as further suggestions, I'll wait until I hear back about backbone and resizing dims before making any other suggestions. BTW how long did it take to train? And what type of GPU are you using?

Wulin-Tan commented 5 months ago

Hi, @themattinthehatt this issue is interesting. Can you give more details about the backbone in LP? like what kinds of backbone we can incorporate into LP model?

themattinthehatt commented 5 months ago

@Wulin-Tan we offer a decent number of backbone options, though we've only thoroughly explored resnet-50s. I updated the documention so you can see more of the available options.

@JamilHanouneh in looking up the refs for some of the other backbones I stumbled across this page showing a decent number of backbones that have been pretrained on hand data; it would be quite easy for you to update the backbone code in lightning pose in order to use one of these. You can see how we incorporate other backbones from this same set of MMPose models here.

Are you by any chance using one of these publicly available datasets?

themattinthehatt commented 5 months ago

@YitingChang @JamilHanouneh since you both are interested in tracking hands I figured I'd take some time next week to add one of the pretrained hand backbones into the repo; will update you here when that's done

themattinthehatt commented 5 months ago

@YitingChang @JamilHanouneh I just added a backbone pretrained on the OneHand10k dataset from MMPose.

To use this backbone, first run git pull from inside your lightning-pose repo to get the code updates. Then, in your config file, set the backbone to

model:
  backbone: resnet50_human_hand

The first time you do this you'll see the weights being downloaded from MMPose, and then the model will train like normal. Please let me know how this works for you!

JamilHanouneh commented 5 months ago

@themattinthehatt Thanks for adding the new backbone! I'll give it a try and update you on the results.

By the way, regarding your previous questions:

Did you end up changing the backbone and the resizing dimensions? Yes, I did. I found that adjusting the resizing dimensions had a more significant impact. How long did it take to train? It took about 19 hours. What type of GPU are you using? I'm using an RTX 4090 paired with a Ryzen 9 processor. I'll keep you posted on how the new backbone performs!

paninski-lab / lightning-pose

Poor Training Results on Hand Tracking Dataset Despite Using Various Backbones and Loss Functions #166

dimensions of training images

resize dimensions to streamline model creation

ABSOLUTE path to data directory

ABSOLUTE path to unlabeled videos' directory

location of labels; this should be relative to data_dir

downsample heatmaps - 2 | 3 was 2

total number of keypoints

keypoint names

list of indices of keypoints used for pca singleview loss (use order of labels file)

select from one of several predefined image/video augmentation pipelines

default- resizing only

dlc- imgaug pipeline implemented in DLC 2.0 package

dlc-top-down- dlc augmentations plus vertical and horizontal flips

batch size of labeled data during training

batch size of labeled data during validation

batch size of labeled data during test

fraction of labeled data used for training

fraction of labeled data used for validation (remaining used for test)

<=1 - fraction of total train frames (determined by train_prob) used for training

>1 - number of total train frames used for training

number of gpus to train a single model

number of cpu workers for data loaders

epochs over which to assess validation metrics for early stopping

epoch at which backbone network weights begin updating

max training epochs; training may exit before due to early stopping

frequency to log training metrics (one step is one batch)

frequency to log validation metrics

select gpu for training

rng seed for labeled batches

rng seed for weight initialization

learning rate scheduler

multisteplr | [todo - reducelronplateau]

list of unsupervised losses:

- "pca_singleview"

- "pca_multiview"

- "temporal"

- "unimodal_mse"

- "unimodal_kl"

- null: if null, assume a model is supervised.

backbone network:

resnet18 | resnet34 | resnet50 | resnet101 | resnet152 | resnet50_contrastive

resnet50_animalpose_apose | resnet50_animal_ap10k

resnet50_human_jhmdb | resnet50_human_res_rle | resnet50_human_top_res

efficientnet_b0 | efficientnet_b1 | efficientnet_b2

prediction mode: regression | heatmap | heatmap_mhcrnn (context)

which heatmap loss to use: mse | kl | js

directory name for model saving

loss = projection onto the discarded eigenvectors

weight in front of PCA loss

loss = projection onto the discarded eigenvectors

weight in front of PCA loss

loss = norm of distance between successive timepoints

weight in front of temporal loss

paths to the hydra config files in the output folder, OR absolute paths to such folders.

predict?

save labeled .mp4?

will be the name of the dataset (Mongo DB) created by FiftyOne. for video dataset, we will append dataset_name + "_video"

str with an absolute path to a directory containing videos for prediction.

set to null to skip automatic video prediction from train_hydra.py script

str with an absolute path to directory in which you want to save .csv with predictions

confidence threshold for plotting a vid

str with absolute path to the video file you want plotted with keypoints

a list of strings, each points to a .csv file with predictions of a given model to the same video. will be combined with video_file_to_plot to make a visualization

location of labels; this should be relative to `data_dir`

<=1 - fraction of total train frames (determined by `train_prob`) used for training