ttengwang / PDVC

End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)
MIT License

Few questions about training #30

Closed: saharshleo closed this issue 2 years ago

saharshleo commented 2 years ago

Hello @ttengwang , I am trying to train your model from scratch (just for learning purposes). However, I am facing a few issues:

  1. The train_caption_file and val_caption_file do not contain action labels, which are used in video_dataset.py (and in the classification loss). Am I using the wrong files?
  2. I tried using labels from the action-proposal dataset (with the captioning-related parts removed), but loss_ce does not decrease at all on either train or val (did you face anything like this?). loss_ce also stays in the 300-400 range.
  3. How many epochs did you train before getting decent captions?

ttengwang commented 2 years ago
  1. Since we treat localization as binary classification, all action_labels are set to 0 (0 for foreground, 1 for background). If you want to perform multi-class classification, add action_labels annotations to train_caption_file and val_caption_file.
  2. I tried to train ActivityNet Captions with the captioning head removed, and the loss and performance are reasonable. The configs are at https://github.com/ttengwang/PDVC/blob/main/cfgs/anet_c3d_props.yml
  3. 5 epochs for readable captions, and 10-25 epochs for the best performance (it depends on the video features).
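
To make point 1 concrete, here is a small sketch of what an annotation entry with action labels might look like. The layout follows the ActivityNet-Captions-style JSON (`duration`, `timestamps`, `sentences`); the `action_labels` key and the class indices are hypothetical, added for illustration:

```python
import json

# Assumed ActivityNet-Captions-style entry; "action_labels" is the
# hypothetical key discussed above, not an official field.
anns = {
    "v_example": {
        "duration": 120.0,
        "timestamps": [[0.0, 15.2], [30.5, 60.1]],
        "sentences": ["A man runs.", "He jumps a hurdle."],
        # Binary setting: every ground-truth segment is foreground (0).
        "action_labels": [0, 0],
    }
}

# For multi-class training, replace the zeros with class indices drawn
# from an assumed name-to-index mapping:
class_to_idx = {"Running": 17, "Hurdling": 42}
anns["v_example"]["action_labels"] = [
    class_to_idx["Running"], class_to_idx["Hurdling"]
]

with open("train_caption_file_multiclass.json", "w") as f:
    json.dump(anns, f)
```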

saharshleo commented 2 years ago

Thank you for your reply. Can you please provide the file with the action labels in it?

ttengwang commented 2 years ago

ActivityNet Captions was labeled by different annotators than ActivityNet 1.3, so there are no official action-label annotations for ActivityNet Captions. Sorry, I didn't train the model on ActivityNet 1.3, so you may have to prepare a new train_caption_file yourself.
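
One plausible way to prepare such a file is to match each caption segment against the ActivityNet 1.3 activity annotations by temporal IoU. This is only a hedged sketch: the dictionary layouts, field names, and the IoU threshold are assumptions, not part of either dataset's official tooling:

```python
def temporal_iou(a, b):
    """IoU between two [start, end] segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def label_segments(caption_anns, anet13_anns, label_map, iou_thresh=0.5):
    """Assign each caption segment the label of the best-overlapping
    ActivityNet 1.3 annotation, or 0 (generic foreground) otherwise.

    caption_anns: {video_id: {"timestamps": [[s, e], ...], ...}}
    anet13_anns:  {video_id: [{"segment": [s, e], "label": name}, ...]}
    label_map:    {activity name: class index} (assumed mapping)
    """
    for vid, entry in caption_anns.items():
        actions = anet13_anns.get(vid, [])
        labels = []
        for seg in entry["timestamps"]:
            best = max(actions,
                       key=lambda a: temporal_iou(seg, a["segment"]),
                       default=None)
            if best and temporal_iou(seg, best["segment"]) >= iou_thresh:
                labels.append(label_map[best["label"]])
            else:
                labels.append(0)  # fallback: unlabeled foreground
        entry["action_labels"] = labels
    return caption_anns
```

The IoU threshold trades label coverage against label noise; segments with no good match fall back to the binary foreground label.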

saharshleo commented 2 years ago

Hey, I tried your anet_c3d_props.yml config and it is now training better: the train losses decrease steadily. However, the val losses are almost constant. Can you suggest which parameters I should check or tweak (given that I am training with 200 classes)?

ttengwang commented 2 years ago

Based on my experience, the learning rate, the matcher cost ratios, and the loss ratios are the important ones.

ttengwang commented 2 years ago

Also, try setting bbox_loss_coef > 0.
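
To see why this matters, recall how DETR-style models combine their loss terms: each term is multiplied by its coefficient and summed, so a zero coefficient silences that term entirely. A minimal sketch, with term names mirroring the config keys in this thread (the exact set in PDVC may differ):

```python
# Assumed coefficient dictionary, mirroring the config keys above.
loss_coefs = {
    "loss_ce": 1.0,       # cls_loss_coef
    "loss_bbox": 5.0,     # bbox_loss_coef: 0 would remove this term
    "loss_giou": 2.0,     # giou_loss_coef
    "loss_counter": 2.0,  # counter_loss_coef
}

def weighted_total(loss_dict, coefs):
    """Sum each raw loss term scaled by its coefficient."""
    return sum(coefs[k] * v for k, v in loss_dict.items() if k in coefs)

# With bbox_loss_coef = 0, the segment-regression term contributes no
# gradient, so localization quality (and val metrics) can stagnate.
```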

saharshleo commented 2 years ago

OK, thank you. Does keeping bbox_loss_coef = 0 mean you are not using it?

These are the current parameters:

batch_size = 16
lr = 1e-4
lr_drop = 200
weight_decay = 1e-4

num_classes = 200
num_queries = 100
d_model = 512

cls_loss_coef = 1
counter_loss_coef = 2
bbox_loss_coef = 5
giou_loss_coef = 2
self_iou_loss_coef = 2
mask_prediction_coef = 2
eos_coef = 0.1

matcher.cost_class = 1 
matcher.cost_segment = 5 
matcher.cost_giou = 2
matcher.cost_alpha = 0.25
matcher.cost_gamma = 2.0

# Deformable DETR
num_heads = 8
num_feature_levels = 4
dec_n_points = 4
enc_n_points = 4
enc_layers = 6
dec_layers = 6
transformer_ff_dim = 2048

Result: (two screenshots of the loss curves were attached; not reproduced here)