sming256 / OpenTAD

OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.

AdaTAD works worse than expected on IKEA ASM dataset #27

Open tongda opened 1 month ago

tongda commented 1 month ago

Hi, I have tried to train an AdaTAD model on the IKEA ASM dataset, following the THUMOS config with the VideoMAE-Base backbone.

The final epoch output is:

2024-07-15 09:17:18 Train INFO: [Train]: [059][00050/00126]  Loss=0.5143  cls_loss=0.2856  reg_loss=0.2287  lr_backbone=3.9e-05  lr_det=3.9e-05  mem=4993MB
2024-07-15 09:22:19 Train INFO: [Train]: [059][00100/00126]  Loss=0.5090  cls_loss=0.2780  reg_loss=0.2310  lr_backbone=3.8e-05  lr_det=3.8e-05  mem=4993MB
2024-07-15 09:25:03 Train INFO: [Train]: [059][00126/00126]  Loss=0.5022  cls_loss=0.2728  reg_loss=0.2294  lr_backbone=3.8e-05  lr_det=3.8e-05  mem=4993MB

The evaluation result is:

2024-07-15 09:32:55 Train INFO: Evaluation starts...
2024-07-15 09:32:57 Train INFO: Loaded annotations from validation subset.
2024-07-15 09:32:57 Train INFO: Number of ground truth instances: 0
2024-07-15 09:32:57 Train INFO: Number of predictions: 234000
2024-07-15 09:32:57 Train INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-15 09:32:57 Train INFO: Average-mAP:  nan (%)
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.30 is  nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.40 is  nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.50 is  nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.60 is  nan%
2024-07-15 09:32:57 Train INFO: mAP at tIoU 0.70 is  nan%
2024-07-15 09:32:57 Train INFO: Training Over...

Running inference on a test video with the model, I marked the actions at the bottom of the frame (the first bar is GT, the second bar is predicted). From the snapshot below, we can see that most of the predicted actions are wrong.

[image: GT vs. predicted action bars on a test video]
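For reference, this kind of overlay can be reproduced with something like the sketch below (a minimal matplotlib example, not my actual script; the segments and labels are made-up placeholders):

```python
# Minimal sketch of the GT-vs-prediction bars (illustrative only; the
# segments and labels below are placeholders, not real IKEA ASM data).
import matplotlib.pyplot as plt

gt_segments = [(2.0, 7.5, "pick up side panel"), (9.0, 14.0, "attach leg")]
pred_segments = [(2.3, 7.0, "pick up back panel"), (9.5, 13.5, "attach leg")]

fig, ax = plt.subplots(figsize=(10, 2))
for start, end, label in gt_segments:        # first bar: ground truth
    ax.barh(y=1, width=end - start, left=start, height=0.8, color="tab:green")
    ax.text(start, 1, label, va="center", fontsize=7)
for start, end, label in pred_segments:      # second bar: predictions
    ax.barh(y=0, width=end - start, left=start, height=0.8, color="tab:orange")
    ax.text(start, 0, label, va="center", fontsize=7)
ax.set_yticks([1, 0], labels=["GT", "Pred"])
ax.set_xlabel("time (s)")
plt.tight_layout()
plt.show()
```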

When processing the dataset, I only removed the 'NA' label; no other preprocessing was done. Any ideas on how to improve this?

sming256 commented 1 month ago

Hi @tongda, please check your ground truth JSON file.

I saw `2024-07-15 09:32:57 Train INFO: Number of ground truth instances: 0` in your evaluation log, which indicates that no ground-truth actions were loaded.
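A quick way to confirm this is to count the instances per subset directly from the annotation file, for example with a sketch like the one below (it assumes an ActivityNet-style JSON with a top-level `database` dict and `subset`/`annotations` fields per video; the path is a placeholder):

```python
# Sketch: count ground-truth instances per subset in the annotation JSON.
# Assumes an ActivityNet-style layout: {"database": {video_id: {"subset": ...,
# "annotations": [{"label": ..., "segment": [start, end]}, ...]}}}.
import json
from collections import Counter

with open("data/ikea_asm/annotations.json") as f:  # hypothetical path
    database = json.load(f)["database"]

counts = Counter()
for video_id, info in database.items():
    counts[info["subset"]] += len(info.get("annotations", []))

print(counts)  # if the validation/test subset shows 0 here, the GT file is the problem
```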

tongda commented 1 month ago

Yes, you are right. I fixed it and reran test.py. Here is the result:

2024-07-16 23:40:32 Test INFO: Loaded annotations from testing subset.
2024-07-16 23:40:32 Test INFO: Number of ground truth instances: 1855
2024-07-16 23:40:32 Test INFO: Number of predictions: 234000
2024-07-16 23:40:32 Test INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-16 23:40:32 Test INFO: Average-mAP: 39.07 (%)
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.30 is 50.29%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.40 is 46.92%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.50 is 40.58%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.60 is 34.26%
2024-07-16 23:40:32 Test INFO: mAP at tIoU 0.70 is 23.32%
2024-07-16 23:40:32 Test INFO: Testing Over...

This makes sense now. However, I would like to improve it further. Any suggestions?

sming256 commented 1 month ago

Typically, you can optimize the following 4 hyper-parameters for better performance in end-to-end training: the number of feature pyramid levels, the weight of the regression loss, the number of training epochs, and the learning rate for the adapter.

tongda commented 1 month ago

Let me try to understand these hyper-parameters (a quick back-of-envelope sketch for the first one follows after the list). Please correct me if I am wrong.

Typically, you can optimize the following 4 hyper-parameters for better performance in end-to-end training.

  • the number of feature pyramid levels. -> A larger number of levels means a larger receptive field along the time axis, so if actions are long, this should be set to a larger value.
  • the weight of the regression loss. -> A larger regression loss weight means the model will try to localize start/end times more accurately, but it may hurt the classification loss?
  • the number of training epochs. -> More epochs can give a better model, but with a risk of overfitting.
  • the learning rate for the adapter. -> A larger learning rate makes the model converge faster, but it may get stuck in a local optimum.
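
To make the first point concrete: in an ActionFormer-style pyramid each extra level downsamples time by 2, so the coarsest level covers roughly base_stride * 2^(levels-1) frames per feature step. A back-of-envelope check (the base stride and fps below are assumptions, not values from the config):

```python
# Back-of-envelope: each extra pyramid level halves the temporal resolution,
# so the coarsest level covers roughly base_stride * 2**(levels - 1) frames.
base_stride = 4  # assumed frame stride of the finest pyramid level (illustrative)
fps = 30         # assumed video frame rate (illustrative)
for levels in (4, 5, 6, 7):
    coarsest_stride = base_stride * 2 ** (levels - 1)
    print(f"{levels} levels -> one feature step at the coarsest level spans "
          f"~{coarsest_stride} frames (~{coarsest_stride / fps:.1f} s)")
```
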
sming256 commented 1 month ago

Perfect! Your understanding is completely correct. Since these are hyper-parameters, we need to search over them to find the optimal setting for a new dataset.
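
If it helps, the search itself can be enumerated with a tiny grid like the sketch below (the candidate values are arbitrary examples, not recommendations; each combination would then be written into a config variant and trained with the usual training script):

```python
# Sketch: enumerate a small grid over the four knobs discussed above.
# Values are arbitrary examples, not recommendations.
from itertools import product

pyramid_levels = [4, 5, 6]
reg_loss_weights = [1.0, 2.0]
num_epochs = [40, 60, 100]
adapter_lrs = [1e-4, 4e-4]

for levels, reg_w, epochs, lr in product(pyramid_levels, reg_loss_weights, num_epochs, adapter_lrs):
    run_name = f"lv{levels}_rw{reg_w}_ep{epochs}_lr{lr:g}"
    print(run_name)  # e.g. generate one config variant per run and launch training
```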

tongda commented 1 month ago

Thanks for your patience.

I have tried an input size of 224. Here is the last-epoch log:

2024-07-17 15:13:51 Train INFO: [Train]: [059][00050/00126]  Loss=0.2956  cls_loss=0.1436  reg_loss=0.1519  lr_backbone=3.9e-05  lr_det=3.9e-05  mem=7642MB
2024-07-17 15:19:24 Train INFO: [Train]: [059][00100/00126]  Loss=0.2961  cls_loss=0.1430  reg_loss=0.1531  lr_backbone=3.8e-05  lr_det=3.8e-05  mem=7642MB
2024-07-17 15:22:14 Train INFO: [Train]: [059][00126/00126]  Loss=0.2924  cls_loss=0.1407  reg_loss=0.1518  lr_backbone=3.8e-05  lr_det=3.8e-05  mem=7642MB

And the evaluation result:

2024-07-17 15:31:35 Train INFO: Number of ground truth instances: 1855
2024-07-17 15:31:35 Train INFO: Number of predictions: 234000
2024-07-17 15:31:35 Train INFO: Fixed threshold for tiou score: [0.3, 0.4, 0.5, 0.6, 0.7]
2024-07-17 15:31:35 Train INFO: Average-mAP: 39.18 (%)
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.30 is 51.95%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.40 is 46.34%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.50 is 41.52%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.60 is 33.16%
2024-07-17 15:31:35 Train INFO: mAP at tIoU 0.70 is 22.95%

There is still a lot of room for improvement. I will leave this issue open and keep updating it with different hyper-parameter experiments.

tongda commented 1 month ago

I made some changes:

  1. set the number of epochs to 100;
  2. added a center_crop transform after the decord_init transform, because the video is 1920x1080 and most actions occur at the center (a rough pipeline sketch follows below);
  3. changed the resolution to 224x224.
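
This is roughly the decode/crop/resize part of the pipeline I used (transform names are mmaction-style and approximate, and the frame-sampling step is omitted; check the AdaTAD THUMOS config for the exact types and arguments before copying):

```python
# Rough sketch of the decode / crop / resize portion of the train pipeline.
# Transform names are mmaction-style and approximate; the frame-sampling
# transform between init and decode is omitted.
pipeline_tail = [
    dict(type="mmaction.DecordInit"),                  # open the video with decord
    # ... frame-sampling transform goes here ...
    dict(type="mmaction.DecordDecode"),                # decode the sampled frames
    dict(type="mmaction.CenterCrop", crop_size=1080),  # keep the 1080x1080 center of the 1920x1080 frame
    dict(type="mmaction.Resize", scale=(224, 224), keep_ratio=False),  # final 224x224 input
    # ... formatting / packing transforms left unchanged ...
]
```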

The training curve looks great: it converges fast at first and keeps going down. [image: training loss curve]

However, the evaluation does not look as good: the best average mAP (40.88) is at epoch 40, which is the first evaluation, and the lower the loss gets, the worse the mAP becomes. [image: evaluation mAP per epoch]

The full log is here: log.txt

I wonder whether mAP really reflects the actual performance here. What do you think?

tongda commented 1 month ago

I visualized one of the test videos, keeping actions with score > 0.3. The actions are drawn at the bottom: the first line is GT and the others are predicted actions.

[image: GT vs. predicted actions on a test video]

I feel the result makes sense. Some of the wrong predictions are confusions such as "pick up back panel" vs. "pick up side panel". I think this result is better than the previous training run.

sming256 commented 1 month ago

Thank you for the update!

  1. Best validation loss may not correspond to best mAP. Yes. Particularly for end-to-end trained ActionFormer, the best mAP often occurs at a middle epoch when training with a longer schedule. This is related to the cosine scheduler setting of the optimizer (see the sketch after this list).

  2. About visualization. Your visualization result is pretty good. It makes sense considering the ambiguity of the annotated action boundaries and the fine-grained action categories.

  3. To further improve the results. Since some actions may not have been seen when the backbone was pretrained, and some actions are very similar, such as pick up back panel vs. pick up side panel, you can also consider fine-tuning the backbone (such as VideoMAE) on the action recognition task using your own dataset. This approach is usually very effective on Ego4D/Epic-Kitchens, since the pretraining videos and the new videos are very different.
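
To illustrate the scheduler point in (1): with a warmup-plus-cosine schedule, the learning rate in the last third of training is already very small, so later epochs mostly keep fitting the training set while validation mAP can plateau or drop. A minimal, generic PyTorch sketch (not the exact OpenTAD scheduler configuration):

```python
# Generic warmup + cosine-annealing schedule (illustrative; not the exact
# OpenTAD scheduler config). Note how small the LR is in the final epochs.
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-4)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=55, eta_min=1e-8)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[5])

for epoch in range(60):
    optimizer.step()   # dummy step so the scheduler can advance
    scheduler.step()
    if epoch % 10 == 0:
        print(epoch, scheduler.get_last_lr()[0])
```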

tongda commented 1 month ago

@sming256 I am confused about the fine-tuning.

..., you can also consider fine-tuning the backbone (such as VideoMAE) on the action recognition task using your own dataset.

What does the "action recognition task" mean here? Should I generate clips for each action to fine-tune the backbone, or just unfreeze the backbone in the end-to-end training process?

Well, full fine-tuning will cost much more GPU memory than I can afford (I only have a 4090).

From the paper, I found two points:

Using K400 for pretraining, we observe that end-to-end TAD training allows for +5.56 gain. Conversely, using a model already finetuned on EPICKitchens still yields a +2.32 improvement.

More importantly, when we apply the full finetuning on Ego4d-MQ, no performance gain was observed (27.01% mAP).

Maybe I should change the backbone to InternVideo?

sming256 commented 1 month ago

  1. Fine-tuning the backbone on the action recognition task. Just as is commonly done on EPIC-Kitchens, given a TAD dataset, you can trim only the foreground actions from the long videos. Each action can then be treated as a clip, resulting in an action recognition dataset (see the sketch after this list). This approach is usually very helpful if the pretraining dataset has a large domain gap with the downstream detection dataset. Once you have this fine-tuned backbone, you can then use AdaTAD to further tune it on the detection task.

  2. Changing the backbone to InternVideo1 is an option, but I guess the performance would be similar to VideoMAE-L. InternVideo2 is still too large for end-to-end training so far.
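
As a concrete illustration of (1): from an ActivityNet-style TAD annotation file you can derive a clip-level recognition list like the sketch below (paths, key names, and subset names are placeholders; it only writes out (video, start, end, label) entries, which you would then feed to a recognition fine-tuning pipeline such as VideoMAE's):

```python
# Sketch: turn TAD annotations into an action-recognition clip list.
# Assumes an ActivityNet-style JSON ({"database": {vid: {"subset": ...,
# "annotations": [{"segment": [s, e], "label": ...}]}}}); paths are placeholders.
import csv
import json

with open("data/ikea_asm/annotations.json") as f:  # hypothetical path
    database = json.load(f)["database"]

with open("recognition_train_clips.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_id", "start_sec", "end_sec", "label"])
    for video_id, info in database.items():
        if info.get("subset") != "training":       # subset name is an assumption
            continue
        for ann in info.get("annotations", []):
            start, end = ann["segment"]
            writer.writerow([video_id, f"{start:.2f}", f"{end:.2f}", ann["label"]])
```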