sauradip / STALE

[ECCV 2022] Official PyTorch implementation of the paper: "Zero-Shot Temporal Action Detection via Vision-Language Prompting"
https://sauradip.github.io/project_pages/STALE/

stale_best_score #12

Closed: andypinxinliu closed this issue 1 year ago

andypinxinliu commented 1 year ago

In the inference, if the score is lower than stale_best_score, the label is replaced with the label from this JSON file. How is this JSON file obtained? I also found that without using this JSON file, the mAP drops to 9 instead of 24.9. Why is that?

yunhanwang1105 commented 1 year ago

@sauradip Could you answer this?

sauradip commented 1 year ago

Hi,

This is a well-known problem for ActivityNet: due to short videos and high intra-class variance, there are not many frames specific to certain actions. So for ActivityNet it is necessary to do this post-processing with an externally trained classifier. You can have a look at this: https://github.com/TencentYoutuResearch/ActionDetection-AFSD/issues/4

rjbruin commented 1 year ago

Do I understand correctly that this means, if your model doesn't train well on ActivityNet, you instead use the predictions of another model?

I don't see anything about this in the paper. Is that correct?

sauradip commented 1 year ago

We don't use it entirely. You can check our code: we use it only when the performance of our classifier is worse than the pretrained one. This kind of post-processing is often used in the TAD literature, as shown above; it is nothing new. AFSD (the paper mentioned above) also uses UNet post-processing, but still does not mention that in the paper.

rjbruin commented 1 year ago

Could you clarify what you mean by "we don't use [it] completely"? Do you mean "we never use it" or "we don't always use it"?

I understand that you use these pretrained scores, but they are from a different model than yours, I believe? From UntrimmedNet, according to https://github.com/TencentYoutuResearch/ActionDetection-AFSD/issues/4? Or is that different in this case?

If the outputs are sometimes replaced with those of another model, wouldn't it be hard to know if the performance is from the proposed model or the other model? If this is not reported in the paper, it would be hard for a reader to be aware of this fact.

sauradip commented 1 year ago

We don't always use it. Existing work like AFSD never uses its own action classifier's predictions. We argue against that, since our predictions are sometimes better than those of the UNet model. Hence, for some videos in the test set we don't use the classes predicted by UNet; we decide this by comparing the confidence of our classifier against UNet's.

Yes, we used the same UNet (UntrimmedNet) model for post-processing.
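
For intuition, here is a minimal sketch of this kind of confidence-gated fallback. The threshold value, the JSON layout, and all names here are illustrative only; the actual logic lives in `utils/postprocess_utils.py`:

```python
import json
import numpy as np

# Hypothetical sketch: fall back to an external classifier (e.g. UntrimmedNet)
# only when STALE's own classification confidence is below a threshold.
STALE_BEST_SCORE = 0.5  # illustrative confidence threshold

with open("external_video_scores.json") as f:  # per-video class scores from UNet
    external_scores = json.load(f)

def refine_label(video_id, stale_probs, class_names):
    """Return (label, score) from STALE's prediction, or from the external
    classifier's top class when STALE is not confident enough."""
    stale_idx = int(np.argmax(stale_probs))
    stale_score = float(stale_probs[stale_idx])
    if stale_score >= STALE_BEST_SCORE:
        return class_names[stale_idx], stale_score
    # fall back to the pretrained weakly supervised classifier's scores
    ext = np.asarray(external_scores[video_id])
    ext_idx = int(np.argmax(ext))
    return class_names[ext_idx], float(ext[ext_idx])
```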

rjbruin commented 1 year ago

Ah, okay. So let me see if I understand: your premise is that if the model is not confident, you can fall back to the predictions of another model. But that will only work if those predictions are available, right?

So this will work well on a standardized benchmark like ActivityNet, but it will not work for a company's in-house dataset, or in an applied setting with unseen videos? Or am I misunderstanding?

sauradip commented 1 year ago

Yes, UNet scores are available for standardized benchmarks like THUMOS and ActivityNet.

However, you can also do this for in-house datasets. I would recommend using any weakly supervised temporal action detection method (weak supervision: video-level labels), since the classifier of a weakly supervised model actually returns the classes. You can train such a network on your custom data and get class scores for your own in-house data, as in the sketch below.
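
As a rough sketch (all names here are hypothetical), the external score file only needs to map each video id to per-class scores, so any weakly supervised classifier you train could export it like this:

```python
import json
import torch

# Hypothetical sketch: dump video-level class scores from your own
# weakly supervised classifier into the same JSON layout the
# post-processing expects (video_id -> list of class scores).
def export_video_scores(model, dataset, out_path="inhouse_video_scores.json"):
    model.eval()
    scores = {}
    with torch.no_grad():
        for video_id, features in dataset:      # one video at a time
            logits = model(features)            # (num_classes,) video-level logits
            scores[video_id] = torch.softmax(logits, dim=-1).tolist()
    with open(out_path, "w") as f:
        json.dump(scores, f)
```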

rjbruin commented 1 year ago

Thanks for the suggestion! I see this would indeed probably work.

However, this extra step is not discussed in the paper, as far as I can see. Is that correct? Would that make it hard to reproduce the published results on ActivityNet, if one implemented the method as described in the paper?

yunhanwang1105 commented 1 year ago

Thanks for the reply. After disabling score enhancement, I also got an average mAP of 8.5. I think this suggests that STALE's predictions are largely bounded by the results of UntrimmedNet.

However, in the paper you mention that an important property of STALE is preventing localization error propagation. If the classification largely comes from UntrimmedNet, how can we show that the one-stage design truly prevents localization error propagation?

Moreover, the whole point of CLIP is to do classification via vision-language modeling. The research question in the paper is whether the impressive ability of CLIP can be transferred to more complex vision tasks like dense prediction. If STALE's classification largely comes from UntrimmedNet, then STALE seems more like a network that does only localization. In that case, how can we answer this research question?

Lastly, STALE concentrates on the zero-shot setting, while UntrimmedNet is weakly supervised. If we mostly use the predictions from UntrimmedNet, how can we say STALE's performance is zero-shot?

I would appreciate your answers.

Kind regards.

sauradip commented 1 year ago

First, we are not using UNet for all the classes. https://github.com/sauradip/STALE/blob/a574ca630a5d2658eeb47798aa94a53d3188bf07/utils/postprocess_utils.py#L138 Here we use the classes from UNet only where the STALE classifier performs worse. So it is a form of refinement, not a replacement of our classifier as in https://github.com/TencentYoutuResearch/ActionDetection-AFSD/issues/4

Second, it's true that CLIP is used for classification, but the need for UNet refinement stems from the fact that CLIP was never designed for videos. This was a first attempt to show how CLIP can be adapted. For stronger generalization it is recommended to unfreeze the backbone and use ActionCLIP pretraining instead of vanilla CLIP pretraining; we did not have the resources, hence we stuck to pre-trained CLIP.

Third, this problem is very specific to ActivityNet (check AFSD, BMN), where only ~3 min of video per class does not provide enough frames for classification. This refinement has been followed by the TAD community for years. https://github.com/wangxiang1230/SSTAP/issues/13 is a semi-supervised technique where partial labels are available, yet the authors still use UNet; hence this is a problem specific to the ActivityNet dataset. Such problems do not arise on the THUMOS dataset.

HYUNJS commented 1 year ago

https://github.com/sauradip/STALE/issues/23#issuecomment-1704056365

As far as I know, heavy dependence on UNet predictions is mostly observed in models that produce class-agnostic action proposals; the referenced SSTAP model, BMN, and GTAD are not designed for classifying actions. Also, on the THUMOS14 dataset, such a refinement technique is not adopted by full TAL models (e.g., ActionFormer). Here you mentioned that such problems do not arise on the THUMOS dataset, but I can only achieve 2 mAP in the closed setting with the implementation details available in the main paper.

On the other hand, STALE is aimed at predicting action instances (start, end, class, confidence) in a zero-shot manner. How can we call a model that uses such refinement a zero-shot TAL model? Isn't that an unfair comparison?

sauradip commented 1 year ago

SSTAP is a semi-supervised setting, meaning labels for the majority of videos are not available. However, the authors used UNet, which was trained on all the videos with labels using weak supervision. This is an unfair way to evaluate the semi-supervised setting in the first place, since the classifier uses video label information that should not be available. TAL models like ActionFormer have a similar design to ours (multi-scale backbone with parallel convolutional decoders), and so does AFSD. AFSD claimed to have a full detection pipeline (i.e. the classifier should give the action class prediction), but in the actual implementation they use UNet refinement.

Yes, THUMOS has some specific tricks; I will open-source them once I am free, go through it, and let you know. The main key is to delay overfitting: the longer you delay it (which can look like the loss not reducing, as you describe), the better the chance of high performance. Learning rate, optimizer steps, and some hyperparameters are crucial in such cases. When we first started the project we did not know that a learning rate lower than CLIP's original implementation is suitable for the detection problem. Such scenarios should be considered before concluding that it is not working.
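
As a rough sketch of that learning-rate point (not the repo's actual training config; module names like `model.clip_backbone` are assumptions), separate optimizer parameter groups are one common way to keep the CLIP weights on a much smaller learning rate than the new detection heads:

```python
import torch

# Hypothetical sketch: give the CLIP backbone a much smaller learning rate
# than the newly added localization/classification heads, which helps delay
# overfitting on small TAD datasets.
optimizer = torch.optim.AdamW(
    [
        {"params": model.clip_backbone.parameters(), "lr": 1e-6},
        {"params": model.detection_heads.parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-4,
)
```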

As I said before, compare the GTAD implementation for ActivityNet against GTAD for THUMOS; you will find differences in the inference approach, they are not identical. The intrinsics of ActivityNet and THUMOS are not the same, hence very few TAD approaches can give equally strong performance on both without using any special refinement/trick/post-processing.

This is the reason I would recommend starting with AFSD, putting in blocks similar to ours, and checking whether that gives good performance. It is also a one-stage approach like ours. My approach for THUMOS in STALE is based upon AFSD.

HYUNJS commented 1 year ago

Thank you for your reply! I see that using UNet predictions for refinement has been used in previous works and has been a convention in the TAL community.

My main issue was that I could not reproduce the results (including the closed-vocabulary setting, which, unlike the zero-shot setting, is not related to overfitting to seen classes) without using any refinement or post-processing. Now I understand that it also requires these, like conventional TAL/TAD methods.