mrwu-mac / EoID

Repo for the paper "End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation" (AAAI 2023).
Apache License 2.0

Training when model='cdn'? #6

Closed PradKalkar closed 1 year ago

PradKalkar commented 1 year ago

Hi. Thanks for your work!

I have a small doubt. In the train script, I just changed the model to 'cdn'. But it still gives a good mAP on the unseen split within the first few epochs, reaching around 8.3% mAP. I am not sure how it achieves such a high mAP in the unseen-action setting. The CDN model did not incorporate any zero-shot capabilities, which is why I am confused. Would you please clarify this?

mrwu-mac commented 1 year ago

Hi. Thank you for your interest in our work.

You might try visualizing the results with test_on_image.py to check whether the scores on unseen actions are reasonable. In addition, a transformer-based model has fewer inductive biases on seen HOIs, which gives it excellent potential for unseen HOI detection and contributes to the success of our method.

PradKalkar commented 1 year ago

Thanks @mrwu-mac for the quick reply. I think I understand why even the base CDN model can perform zero-shot HOI detection. The two-stage bipartite matching algorithm, which is novel in your work, was not present in the original CDN, and since it is also used with the plain CDN here, the model is able to detect unseen HOIs. Would you kindly confirm whether this reasoning is correct?

mrwu-mac commented 1 year ago

The Two-stage Bipartite Matching is only used to discover potential interactive pairs; it does not have the ability to classify unseen actions. However, it is worth noting that there are a large number of multi-label human-object pairs in the HICO-DET dataset, which may contain both seen and unseen actions. So even the original CDN can assign a very low score to these unseen actions, depending on whether gradients for these unseen actions are backpropagated to the CDN.
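For intuition on the matching step being discussed, here is a toy sketch of generic one-to-one bipartite matching between predicted human-object pairs and ground-truth pairs. It is not the paper's actual two-stage procedure (which is in the repo's `hoi.py`), just a brute-force minimum-cost assignment on a made-up cost matrix:

```python
import itertools
import numpy as np

def match_pairs(cost):
    """Brute-force one-to-one bipartite matching that minimises total cost.

    cost[i, j] = cost of assigning predicted pair i to ground-truth pair j.
    Returns perm where perm[j] is the prediction matched to GT pair j.
    (Real implementations use the Hungarian algorithm instead of brute force.)
    """
    n = cost.shape[0]
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        total = sum(cost[perm[j], j] for j in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# Toy cost matrix: 3 predicted pairs vs 3 GT pairs (hypothetical numbers)
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.9],
                 [0.8, 0.9, 0.1]])

perm, total = match_pairs(cost)
print(perm)  # best assignment: pred 1 -> GT 0, pred 0 -> GT 1, pred 2 -> GT 2
```

Note that matching only decides *which* prediction is supervised by *which* ground truth; as stated above, it carries no information about unseen action classes.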

PradKalkar commented 1 year ago

But when we train the "CDN model" with only Single-stage Bipartite Matching (I found it is used in hoi.py) in the zero-shot UA setting, we never show the model any HOIs containing unseen actions during training, right? So the model learns only from HOIs involving seen actions. How, then, does it still recognize HOIs with unseen actions without any additional module such as CLIP (which EoID uses for its zero-shot capabilities)?

mrwu-mac commented 1 year ago

The CDN trained on seen pairs can detect some pairs that contain unseen actions (see Table 1 in our paper), but it assigns a small score to these unseen actions, because it sets their label to 0 when training on the seen actions. It is unreasonable to use the original CDN for zero-shot experiments, which is why we use CDN+ConsNet as our baseline.
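The difference between "label 0 for unseen actions" and "no gradient for unseen actions" can be sketched with a toy numpy multi-label BCE loss. This is an illustration under assumed names, not the repo's actual loss code: a class mask of 1 keeps a class in the loss, while 0 removes it so no gradient ever flows back for that class.

```python
import numpy as np

def bce_with_mask(logits, targets, class_mask):
    """Mean binary cross-entropy over action classes, with per-class masking.

    class_mask[c] = 1 keeps class c in the loss; 0 drops it entirely,
    so the model is never pushed toward label 0 for that class.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    per_class = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return float((per_class * class_mask).sum() / max(class_mask.sum(), 1))

# 4 action classes; classes 2 and 3 are "unseen" in this toy split
logits  = np.array([2.0, -1.0, 0.5, -0.5])
targets = np.array([1.0,  0.0, 0.0,  0.0])   # unseen actions labelled 0

loss_label0 = bce_with_mask(logits, targets, np.ones(4))                   # unseen pushed toward 0
loss_masked = bce_with_mask(logits, targets, np.array([1., 1., 0., 0.]))   # unseen ignored
```

With the full mask, the positive scores on the unseen classes are penalized and driven down, which is why the plain CDN ends up assigning small scores to unseen actions; with the mask applied, those classes contribute nothing to the loss.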

PradKalkar commented 1 year ago

Oh, got it. Thanks @mrwu-mac for your quick and detailed responses.

PradKalkar commented 1 year ago

Hi. Just had a small doubt. In the UC setting, since the seen HOIs cover all objects and actions during training (by the setting's definition), is knowledge distillation even needed? Wouldn't the CDN model itself (without any zero-shot capabilities) already score well on the unseen cases, since the interaction decoder ultimately classifies actions (all of which are seen) rather than interactions (120 of which are unseen in the UC setting)?

mrwu-mac commented 1 year ago

Knowledge distillation also improves performance in the UC setting, as shown in our paper.

PradKalkar commented 1 year ago

Yeah, but I could not logically understand the necessity of knowledge distillation in the UC setting. As I said, all objects and actions are seen during training in UC. For UA, I understand that not all actions are seen, so external knowledge is required to reason about unseen actions. I am asking this to improve my conceptual understanding of HOI detection. Would you please help?

mrwu-mac commented 1 year ago

The UC setting asks the model to recognize unseen verb-object combinations. It is possible for the CDN to generalize from 'ride horse' to 'ride bicycle', but not from 'ride horse' to 'ride car'. Incorporating external knowledge is aimed at unseen actions, but it can also be seen as recognizing the action of an unseen verb-object combination. Hope this helps.
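The 'ride horse' vs. 'ride car' example can be made concrete with a toy cosine-similarity check. The vectors below are hand-set stand-ins for CLIP-style text embeddings of HOI phrases, not real CLIP outputs; they only encode the assumption that 'ride horse' and 'ride bicycle' share one sense of 'ride' while 'ride car' uses another:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors (hypothetical, for illustration only):
# 'ride horse' and 'ride bicycle' share the straddle-and-steer sense of 'ride',
# while 'ride car' uses the sit-inside sense, so its vector points elsewhere.
ride_horse   = np.array([0.9, 0.4, 0.1])
ride_bicycle = np.array([0.8, 0.5, 0.2])
ride_car     = np.array([0.1, 0.3, 0.9])

print(cos(ride_horse, ride_bicycle))  # high similarity
print(cos(ride_horse, ride_car))      # low similarity
```

A language model like CLIP places phrase embeddings in such a shared space, which is what lets the distilled student score an unseen combination whose verb sense is close to a seen one.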

PradKalkar commented 1 year ago

Thanks for the reply. So basically you are saying that CDN cannot account for the polysemy of actions, right? The action 'ride' is almost the same for a horse and a bicycle, but has a different interpretation for a car. So I conclude that incorporating external knowledge from CLIP helps the model handle this polysemy. Have I understood it correctly?

mrwu-mac commented 1 year ago

Right.

PradKalkar commented 1 year ago

Thanks a lot for your help!