Open LHYUNMIN opened 1 month ago
A large part of the HOI labels in the dataset are human-human interactions. Here, we have the statistics of all triplets.
For human-human interactions, the latter human is simply regarded as an object. So, there is no specific pre-processing for human-human interactions.
If you only need interactions between humans, the simplest option is to set the object detector to detect only humans. Actually, we use another pretrained YOLOv5 weight to detect human faces. That weight can detect both humans and faces. You can refer to this class, set include_person to True in the track_one method, and directly use this as the object_tracking_module here.
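To illustrate the idea of restricting detection to humans, here is a minimal sketch that filters detector output down to the person class. The detection format (a dict with `label` and `conf` keys), the `PERSON_LABEL` value, and the `keep_humans` helper are all assumptions made for illustration; they are not the repository's tracking module or its API.

```python
# Hypothetical post-filter: keep only "person" detections so that every
# candidate pair passed downstream is human-human.
PERSON_LABEL = "person"  # assumed class name; depends on the detector weights


def keep_humans(detections, min_conf=0.0):
    """Return the detections whose label is the person class and whose
    confidence is at least min_conf, preserving the input order."""
    return [
        d for d in detections
        if d["label"] == PERSON_LABEL and d["conf"] >= min_conf
    ]


# Example with made-up detections
detections = [
    {"label": "person", "conf": 0.9},
    {"label": "cup", "conf": 0.8},
    {"label": "person", "conf": 0.2},
]
humans = keep_humans(detections, min_conf=0.5)
```

Filtering after detection is the least invasive choice; restricting the classes inside the detector itself would avoid wasted work, but how to do that depends on the detector being used.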
This framework is not optimized for real-time video inference. As shown in our paper, the inference speed of the Transformer alone, with window size = 6, is about 10 FPS. The object tracking and gaze following also take a bit of time. In my inference script, I set HOI prediction to run once per second.
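The "once per second" throttling can be sketched as choosing which frame indices trigger an HOI prediction. This is only an illustration under assumed names (`frames_to_predict`, `predictions_per_second`); it is not the actual inference script.

```python
def frames_to_predict(num_frames, fps, predictions_per_second=1.0):
    """Return the frame indices at which to run HOI prediction.

    With the default rate, one frame per second of video is selected.
    The stride is clamped to at least 1 so that every frame is used when
    the requested rate exceeds the video frame rate.
    """
    stride = max(1, round(fps / predictions_per_second))
    return list(range(0, num_frames, stride))


# A 3-second clip at 30 FPS: predict on the first frame of each second.
indices = frames_to_predict(num_frames=90, fps=30)
```

Tracking and gaze following would still need to run on the intermediate frames; only the comparatively slow HOI step is skipped.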
Thank you for the great response.
When identifying relationships between people, a single person is recognized as an object. In this case, what determines a person as an "object"?
For example, when there are two people, how is it decided which one becomes the object?
I can't remember the full details, but I think it works like this: you detect two people, A and B. The model will predict both <A-action-B> and <B-action-A>.
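As a small sketch of what predicting in both directions implies, the candidate pairs are the ordered pairs of distinct detected people, so each person appears once as the subject and once as the object. The helper name `ordered_human_pairs` is hypothetical, and this is an assumption about the pairing logic rather than code taken from the repository.

```python
from itertools import permutations


def ordered_human_pairs(human_ids):
    """Return every (subject, object) pair of distinct humans.

    Both directions are kept, so (A, B) and (B, A) are separate candidates
    and neither person is fixed as "the object" in advance.
    """
    return list(permutations(human_ids, 2))


pairs = ordered_human_pairs(["A", "B"])
# pairs == [("A", "B"), ("B", "A")]
```

With n detected people this gives n * (n - 1) candidate pairs, which is worth keeping in mind for crowded scenes.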
I am currently researching the correlation between people, and I found the paper very interesting.
Excuse me, did you reproduce the authors' SOTA metrics on the VidHOI dataset? My reproduced results differ noticeably from the authors', especially in Oracle mode.
| mAP | Full | Non-rare | Rare |
| --- | --- | --- | --- |
| Detection: reproduced results | 10.25 | 16.423 | 5.8 |
| Detection: authors' results | 10.4 | 16.83 | 5.46 |
| Oracle: reproduced results | 35.64 | 49.31 | 25.14 |
| Oracle: authors' results | 38.61 | 52.44 | 27.99 |
How is training conducted for human-human interactions?
Was human-human interaction data added to the existing dataset?
Is it possible to detect interactions between humans alone, not just human-object interactions?
If the input is a video, is real-time inference possible?