Describe the feature and the current behaviour/state
Currently, when feeding a video to the Hand Landmark detector, every frame is (to my knowledge) analyzed independently. For example, if a single Hand A is in the video, the list of landmark results has length 1; when a second Hand B enters the frame, the results have length 2, but there is no guarantee about whether the first element of the list corresponds to Hand A or Hand B. This keeps happening whenever multiple hands are present: in one frame the list might be [A, B, C], and in the next frame [B, A, C], which makes many kinds of per-hand operations very hard to perform.
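As a workaround until stable IDs are available, one can match each frame's detections against the previous frame's by wrist position, since hands rarely jump far between consecutive frames. The sketch below (my own code, not part of the MediaPipe API; the function name and greedy strategy are assumptions) pairs current and previous wrist coordinates by distance:

```python
import numpy as np

def match_hands(prev_wrists, curr_wrists):
    """Greedily match current detections to previous ones by wrist distance.

    prev_wrists, curr_wrists: arrays of shape (n, 2) holding normalized
    (x, y) wrist coordinates (landmark 0 in the hand landmark list).
    Returns a list where entry i is the index into prev_wrists that current
    hand i was matched to, or None for a newly appeared hand.
    """
    assignment = [None] * len(curr_wrists)
    if len(prev_wrists) == 0 or len(curr_wrists) == 0:
        return assignment
    # Pairwise distances between every current and every previous wrist.
    dists = np.linalg.norm(
        curr_wrists[:, None, :] - prev_wrists[None, :, :], axis=-1
    )
    # Greedy assignment: repeatedly take the globally closest unmatched pair.
    used_prev, used_curr = set(), set()
    for flat in np.argsort(dists, axis=None):
        c, p = np.unravel_index(flat, dists.shape)
        if c in used_curr or p in used_prev:
            continue
        assignment[c] = int(p)
        used_curr.add(c)
        used_prev.add(p)
    return assignment
```

For many simultaneous hands, an optimal assignment (e.g. `scipy.optimize.linear_sum_assignment`) would be more robust than the greedy loop, but the greedy version has no dependency beyond NumPy.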
Will this change the current API? How?
No response
Who will benefit from this feature?
No response
Please specify the use cases for this feature
I'm implementing a gesture classifier for which a static image of the hand is not enough to recognize the gesture. I have a dataset of videos, each containing a single hand, so from those I can extract the landmarks and train a model. However, at real-time inference, if multiple hands appear in the video I have no way to apply my model to each hand in the frame (even assuming I have stored the information from the previous frames), because the temporal sequence of each element of the results list is broken across the different hands.
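Given any frame-to-frame matching (whatever produces it), the per-hand history the use case needs can be kept in fixed-length buffers keyed by a stable track ID. A minimal sketch, with class and method names of my own invention (not MediaPipe API):

```python
from collections import deque

class HandTracks:
    """Keep a fixed-length landmark history per hand, keyed by a stable ID.

    `assignment[i]` is the previous-frame index that current hand i was
    matched to, or None for a newly seen hand (e.g. from a wrist-distance
    matcher). Each track's deque is then a per-hand temporal sequence that
    can be fed to a sequence model.
    """
    def __init__(self, maxlen=30):
        self.maxlen = maxlen
        self.tracks = {}     # track_id -> deque of landmark frames
        self.frame_ids = []  # track_id for each index in the previous frame
        self._next_id = 0

    def update(self, landmarks_per_hand, assignment):
        new_frame_ids = []
        for i, lms in enumerate(landmarks_per_hand):
            prev_idx = assignment[i]
            if prev_idx is not None and prev_idx < len(self.frame_ids):
                tid = self.frame_ids[prev_idx]  # continue an existing track
            else:
                tid = self._next_id             # start a new track
                self._next_id += 1
                self.tracks[tid] = deque(maxlen=self.maxlen)
            self.tracks[tid].append(lms)
            new_frame_ids.append(tid)
        self.frame_ids = new_frame_ids
        return new_frame_ids
```

Even if hands A and B swap positions in the results list between frames, each track ID keeps accumulating frames for the same physical hand, so the classifier can be applied per track.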
Thank you for your detailed explanation of the feature request. We are forwarding this issue internally; based on the discussion, the team will prioritise the request.
MediaPipe Solution (you are using)
Hand Landmark Detection
Programming language
Python
Are you willing to contribute it
No
Any Other info
No response