yhyu13 opened 1 year ago
Thank you for your question! TAPIR is undoubtedly an amazing piece of work. Our method and TAPIR are fundamentally different in how they work, and I believe they are complementary.
TAPIR and most tracking methods are feed-forward, whereas ours is a test-time optimization-based method. TAPIR is trained on large amounts of video data, and when given a new video sequence at test time, it can directly compute raw tracking results for that video. Our method, on the other hand, needs to be optimized on each video separately (substantially slower!). To perform the optimization, our method takes the raw tracking results from existing methods as a noisy supervising signal. So methods like TAPIR provide input motion to our system, and our method reconciles and completes the possibly noisy and inconsistent motion to obtain a global motion representation for the video. With better input motion, the results of our method will also likely improve. And as mentioned on TAPIR's webpage, our method "could potentially be used on top of TAPIR tracks to further improve performance." Note that TAPIR achieves much better tracking accuracy on the TAP-Vid benchmark than OmniMotion optimized with input motion from RAFT.
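To make this more concrete, here is a toy sketch of what "test-time optimization against noisy raw tracks" means in general. The shapes, names, and the simple smoothness objective below are illustrative assumptions only; OmniMotion actually optimizes a much richer quasi-3D representation rather than smoothing tracks directly.

```python
# Toy illustration of test-time optimization on noisy tracks (NOT the actual
# OmniMotion objective): refine [num_frames, num_points, 2] trajectories by
# gradient descent on a data term (stay near the raw tracks) plus a temporal
# smoothness term.
import numpy as np

def refine_tracks(raw_tracks, smooth_weight=2.0, lr=0.05, num_steps=500):
    x = raw_tracks.astype(np.float64)              # trajectories being optimized
    for _ in range(num_steps):
        grad = 2.0 * (x - raw_tracks)              # gradient of ||x - raw||^2
        diff = np.diff(x, axis=0)                  # frame-to-frame displacement
        grad[:-1] += -2.0 * smooth_weight * diff   # gradient of the smoothness term
        grad[1:]  +=  2.0 * smooth_weight * diff
        x -= lr * grad
    return x

# 50 frames, 3 points moving on straight lines, corrupted by noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)[:, None, None]
clean = np.concatenate([t * 100.0, t * 40.0], axis=-1) * np.ones((1, 3, 1))
noisy = clean + rng.normal(scale=3.0, size=clean.shape)
refined = refine_tracks(noisy)
print("mean abs error, raw vs. refined:",
      np.abs(noisy - clean).mean(), np.abs(refined - clean).mean())
```

In the same spirit, better raw tracks (e.g., from TAPIR instead of RAFT) give the per-video optimization a cleaner supervision signal to start from.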
Some other differences: our method produces a compact representation of the motion of the entire video; our optimized motion tends to be more temporally coherent; our method can provide plausible locations for points even when they are occluded; and our method provides pseudo-3D reconstructions.
Lastly, in my opinion, we need both generalizable methods like TAPIR, which learn very useful priors from data, and test-time optimization methods like ours, which can take noisy motion data and refine it for a particular video sequence for better quality and coherence.
> To perform the optimization, our method takes the raw tracking results from existing methods as the noisy supervising signal.

For this step, does your method need trajectories across all video frames, or just the frames before the current time t?
It takes the trajectories across all video frames.
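To illustrate why the optimization is non-causal (a hedged sketch, not the actual OmniMotion training loop): the noisy supervision can be thought of as correspondences between arbitrary pairs of frames from the whole video, so an optimization step may draw on future frames just as much as past ones. The shapes and sampling scheme below are assumptions for illustration.

```python
# Hedged sketch: supervision spans the whole video, so an optimization step
# can sample any frame pair (i, j), including pairs after the query time t.
# Shapes and sampling are illustrative assumptions, not the exact
# OmniMotion implementation.
import numpy as np

num_frames, num_points = 80, 256
# raw_tracks[i, j]: noisy correspondences from frame i to frame j,
# each of shape [num_points, 2] (placeholder zeros here).
raw_tracks = np.zeros((num_frames, num_frames, num_points, 2))

rng = np.random.default_rng(0)
for step in range(3):
    i, j = rng.integers(0, num_frames, size=2)   # any pair: past or future
    target = raw_tracks[i, j]                    # noisy supervision for this pair
    # ... map points from frame i to frame j through the motion representation,
    # compare against `target`, and take a gradient step ...
    print(f"step {step}: supervising frame pair ({i}, {j})")
```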
https://github.com/deepmind/tapnet#tapir-demos
DeepMind has similar work that uses the same test images as your work. How is your work fundamentally different from DeepMind's?
I am a hobbyist, so I would be grateful if you could spend some time briefly explaining the differences in terms of purpose, approach, and results.
Thanks!