shivanimall closed this issue 2 months ago
Hey, hope all is well!
The provided `queries_xyt` is needed as input during evaluation, to prompt the model so it knows which point trajectories to regress (and so we can compare those exact point trajectories with the GT trajectories). During generation of the TAPVid-3D evaluation dataset, query points were sampled uniformly at random along the visible sections of the trajectory, one query point per trajectory, IIRC.
This is slightly different from the code you linked, which I believe is for the TAPVid (2D)-DAVIS strided evaluation, where the query points are chosen at regular intervals (the fixed `query_stride`), so indeed, you can compute those given just the GT trajectories and visibility flags. Query point selection there is a deterministic process, so it didn't need to be provided.
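For intuition, here is a minimal sketch of that random scheme (not the actual dataset-generation code; the names `tracks_xy`, with shape `(frames, tracks, 2)`, and the boolean `visible`, with shape `(frames, tracks)`, are assumptions for illustration):

```python
import numpy as np

def random_query_points(tracks_xy, visible, seed=0):
    """One query per trajectory, sampled uniformly at random from that
    trajectory's visible frames (the scheme described above).

    tracks_xy: (frames, tracks, 2) float array of positions.
    visible:   (frames, tracks) bool array of visibility flags.
    Returns an array of (x, y, t) rows, matching the queries_xyt ordering.
    """
    rng = np.random.default_rng(seed)
    queries = []
    for i in range(tracks_xy.shape[1]):
        visible_frames = np.flatnonzero(visible[:, i])
        t = rng.choice(visible_frames)  # uniform over visible frames
        x, y = tracks_xy[t, i]
        queries.append((x, y, t))
    return np.array(queries)
```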
There are pros and cons and reasons for each, but maybe that's for another post.
Hope that helps!
Closing this issue, but feel free to re-open if there are any other questions.
Hello,
Thanks a lot for your explanation.
I see that the following is what the `target_points` and `query_points` look like. Please note that I am only interested in the 2D coords of these points.
In the TAPVid-2D code, the `target_points` were resized or normalised to lie within the range of the frames; is this also required for the `target_points` here?
Also, I am trying to interpret the x, y values below. If these are in metres, I assume we'd first have to convert them to lie in the video's grid coordinate system? Sorry in case I missed it, but is there any conversion code you've posted as part of the demo? Is it the viz function in block 12 in the code here?
```
(Pdb) query_points.shape
(1, 256, 3)
(Pdb) query_points
array([[[9.77735881e+02, 8.49205850e+02, 1.55000000e+02],
        [8.65673841e+02, 8.64507419e+02, 1.63000000e+02],
        [9.34914097e+02, 6.26666269e+02, 5.00000000e+00],
        [8.47458578e+02, 8.88853920e+02, 1.46000000e+02],
        [9.30737061e+02, 6.54952400e+02, 1.30000000e+01],
        [9.62066336e+02, 6.95222371e+02, 3.90000000e+01],
        [9.72375402e+02, 8.59423373e+02, 1.21000000e+02],
        [9.69111316e+02, 6.39282901e+02, 2.30000000e+01],
        ...
(Pdb) target_points.shape
(179, 256, 2)
(Pdb) target_points
array([[[935.05677151, 638.0139997 ],
        [917.54683821, 643.21449828],
        [936.69168266, 629.34635142],
        ...,
        [921.22191931, 641.58799871],
        [925.87548293, 630.90202337],
        [928.2173429 , 647.11338822]],
       [[935.15884225, 639.99935385],
        [917.37813274, 645.26049502],
        [936.8280104 , 631.20069439],
        ...,
```
@skoppula thanks a lot for your help and explanation! I am also curious to know more about the reasons / pros / cons, and will open a new post for it.
Hey again, hope all is well! Sorry for the late reply -- didn't get the notification for some reason (maybe because the issue was closed?).
> I am trying to interpret these x, y values below
I suspect these values are from TAPVid-2D (not TAPVid-3D), and probably represent pixel values (not metres) for the 2D tracking task. TAPVid-3D has its "ground truth" saved as a tensor `tracks_xyz` with shape `(# of frames, # of tracks, 3)`, corresponding to the `(x, y, z)` position of each point.
> is this also required for the target_points here?
To read and visualize the raw dataset, you shouldn't need to normalize anything; you'll just need the intrinsics matrix to project from metres back into pixel space.
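For illustration, here is a sketch of that projection under a simple pinhole model (the 3x3 intrinsics matrix `K` and the function name are assumptions, not the repo's own code):

```python
import numpy as np

def project_to_pixels(tracks_xyz, K):
    """Project 3D points in camera coordinates (metres) to pixel
    coordinates using a 3x3 pinhole intrinsics matrix K.

    tracks_xyz: (..., 3) array of (x, y, z) points in front of the camera.
    Returns an (..., 2) array of (u, v) pixel coordinates.
    """
    fx, fy = K[0, 0], K[1, 1]  # focal lengths in pixels
    cx, cy = K[0, 2], K[1, 2]  # principal point
    x, y, z = tracks_xyz[..., 0], tracks_xyz[..., 1], tracks_xyz[..., 2]
    u = fx * x / z + cx  # pixel column
    v = fy * y / z + cy  # pixel row
    return np.stack([u, v], axis=-1)
```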
For computing the TAPVid-3D metrics, yes: evaluation is run at 256x256 resolution (used when computing the metric's position-error threshold). This is the convention we used for the numbers in the paper, as in TAPVid-2D, and it is set and indicated in the released metrics evaluation code: https://github.com/google-deepmind/tapnet/blob/main/tapnet/tapvid3d/evaluation/evaluate_model.py#L179, and also in the updated arXiv copy of the paper.
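As a rough sketch of what that rescaling step amounts to (the released `evaluate_model.py` is the authoritative version; `width` and `height` here are assumed to be the original frame dimensions):

```python
import numpy as np

def rescale_to_256(points_xy, width, height):
    """Rescale pixel coordinates from a (height, width) frame to the
    256x256 resolution used when computing position-error thresholds.

    points_xy: (..., 2) array of (x, y) pixel coordinates.
    """
    scale = np.array([256.0 / width, 256.0 / height])
    return points_xy * scale
```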
If it helps, we've released a visualization Colab that loads and visualizes the samples in TAPVid-3D, which may help in interpreting each of the saved tensors: https://colab.research.google.com/drive/1Ro2sE0lAvq-h0lixrUBB0oTYXEwXNr66#scrollTo=VpZckeIS5t1k.
Hopefully this clears things up, feel free to reach out here or email if you have other questions!
Hello,
I saw that `queries_xyt` was supplied in TAPVid-3D (also, why was it given here but not in TAPVid?). Can it also be directly computed from `tracks_xyz`/`target_points` and `visibility`/`occluded`, as in the following code? (For example, if I were to do this for 2D only.) Thank you, and let me know if I am misunderstanding or missed an earlier GitHub issue.
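The referenced code snippet didn't come through in this thread; for the 2D strided case, one hypothetical sketch of such a computation (the names `target_points`, `occluded`, and `query_stride` follow the TAPVid-2D convention, and the `(t, y, x)` row ordering should be double-checked against your pipeline) might be:

```python
import numpy as np

def strided_query_points(target_points, occluded, query_stride=5):
    """Deterministic strided selection, TAPVid(2D)-DAVIS style: every
    query_stride-th frame in which a point is visible yields one query.

    target_points: (frames, tracks, 2) pixel coordinates.
    occluded:      (frames, tracks) bool occlusion flags.
    Returns an array of (t, y, x) rows.
    """
    queries = []
    num_frames, num_tracks, _ = target_points.shape
    for i in range(num_tracks):
        for t in range(0, num_frames, query_stride):
            if not occluded[t, i]:  # only visible frames become queries
                x, y = target_points[t, i]
                queries.append((t, y, x))
    return np.array(queries)
```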