google-deepmind / tapnet

Tracking Any Point (TAP)
https://deepmind-tapir.github.io/blogpost.html
Apache License 2.0

Question on Queries_XYT #116

Closed: shivanimall closed this issue 2 months ago

shivanimall commented 2 months ago

Hello,

I saw that queries_xyt is supplied in TAPVid-3D (also, why is it provided there but not in TAPVid?). Can it instead be computed directly from tracks_xyz / target_points and visibility / occluded, as in the following code? (For example, if I were to do this for 2D only.)

Thank you, and let me know if I am misunderstanding or missed an earlier github issue.

skoppula commented 2 months ago

Hey, hope all is well!

The provided queries_xyt is needed as input during evaluation, to prompt the model, so it knows which point trajectories to regress (and we can compare those exact point trajectories with GT trajectories). During generation of the TAPVid-3D evaluation dataset, query points were uniformly randomly sampled along the visible sections of the trajectory, one query point per trajectory, IIRC.
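
As a rough illustration (this is not the actual dataset-generation code; the array names and shapes below are assumptions), sampling one query point per trajectory uniformly at random from its visible frames could look like this:

import numpy as np

def sample_random_queries(tracks_xy, visible, seed=0):
  """tracks_xy: (num_frames, num_tracks, 2); visible: (num_frames, num_tracks) bool."""
  rng = np.random.default_rng(seed)
  num_tracks = tracks_xy.shape[1]
  queries_xyt = np.zeros((num_tracks, 3))
  for i in range(num_tracks):
    visible_frames = np.flatnonzero(visible[:, i])
    t = rng.choice(visible_frames)           # uniform over the frames where the point is visible
    queries_xyt[i] = (*tracks_xy[t, i], t)   # stored as (x, y, t)
  return queries_xyt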

This is slightly different from the code you linked, which I believe is for the TAPVid (2D)-DAVIS strided evaluation, where the query points are chosen at regular intervals (the fixed query_stride), so indeed, you can compute those just given the GT trajectories and visibility flags. This is a deterministic process for query point selection, so it didn't need to be provided.
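
For the strided case, a minimal sketch (assuming a (num_frames, num_tracks, 2) target_points array and a matching boolean occluded mask; names are chosen for illustration, not taken from the repo) could be:

import numpy as np

def strided_queries(target_points, occluded, query_stride=5):
  """Build (x, y, t) queries at every query_stride-th frame where the point is visible."""
  num_frames = target_points.shape[0]
  queries = []
  for t in range(0, num_frames, query_stride):
    for i in np.flatnonzero(~occluded[t]):   # only visible points become queries
      x, y = target_points[t, i]
      queries.append((x, y, t))
  return np.array(queries)  # shape (num_queries, 3)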

There are pros and cons and reasons for each, but maybe that's for another post.

Hope that helps!

skoppula commented 2 months ago

Closing this issue, but feel free to re-open if there are any other questions.

shivanimall commented 2 months ago

Hello,

Thanks a lot for your explanation.

I see that the following is what the target_points and query_points look like. Please note that I am only interested in the 2D coordinates of these points.

In the TAPVid-2D code, the target_points were resized / normalised to lie within the frame dimensions. Is this also required for the target_points here?

Also, I am trying to interpret the x, y values below. If these are in metres, I assume we'd first have to convert them into the video's pixel coordinate system? Sorry in case I missed it, but is there any conversion code posted as part of the demo? Is it the viz function in block 12 in the code here?


(Pdb) query_points.shape
(1, 256, 3)
(Pdb) query_points
array([[[9.77735881e+02, 8.49205850e+02, 1.55000000e+02],
        [8.65673841e+02, 8.64507419e+02, 1.63000000e+02],
        [9.34914097e+02, 6.26666269e+02, 5.00000000e+00],
        [8.47458578e+02, 8.88853920e+02, 1.46000000e+02],
        [9.30737061e+02, 6.54952400e+02, 1.30000000e+01],
        [9.62066336e+02, 6.95222371e+02, 3.90000000e+01],
        [9.72375402e+02, 8.59423373e+02, 1.21000000e+02],
        [9.69111316e+02, 6.39282901e+02, 2.30000000e+01],
...
(Pdb) target_points.shape
(179, 256, 2)
(Pdb) target_points
array([[[935.05677151, 638.0139997 ],
        [917.54683821, 643.21449828],
        [936.69168266, 629.34635142],
        ...,
        [921.22191931, 641.58799871],
        [925.87548293, 630.90202337],
        [928.2173429 , 647.11338822]],

       [[935.15884225, 639.99935385],
        [917.37813274, 645.26049502],
        [936.8280104 , 631.20069439],
        ...,

shivanimall commented 2 months ago

@skoppula thanks a lot for your help and explanation! I am also curious to know more about the reasons / pros / cons, and will open a new post for it.

skoppula commented 1 month ago

Hey again, hope all is well! Sorry for the late reply -- didn't get the notification for some reason (maybe because the issue was closed?).

I am trying to interpret these x, y values below

I suspect these values are from TAPVid-2D (not TAPVid-3D), and probably represent the pixel values (not meters) for the 2D tracking task. TAPVid-3D has "ground truth" saved as a tensor tracks_xyz with shape (# of frames, # of tracks, 3) corresponding to (x, y, z) position of the point.
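
As a small synthetic illustration of that layout (not data loaded from an actual TAPVid-3D file):

import numpy as np

num_frames, num_tracks = 179, 256
tracks_xyz = np.zeros((num_frames, num_tracks, 3))  # (x, y, z) positions in meters
x, y, z = tracks_xyz[10, 5]                         # 3D position of track 5 at frame 10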

is this also required for the target_points here?

To read and visualize the raw dataset, you shouldn't need to normalize anything; you'll just need the intrinsics matrix to project from meters back into pixel space.
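
A minimal pinhole-projection sketch, assuming tracks_xyz holds (x, y, z) in meters in the camera frame and intrinsics is the usual 3x3 matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (variable names here are placeholders, not the exact API):

import numpy as np

def project_to_pixels(tracks_xyz, intrinsics):
  """tracks_xyz: (num_frames, num_tracks, 3) in meters -> (num_frames, num_tracks, 2) in pixels."""
  fx, fy = intrinsics[0, 0], intrinsics[1, 1]
  cx, cy = intrinsics[0, 2], intrinsics[1, 2]
  x, y, z = tracks_xyz[..., 0], tracks_xyz[..., 1], tracks_xyz[..., 2]
  u = fx * x / z + cx   # standard pinhole projection
  v = fy * y / z + cy
  return np.stack([u, v], axis=-1)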

For computing the TAPVid-3D metrics during evaluation, yes, you will need to run evaluation at 256x256 (this resolution is used in computing the metric position error thresholds). This is the convention we used for our numbers in the paper, as in TAPVid-2D, and it is set and indicated in the released metrics evaluation code: https://github.com/google-deepmind/tapnet/blob/main/tapnet/tapvid3d/evaluation/evaluate_model.py#L179, and also in the updated arXiv copy of the paper.
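
For the 256x256 convention, a simple rescaling sketch (original_hw and the function name are placeholders):

import numpy as np

def rescale_to_256(points_xy, original_hw):
  """points_xy: (..., 2) pixel coords in an (H, W) image -> coords in a 256x256 image."""
  h, w = original_hw
  return points_xy * np.array([256.0 / w, 256.0 / h])  # scale x by width, y by height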

If it helps, we've released a visualization Colab that loads and visualizes the samples in TAPVid-3D, which may be helpful for interpreting each of the saved tensors: https://colab.research.google.com/drive/1Ro2sE0lAvq-h0lixrUBB0oTYXEwXNr66#scrollTo=VpZckeIS5t1k.

Hopefully this clears things up, feel free to reach out here or email if you have other questions!