google-deepmind / tapnet

Tracking Any Point (TAP)
https://deepmind-tapir.github.io/blogpost.html
Apache License 2.0
1.23k stars 115 forks

Bad Performance on Visual Odometry Image Sequences? #72

Open C-H-Chien opened 8 months ago

C-H-Chien commented 8 months ago

Hi,

I am interested in generating a bunch of feature tracks across a number of frames from a visual odometry sequence, e.g., the KITTI or EuRoC datasets. However, when I try it with the demo, the number of features is low and the feature tracks are pretty short, e.g., 5-7 frames.

In the paper, it says "Models tend to fail on real-world videos with panning". I do not fully understand what that means. Is that the reason why this method does not perform well on visual odometry sequences?

Thank you!

cdoersch commented 8 months ago

I haven't tried working with visual odometry sequences, but there are a couple of issues. First, TAPIR tends to fail when there are large changes in scale, which are common in odometry sequences. Second, there are some challenging surfaces in odometry sequences: roads have very little texture (and repetitive textures like road markings), and trees are porous.

Because of the 'uncertainty estimate', TAPIR tends to mark tracks as occluded if it isn't certain of the location; you can use the position estimates for those frames anyway and ignore the occlusion flag, but the tracks are likely to be wrong.
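If you'd rather trust the occlusion flag and keep only confidently-tracked stretches, a minimal sketch looks like this (pure NumPy; the array names and shapes are illustrative, assuming an `occluded` boolean array of shape `(num_points, num_frames)` like the one the demo produces):

```python
import numpy as np

def visible_segments(occluded, min_length=5):
    """Split each track into runs of consecutive visible frames.

    occluded: bool array of shape (num_points, num_frames), True where
    the model marks the point occluded. Returns a list of
    (point_index, start_frame, end_frame) tuples, end exclusive, for
    visible runs of at least `min_length` frames.
    """
    segments = []
    num_points, num_frames = occluded.shape
    for p in range(num_points):
        start = None
        for f in range(num_frames):
            if not occluded[p, f] and start is None:
                start = f  # a visible run begins
            elif occluded[p, f] and start is not None:
                if f - start >= min_length:
                    segments.append((p, start, f))
                start = None  # run ended at an occlusion
        if start is not None and num_frames - start >= min_length:
            segments.append((p, start, num_frames))
    return segments
```

For odometry you would then feed only these segments to the downstream pose estimator, at the cost of shorter tracks.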

Another option is to try to increase the resolution; remember that TAPIR does its initialization at 256x256. In the case of odometry sequences, points will typically remain in the same image quadrant where they started, so you might be able to improve performance by running TAPIR in a tiled fashion.
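A rough sketch of the tiling bookkeeping (pure NumPy; the function names are illustrative and not part of the repo's API), assuming points stay in their starting quadrant as described, so each quadrant can be cropped and run through the model separately:

```python
import numpy as np

def quadrant_of(points, height, width):
    """Assign each (x, y) query point to one of four image quadrants.

    Returns indices 0..3 in row-major order: top-left, top-right,
    bottom-left, bottom-right.
    """
    x, y = points[:, 0], points[:, 1]
    return (y >= height // 2).astype(int) * 2 + (x >= width // 2).astype(int)

def to_tile_coords(points, quadrants, height, width):
    """Shift full-image (x, y) coordinates into each tile's local frame.

    Each half-resolution tile can then be resized to 256x256 and
    tracked independently; invert the shift to map results back.
    """
    offsets = np.array([[0, 0],
                        [width // 2, 0],
                        [0, height // 2],
                        [width // 2, height // 2]])
    return points - offsets[quadrants]
```

The inverse mapping (adding the same offsets back to the per-tile tracks) recovers full-image coordinates; points that drift out of their tile would still be lost, which is the main limitation of this scheme.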

Otherwise, I think you'll just have to wait for fundamental TAP research to progress.

C-H-Chien commented 8 months ago

Thanks for the comments! I have a couple of questions:

  1. To what extent do you think resizing images helps with estimating good feature tracks, and with efficiency? (Right now my runs are quite slow, e.g., 20-30 mins for 50 frames at 512x512 resolution, even with a GPU.)
  2. Do you think training TAPIR on odometry sequences from scratch would resolve the issue? Thanks again!

cdoersch commented 8 months ago

How many points are you tracking? If things are set up properly, then a few dozen points should take seconds, even at 512x512 resolution. I suspect you're mostly seeing JAX compilation time.
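One way to check whether it is compilation time: with `jax.jit`, the first call with a given input shape includes tracing and compilation, so time the first call separately from later calls. A generic stdlib helper for this (illustrative, not part of the repo; the stand-in function in the usage below just fakes a one-time setup cost):

```python
import time

def time_calls(fn, *args, repeats=3):
    """Time the first call to fn separately from subsequent calls.

    For a jitted JAX function, the first call with a given input shape
    includes tracing/compilation; later calls reflect steady-state cost.
    Returns (first_call_seconds, mean_later_seconds).
    """
    t0 = time.perf_counter()
    fn(*args)
    first = time.perf_counter() - t0
    later = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        later.append(time.perf_counter() - t0)
    return first, sum(later) / len(later)
```

If the first call dominates and later calls take seconds, the bottleneck is compilation, not the model itself; keeping input shapes fixed across frames avoids recompilation.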

Training TAPIR on odometry sequences would certainly help. Probably fine-tuning would be more efficient than training from scratch (and probably more effective if you don't have a lot of data), but either should help. However, we haven't tried this. Out of curiosity, which data do you plan to use? We aren't aware of much odometry data with long-term tracks; what's available tends to rely on structure-from-motion, and it doesn't come with reliable occlusion estimates.

bhack commented 1 month ago

I have also run into many of the described problems, as well as false-positive matches, in these types of sequences.

I suppose that for odometry-like camera movement it would also require a strategy to automatically allocate query points, like: https://chiaki530.github.io/projects/leapvo/