Closed lzl2040 closed 1 year ago
Why is it a test-time optimization method? I don't see anything that is specifically done at test time.

For each video, the model is optimized on all of that video's frames; the optimized model can then be queried to track pixels through the video. It is a "test-time" optimization method because there is no single set of weights that works across all videos — instead, you build a representation of each video from scratch, and only then can you do tracking. It's analogous to a NeRF, which requires a set of frames to build its implicit representation.

@rlee3359 's explanation is correct. Thank you! Let me know if there are any other questions.
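To make the idea concrete, here is a minimal toy sketch of per-video ("test-time") optimization. This is not the paper's actual method or API — the function names (`fit_video`, `query`) and the 1-D linear model are hypothetical stand-ins. The point is the workflow: for each video, fresh parameters are fit from scratch on that video's frames alone, and only afterwards can you query them; nothing is shared across videos.

```python
def fit_video(frames, steps=2000, lr=0.01):
    """Fit a toy per-video model y = a*t + b to (frame_index, position)
    pairs by gradient descent on the mean squared error.

    This optimization happens at "test time": it is rerun from scratch
    for every new video, instead of using pretrained shared weights.
    """
    a, b = 0.0, 0.0
    n = len(frames)
    for _ in range(steps):
        ga = gb = 0.0
        for t, y in frames:
            err = (a * t + b) - y
            ga += 2 * err * t / n  # d(MSE)/da
            gb += 2 * err / n      # d(MSE)/db
        a -= lr * ga
        b -= lr * gb
    return a, b


def query(params, t):
    """Query the fitted per-video representation at any frame index t."""
    a, b = params
    return a * t + b


# A toy "video": the tracked point moves 2 units per frame, starting at 1.
video = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)]
params = fit_video(video)        # per-video optimization ("test time")
position = query(params, 10)     # track the point at an unseen frame
```

The contrast with a feed-forward tracker is that `fit_video` is the expensive step you pay per video, after which `query` is cheap — much like optimizing a NeRF on a scene's images before rendering novel views.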