hkchengrex / XMem

[ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
https://hkchengrex.com/XMem/
MIT License

Running the model with frozen memory: how to? #52

Closed · max810 closed 1 year ago

max810 commented 1 year ago

Hi!

I've been doing some experiments and have the following use case:

  1. We run the model in the regular way on a full video
  2. The user manipulates the memory in some way (it doesn't matter how exactly)
  3. We want to re-run the predictions on all the frames (excluding those with provided annotated masks, obviously), but with the memory already full from the previous run and frozen - i.e. nothing added to or removed from it.

I have 2 questions:

  1. What changes should I make in the code to run the model with a "frozen" memory? I went to inference/inference_core.py#step and wanted to modify it, but there is a lot going on there.
  2. To me it makes sense to keep the working and long-term memory frozen (because they are essentially just a reference for new predictions), but what about the sensory memory? It should still be updated every frame, right?

I would appreciate your advice here.

Thanks in advance!

hkchengrex commented 1 year ago

Hi,

You can remove add_memory: https://github.com/hkchengrex/XMem/blob/139a7458b84a4829cb3bc7434cea1e7383eca5f8/inference/inference_core.py#L99-L100

This would disable updates to the working/long-term memory while keeping the sensory memory updated.

You can also remove this entire block: https://github.com/hkchengrex/XMem/blob/139a7458b84a4829cb3bc7434cea1e7383eca5f8/inference/inference_core.py#L96-L105

This would additionally disable deep updates to the sensory memory (and it will run faster). The regular per-frame update to the sensory memory is still performed. Feel free to let me know if these changes give you errors.
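
If you would rather not edit step() itself, here is a minimal sketch of the same idea (untested; freeze_memory is a made-up helper, but processor.memory is the MemoryManager that InferenceCore builds):

```python
# Untested sketch: replace MemoryManager.add_memory with a no-op so that
# step() can no longer write to the working/long-term memory. The regular
# per-frame sensory update inside step() is unaffected.
def freeze_memory(processor):
    # `processor` is an InferenceCore; `processor.memory` is its MemoryManager
    processor.memory.add_memory = lambda *args, **kwargs: None

# First pass: run processor.step(...) over the video as usual.
# Then call freeze_memory(processor) once and run processor.step(frame)
# over all frames again to get the frozen-memory predictions.
```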

(2) - yeah, that makes sense to me.

Note that the working memory will then be frozen at the end of the video. When the video is re-run from the beginning, it is no longer a "high-resolution buffer from the recent past".

max810 commented 1 year ago

Thanks, I'll try that and let you know if it works.

And you made a good point about the working memory. Regarding that, I have a question:

In the paper you state that, quote, "The first frame (with user-provided ground-truth) and the most recent $T_{min} - 1$ memory frames will be kept in the working memory as a high-resolution buffer while the remainder ($T_{max} - T_{min}$ frames) are candidates for being converted into long-term memory representations" (Section 3.3, second paragraph in the arXiv version).

If I provide multiple annotations for the video (e.g. frame 0, frame 10 and frame 30), I would want to keep all of them in the working memory forever, right? Because these are 100% correct masks and contain valuable info for the decoder.

Does the code automatically allow for that (keeping feature maps with provided annotations in the working memory forever, besides frame 0)? Or should I implement it myself?

I assume the part of the code responsible for selecting candidates is inference/memory_manager.py#L234, and by specifying HW as the start value, we skip the first feature map stored in the working memory? So, if I want to keep multiple feature maps there, I would slice the working memory multiple times using the same KeyValueMemoryStore.get_all_sliced method?
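
To make it concrete, here is a conceptual sketch of what I have in mind (illustrative names only, not the repo's code), written as a boolean mask over memory elements rather than repeated slicing:

```python
import torch

# Conceptual sketch: choose long-term consolidation candidates while
# protecting every annotated frame, not just the first one. HW is the
# number of memory elements a single frame contributes; all other names
# are made up for illustration.
def candidate_mask(num_frames, HW, protected_frames, num_recent):
    """Boolean mask over working-memory elements that may be consolidated."""
    mask = torch.ones(num_frames * HW, dtype=torch.bool)
    for f in protected_frames:             # frames with user-provided masks
        mask[f * HW:(f + 1) * HW] = False  # keep them in working memory forever
    mask[-num_recent * HW:] = False        # keep the recent high-res buffer
    return mask

# e.g. 40 frames in working memory, frames 0, 10 and 30 annotated,
# and the most recent T_min - 1 = 4 frames kept as the usual buffer
m = candidate_mask(40, 16 * 16, [0, 10, 30], 4)
print(int(m.sum()), "of", m.numel(), "elements are consolidation candidates")
```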

max810 commented 1 year ago

Also, do I understand correctly that the order of feature maps in both working and long-term memory does not matter? E.g. if we shuffle them, the predictions would not change at all?

hkchengrex commented 1 year ago

  1. Currently, we hard-coded the HW term to always keep the first frame in memory. There should be multiple ways to keep more than one frame; the one that you mentioned sounds reasonable.
  2. Yes. When you shuffle, you need to shuffle the key/value/shrinkage/... together.
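
For intuition, here is a standalone toy check (not the repo's tensors; readout is an illustrative stand-in for the memory reading, with the affinity simplified up to constants) that one shared permutation of key/shrinkage/value leaves the result unchanged:

```python
import torch

N, Ck, Cv = 128, 64, 512          # memory elements, key/value channels
key = torch.randn(Ck, N)          # memory keys
value = torch.randn(Cv, N)        # memory values
shrinkage = torch.rand(1, N)      # per-element shrinkage term
query = torch.randn(Ck, 1)        # a single query pixel

def readout(key, shrinkage, value, query):
    # simplified L2 affinity: -s * ||k - q||^2, then softmax over elements
    sim = -shrinkage * ((key - query) ** 2).sum(0, keepdim=True)
    aff = torch.softmax(sim, dim=1)
    # readout is a weighted sum over memory elements, hence order-invariant
    return value @ aff.squeeze(0)

perm = torch.randperm(N)          # one shared permutation for all tensors
out_a = readout(key, shrinkage, value, query)
out_b = readout(key[:, perm], shrinkage[:, perm], value[:, perm], query)
print(torch.allclose(out_a, out_b, atol=1e-5))  # True
```
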
jelenaam commented 1 year ago

Hi @max810, I'm having a similar issue with providing and keeping multiple annotations in the working memory. Did you manage to solve it?

max810 commented 1 year ago

Hi @jelenaam,

No, sorry, we're still working on it, but so far it requires quite a lot of code modifications, so it will definitely take some time to finish.

max810 commented 1 year ago

Hi @jelenaam,

So yeah, we did solve this problem, and wrote a paper about it: https://github.com/max810/XMem2

XMem++ is a wrapper around XMem (so it uses the same model), but it supports multiple ground-truth annotations (and keeping them in the memory permanently) out of the box.

If you also want to disable updates to the temporary memory altogether (i.e. just not use it), then please look at inference_core.py#L135 and pass False for all frames.

Big thanks to @hkchengrex for his explanations!

jelenaam commented 11 months ago

Hi @max810, Congrats on your paper! Thank you for letting me know about it.