Evaluation code?

Wuziyi616 commented 1 year ago

Hi, thanks for releasing this great work! I wonder if you also plan to release the evaluation code that computes the FG-ARI of the predicted segmentation masks, as well as the pre-trained weights for each dataset. I have one question regarding this metric:

According to here, MOVi-D might have a maximum of 20+3 = 23 objects in the videos
However, you use 15 slots for STEVE, which is less than this number
So I wonder how you handle the case where the number of GT objects is larger than the number of slots. I haven't tried MOVi-D before, so maybe there is no such case (i.e. a video with >15 objects) in the dataset. Or do you use Hungarian matching to pair the objects with maximum overlap? Or some other, more advanced technique?
Okay, I just checked FG-ARI, and it seems that it doesn't matter if the two numbers differ. So now my question is just about the eval code and weights.
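For readers wondering why the slot count and the object count need not match, here is a minimal sketch of FG-ARI in pure Python. It is not the code from the repository's evaluate branch; the function name `fg_ari`, the flat per-pixel label layout, and the choice of `0` as the background ID are assumptions made for illustration. The adjusted Rand index only counts pairs of foreground pixels that are grouped together or apart in each labeling, so it needs no one-to-one matching between slots and objects.

```python
from collections import Counter
from math import comb


def fg_ari(true_ids, pred_ids, background_id=0):
    """Adjusted Rand index restricted to foreground pixels (hypothetical helper).

    true_ids / pred_ids: flat sequences of integer labels, one per pixel.
    background_id: ground-truth label treated as background (assumed to be 0).
    """
    # Keep only pixels whose ground-truth label is not background.
    pairs = [(t, p) for t, p in zip(true_ids, pred_ids) if t != background_id]
    n = len(pairs)
    if n < 2:
        return 1.0  # assumed convention when there is nothing to compare

    # Contingency counts and the marginals of both labelings.
    contingency = Counter(pairs)
    true_counts = Counter(t for t, _ in pairs)
    pred_counts = Counter(p for _, p in pairs)

    # Number of pixel pairs placed together jointly, in GT, and in the prediction.
    sum_joint = sum(comb(c, 2) for c in contingency.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Both labelings are one single cluster, or both are all singletons.
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


# Label values are arbitrary: a perfect grouping scores 1 even with different IDs.
perfect = fg_ari([0, 0, 1, 1, 2, 2], [5, 5, 7, 7, 9, 9])

# Three predicted segments for two GT objects: object 1 is split in half.
split = fg_ari([1, 1, 1, 1, 2, 2], [3, 3, 4, 4, 5, 5])
```

In this sketch, a split object lowers the score but the metric stays well defined, which is consistent with the observation above that differing counts are not a problem for FG-ARI.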

singhgautam commented 1 year ago

Hi Ziyi,

Thank you for the encouraging comment!

We have now added the code to compute FG-ARI in a branch called evaluate.
We used 15 slots in MOVi-D/E because this is approximately equal to the mean number of objects in any given video. If we used 23 slots but a given video had far fewer objects, we would observe some splitting of the larger objects into parts. However, in our opinion, such splitting is not a limitation specific to STEVE (or the proposed transformer-based decoding); it likely affects any model that uses a slot-attention encoder that lacks an incentive to leave slots unused. Splitting is also not necessarily a bad thing, depending on the downstream task, as long as the splits are semantically meaningful. Finally, note that 15 slots were used consistently for all the compared models, i.e. STEVE and the baselines.
We are currently discussing the release of the model weights. If we do release them, we plan to do so around the same time that we release the datasets.

Best, Gautam.

Wuziyi616 commented 1 year ago

Thank you for your prompt reply. Indeed, I also observe object splitting in SAVi (and in the model I'm currently working on :-)) when there are many unused slots. I'm looking forward to the dataset release as well.

Wuziyi616 commented 1 year ago

I apologize if I have misunderstood the code, but according to these lines, it seems that you are using your own train/val/test split instead of the splits provided by the MOVi authors?

Also, you use phase='full' to load the testing dataset. So it seems that you are reporting FG-ARI on all the videos across the train, val, and test sets? I agree this may not be a big issue, though, since STEVE is an unsupervised segmentor and doesn't see the GT masks for the training set either.

singhgautam commented 1 year ago

All our metrics are computed on the official held-out sets. The 'full' applies to the data directory containing only the held-out videos.

singhgautam commented 1 year ago

I have now removed those lines to prevent confusion.

Wuziyi616 commented 1 year ago

I see, thanks, that's very clear now!

singhgautam / steve

Evaluation code? #2