seoungwugoh / STM

Video Object Segmentation using Space-Time Memory Networks

Weight for the interactive track of DAVIS 19 challenge #15

Open zyy-cn opened 4 years ago

zyy-cn commented 4 years ago

Hi, thanks for sharing the code. I noticed that the currently released weights are for the semi-supervised track and differ from the weights you used in the interactive track of the DAVIS 19 challenge. I tested the released weights under the Davis-interactive framework, following the official challenge setting, and only achieved an AUC of 67.74 on the DAVIS 17 validation set. Do you plan to release the weights trained for the interactive track of the DAVIS 19 challenge?

seoungwugoh commented 4 years ago

The interactive version of STM is currently under review. We plan to upload the code for interactive VOS after the review process is finished.

zyy-cn commented 4 years ago

Thanks for your reply!

I have two more questions:

  1. I'm trying to train the model for the interactive VOS task with the following process:
     a) Prepare [A_image, A_mask, B_image, C_image] as input and [B_mask, C_mask] as GT.
     b) Memorize [A_image, scribble(A_mask)], where scribble(*) means drawing scribbles onto the mask according to the FP and FN areas.
     c) Segment [B_mask] with the memory of A.
     d) Memorize [B_image, scribble(B_mask)].
     e) Segment [C_mask] with the memory of A and B.
     The loss is computed with B_mask and C_mask. Is this process correct?

  2. What result (AUC of J&F) did you achieve on the DAVIS 17 validation set for the interactive VOS task with STM? And which GPU did you use for inference?

seoungwugoh commented 4 years ago

Hi, the training protocol of the interactive model is somewhat different from that of the semi-supervised model (it is described in the paper under review). In the DAVIS interactive scenario, the video does not need to be processed in sequential order, and there are multiple rounds. To briefly explain:

a) Memorize [A_image, scribble(A_mask_r0, A_mask_GT)], where the r0 mask is all zeros.
b) Segment [A_image, B_image, C_image] -> [A_mask_r1, B_mask_r1, C_mask_r1].
c) Memorize [C_image, scribble(C_mask_r1, C_mask_GT)].
d) Segment [A_image, B_image, C_image] using the two memories.

Losses are computed for all predictions.
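The rounds above can be sketched roughly as follows. This is only an illustration of the control flow, not the interactive STM code: `make_scribble`, `toy_segment`, `bce`, and `interactive_rounds` are hypothetical names, the segmentation function is a placeholder that ignores the images, and the scribble simulator simply marks every FN/FP pixel instead of drawing strokes.

```python
import numpy as np


def make_scribble(pred_mask, gt_mask):
    """Hypothetical scribble simulator (assumption, not the authors' method).

    Marks false-negative pixels with +1 and false-positive pixels with -1.
    """
    pred = pred_mask > 0.5
    gt = gt_mask > 0.5
    scribble = np.zeros(gt_mask.shape, dtype=np.int8)
    scribble[gt & ~pred] = 1    # object pixels the prediction missed (FN)
    scribble[pred & ~gt] = -1   # background pixels predicted as object (FP)
    return scribble


def toy_segment(frames, memory):
    """Placeholder for the network: sums the memorized scribbles.

    A real model would read the memorized (image, scribble) pairs and
    predict a separate mask for every frame.
    """
    votes = sum(s.astype(np.float32) for _, s in memory)
    mask = np.clip(votes, 0.0, 1.0)
    return [mask.copy() for _ in frames]


def bce(pred, gt, eps=1e-6):
    """Binary cross-entropy, used here as a stand-in loss."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p)))


def interactive_rounds(frames, gt_masks, annotated_frames, segment_fn, loss_fn):
    """One round per annotated frame; every round re-segments all frames."""
    memory = []
    # Round-0 masks are all zeros, so the first scribble covers only FN pixels.
    preds = [np.zeros_like(m, dtype=np.float32) for m in gt_masks]
    losses = []
    for idx in annotated_frames:
        scribble = make_scribble(preds[idx], gt_masks[idx])
        memory.append((frames[idx], scribble))   # memorize [image, scribble]
        preds = segment_fn(frames, memory)       # segment all frames
        # Losses are collected for all predictions of this round.
        losses.extend(loss_fn(p, g) for p, g in zip(preds, gt_masks))
    return preds, losses, memory


# Toy data: frames A, B, C with a square object that moves to the right.
rng = np.random.default_rng(0)
frames = [rng.random((8, 8, 3)) for _ in range(3)]
gt_masks = []
for shift in range(3):
    m = np.zeros((8, 8), dtype=np.float32)
    m[2:5, 2 + shift:5 + shift] = 1.0
    gt_masks.append(m)

# Round 1 annotates frame A (index 0), round 2 annotates frame C (index 2).
preds, losses, memory = interactive_rounds(frames, gt_masks, [0, 2], toy_segment, bce)
```

The annotated frame order `[0, 2]` mirrors steps a) and c); in an actual interactive session the frame to annotate in each round would be chosen by the user or the evaluation framework rather than fixed in advance.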

STM was modified accordingly to work in interactive mode. We used a 2080 Ti GPU. We will make the paper for interactive STM available after the review process.