seoungwugoh / STM

Video Object Segmentation using Space-Time Memory Networks

Results on Youtube #3

Open noUmbrella opened 4 years ago

noUmbrella commented 4 years ago

Hi, I tested your released code and model on Youtube, but I cannot get the accuracy reported in the paper. Did you test this code on Youtube?

seoungwugoh commented 4 years ago

The checkpoint in this repo is different from the one used for Youtube-VOS evaluation. For Youtube-VOS evaluation, we did not use DAVIS videos for training. (This gives us a minor improvement.)

However, the provided checkpoint should also give numbers similar to those reported in our paper, with minor degradation (about 1-2 lower Overall score). What were your results?

For Youtube-VOS, there are some differences compared to DAVIS: 1) Some objects start to appear in the middle of the video. In that case, we overwrite the current mask with the new objects. 2) While the evaluation server takes results computed every 5 frames, we use all the frames for estimation. We first estimate masks for all the frames, then sample the frames to submit from there.
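Point 1) can be sketched roughly as below. This is a minimal illustration, not the repository's actual code; the function name, argument layout, and use of integer id maps are assumptions.

```python
import numpy as np

def overwrite_new_objects(pred_mask, gt_mask, new_ids):
    """Overwrite predicted mask pixels with newly-appearing objects.

    pred_mask: (H, W) int array of predicted object ids for the current frame
    gt_mask:   (H, W) int array with the ground-truth annotation of this frame
    new_ids:   ids of objects that first appear in this frame
    """
    out = pred_mask.copy()
    for obj_id in new_ids:
        # Pixels annotated as the new object win; all other pixels
        # keep the existing prediction.
        out[gt_mask == obj_id] = obj_id
    return out
```

Pixels not covered by a newly-appearing object keep the existing prediction, which matches "we overwrite the current mask with the new objects" above.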

noUmbrella commented 4 years ago

Great! It surprised me that using DAVIS videos for training degrades the performance on Youtube-VOS. Thank you for sharing. I will retest with 1) and 2) as you mentioned. Thanks.

sourabhswain commented 4 years ago

@seoungwugoh Can you please tell me how to test the pretrained model on YouTube-VOS? I tried to use the YouTube-VOS dataset instead of DAVIS17; however, I seem to get empty masks as output.

seoungwugoh commented 4 years ago

Getting an empty mask seems to be due to bugs in the code.

siyueyu commented 4 years ago

@seoungwugoh In the case that some objects start to appear in the middle of the video and the current mask is overwritten with the new objects, will the overwritten mask include the old objects?

seoungwugoh commented 4 years ago

@siyueyu Yes, we overwrite the pixels belonging to the new object. Other pixels remain the same.

sourabhswain commented 4 years ago

@seoungwugoh I can get the correct masks as predictions now; however, I keep getting an out-of-memory error when I test on YouTube-VOS. I am using all the validation frames instead of every 5th frame. The GPU I am using is a GTX 1080. Do you recommend any particular configuration for YouTube-VOS? I even played with the mem_every parameter, but I still get out-of-memory issues.

seoungwugoh commented 4 years ago

For YoutubeVOS, some videos are quite long (> 150 frames), which often causes OOM. GPU memory is mostly consumed by a large matrix inner product during memory reading. We used a V100 GPU, which has 16GB of memory, and setting a larger mem_every parameter for some videos works well. To drastically reduce memory consumption, you can consider using no intermediate memory frames (infinite mem_every). Another, more extreme, solution would be to move that inner-product part to the CPU, if you can afford the additional computation time.
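The CPU fallback could look roughly like the sketch below. This is an illustration only, assuming the flattened key layout used in space-time memory reading; the function name and shapes are hypothetical and do not come from the repository's code.

```python
import torch

def memory_read_affinity_cpu(mem_key, query_key):
    """Compute the memory-read affinity on the CPU to avoid GPU OOM,
    at the cost of extra computation time.

    mem_key:   (C, T*H*W) flattened memory keys
    query_key: (C, H*W)   flattened query keys
    Returns a (T*H*W, H*W) attention map over memory locations.
    """
    # The (T*H*W, H*W) product is what dominates GPU memory for long
    # videos, so move just this step to the CPU.
    p = torch.matmul(mem_key.cpu().t(), query_key.cpu())
    p = torch.softmax(p, dim=0)  # normalize over memory locations
    return p
```

After reading, the result can be moved back to the GPU with `.to(device)` for the value-weighting step.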

sourabhswain commented 4 years ago

@seoungwugoh Thanks for the suggestion. I ran it without any intermediate frames and could obtain results. However, I see that it doesn't consider the masks of objects which start to appear after the first frame; I get no predictions for those objects. Looking at your suggestion above in this thread, you mention that "Some objects start to appear in the middle of the video. In that case, we overwrite the current mask with the new objects." I already modified dataset.py. Is this already implemented in the uploaded code? If not, can you point out where we need to incorporate those changes? Thank you.

sourabhswain commented 4 years ago

@seoungwugoh Also, to add to what I mentioned above, I get a score of 69.4 (compared to 78.4 in the paper) on the YouTube validation set using the pre-trained model. Since I used no intermediate memory frames, I guess by default it takes only the first and the previous frame.

npmhung commented 4 years ago

@seoungwugoh Hi, I'm trying to finetune your model. In the paper, you state that batchnorm is turned off for all experiments. Just to be clear, do you turn off batchnorm only during the main training stage with videos, or also during pre-training with images?

seoungwugoh commented 4 years ago

@sourabhswain The code in this repository does not contain functionality for evaluating on Youtube-VOS. You should implement it yourself, but it will not be too difficult. To get a number similar to the paper, you should estimate masks for objects that start to appear in the middle of the video.

@npmhung We turned off BatchNorm for both pre-training and main training. In other words, we use the mean and variance learned from ImageNet. This can simply be done by setting model.eval() during training.
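A common way to apply this selectively (a sketch, assuming standard PyTorch; the helper name `freeze_bn` is not from the repository) is to put only the BatchNorm layers in eval mode, so they use their running statistics while the rest of the network still trains:

```python
import torch.nn as nn

def freeze_bn(model):
    """Put every BatchNorm layer into eval mode so it uses its running
    mean/var (e.g. the ImageNet statistics) instead of batch statistics."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()
    return model
```

Note that `model.train()` flips BatchNorm back to training mode, so `freeze_bn` has to be called again after every `model.train()` call.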

hkchengrex commented 4 years ago

@seoungwugoh Is it possible for you to also provide the checkpoint used for Youtube-VOS evaluation (I'm ok without the code)? Thanks a lot!

sourabhswain commented 4 years ago

@seoungwugoh I made the changes specific to YouTube-VOS and can now get a score of 74.17. It's still a bit off from the score mentioned in the paper (78.4). Could it be due just to the different pretrained model you uploaded here? Or do you use different hyperparameters for YouTube-VOS?

seoungwugoh commented 4 years ago

@sourabhswain It would be due to the different weights. The number in the paper (78.4) is measured using the weights trained for Youtube-VOS. Unfortunately, we have no plans to upload the weights for Youtube-VOS testing.

chenz97 commented 4 years ago

Hi @seoungwugoh, you mentioned that when objects start to appear in the middle of the video, you overwrite the current mask. So only the previous mask is affected, and the first-frame mask remains unchanged. However, the objects that appear later cannot refer to the first frame for the GT mask (since the "first frame" for them is not the first frame of the video). Can this hurt performance, or do you have any workaround for it? Thank you!