shvdiwnkozbw / SSL-UVOS

[ECCV 2024] Code for Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

How can I reproduce the results? #1

Open arunos728 opened 2 weeks ago

arunos728 commented 2 weeks ago

Thanks for sharing your interesting work.

I have a few questions about reproducing the results.

I prepared the YouTube-VIS 2019 training set (2883 training videos) and followed the training details in the paper. I used a batch size of 16 and a learning rate of 1e-4, and trained the model for 30K iterations on 2 GPUs. Training appeared to go well: performance on DAVIS-17 was 33.0 J&F at 1K iterations and 40.0 J&F at 5K iterations. However, I could not evaluate the final model (30K iterations) because the hierarchical clustering generates too many clusters for each video (over 300). My guess is that the KL distance values used in the clustering grow larger as training proceeds. I tried adjusting the tau value in the clustering, but it did not improve performance. Are there any conditions I am missing for reproducing the results?

shvdiwnkozbw commented 2 weeks ago

Based on the results at 1K and 5K iterations, the training process seems correct. What threshold did you set in the hierarchical clustering? And did you visualize the cluster assignments to see whether the clustering is too fragmented or whether there is some other cause?
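One way to check for fragmentation without a visualization is to count how many clusters survive at several thresholds. The sketch below is only an illustration under stated assumptions: it is not the clustering code from this repository. It assumes each element is a normalized distribution (for example a flattened attention map), uses a symmetric KL divergence as the distance, and merges greedily by centroid until the closest pair is farther apart than tau. The names `sym_kl` and `count_clusters` are hypothetical.

```python
import math


def sym_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two discrete distributions."""
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def count_clusters(dists, tau):
    """Greedy agglomerative clustering (assumed scheme, not the repo's).

    Repeatedly merges the two clusters whose mean distributions are
    closest, and stops once that smallest distance exceeds tau.
    Returns the number of clusters left.
    """
    clusters = [(list(d), 1) for d in dists]  # (mean distribution, size)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sym_kl(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > tau:
            break
        _, i, j = best
        (pi, ni), (pj, nj) = clusters[i], clusters[j]
        # Size-weighted mean of the two merged distributions.
        merged = [(a * ni + b * nj) / (ni + nj) for a, b in zip(pi, pj)]
        clusters.pop(j)  # j > i, so remove j first
        clusters.pop(i)
        clusters.append((merged, ni + nj))
    return len(clusters)


if __name__ == "__main__":
    # Toy data: two tight groups of distributions.
    toy = [
        [0.90, 0.05, 0.05],
        [0.88, 0.06, 0.06],
        [0.05, 0.90, 0.05],
        [0.06, 0.88, 0.06],
    ]
    for tau in (1e-4, 1.0, 5.0):
        print(f"tau={tau}: {count_clusters(toy, tau)} clusters")
```

If the cluster count only becomes reasonable at thresholds far above the one used during training, that would point to the distances themselves having grown rather than to a bug in the clustering step.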

arunos728 commented 2 weeks ago

The threshold value tau is set to 1.0. I didn't visualize the clusters, but I checked that the distance values of the final model are larger than those of the 5K-iteration model. I could get a reasonable number of clusters when I set tau to 1.5 or decreased the learning rate to 1e-5, but the performance was poor in both cases. What can I do to handle this issue?
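To make the comparison between checkpoints concrete, the pairwise distances from each model can be reduced to a few summary numbers. The sketch below is a minimal example, assuming the distances have already been dumped to plain lists of floats; the function name `summarize` and the sample values are hypothetical and not measurements from this thread.

```python
import statistics


def summarize(distances, tau):
    """Summarize a list of pairwise distances against a threshold tau."""
    ordered = sorted(distances)
    # Fraction of pairs close enough to be merged at this threshold.
    below = sum(d <= tau for d in ordered) / len(ordered)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "frac_below_tau": below,
    }


if __name__ == "__main__":
    # Placeholder values only, standing in for two checkpoints.
    early = [0.2, 0.4, 0.6, 1.2]
    late = [0.9, 1.4, 1.8, 2.5]
    for name, dists in (("early", early), ("late", late)):
        print(name, summarize(dists, tau=1.0))
```

A sharp drop in `frac_below_tau` between checkpoints would be consistent with the clustering splitting into many more clusters at a fixed tau.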