shvdiwnkozbw / SMTC

Code for Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

How to reproduce the results? #3

Open arunos728 opened 7 months ago

arunos728 commented 7 months ago

Hello, thanks for sharing the work.

I was trying to reproduce the label propagation result (DAVIS 2017) with pretrained DINO (ViT-S/16).

I've verified that the ImageNet-pretrained DINO reaches 61.8 J&F, as reported in the paper, using the CRW code.

However, when I fine-tuned the proposed architecture on YouTube-VOS (start_twostage.sh, 10k iterations) and evaluated on DAVIS using the encoder weights, the performance was 29.1 J&F, which is far below the reported 67.6 J&F.

I've checked that the proposed losses (differ loss, bidirectional consistency loss) behaved as expected during training.

I suspect something is missing in my training setup, but I couldn't find it.

Could you upload the training log and the pretrained model so I can cross-check the result?

VarunBelagali98 commented 7 months ago

I am also facing an issue when training. The instance loss is mostly 0, and upon tracking the valid_instance variable (https://github.com/shvdiwnkozbw/SMTC/blob/main/src/train.py#L253), it is also mostly 0 throughout training. Did you face a similar issue? Is this expected?

Thanks in advance!

shvdiwnkozbw commented 7 months ago

Regarding the questions, could you share the training batch size, number of GPUs, and learning rate? A large learning rate combined with a small global batch size can cause a performance drop on label propagation compared with the original DINO. However, 29.1 J&F is not expected; please make sure the DINO weights are correctly loaded at the beginning of training.
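
A quick way to check this is something like the following. This is a generic sketch, not code from the repo; the checkpoint layout and the empty `prefix` are assumptions and may need adjusting to your model's attribute names.

```python
# Sketch: compare the encoder's tensors against the DINO checkpoint right after
# model construction, before the first training step.
import torch

def check_dino_loaded(model, dino_ckpt_path, prefix=""):
    """`prefix` and the checkpoint layout are assumptions for illustration."""
    ckpt = torch.load(dino_ckpt_path, map_location="cpu")
    # Official DINO ViT-S/16 weights ship as a plain state dict; full training
    # checkpoints nest the weights under 'teacher'. Handle both, just in case.
    state = ckpt.get("teacher", ckpt) if isinstance(ckpt, dict) else ckpt
    model_state = model.state_dict()
    matched = compared = 0
    for name, tensor in state.items():
        key = prefix + name
        if key in model_state and model_state[key].shape == tensor.shape:
            compared += 1
            matched += int(torch.allclose(model_state[key].cpu(), tensor))
    print(f"{matched}/{compared} overlapping tensors identical to the DINO checkpoint")
```

If only a small fraction of the overlapping tensors match, the weights were likely not loaded (or were loaded under mismatched key names).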

Besides, regarding the instance loss: we currently set a strict standard for filtering valid instances to ensure high-quality instance samples during training. Hence, it takes many training iterations before the model satisfies this standard, typically around 15k iterations with a global batch size of 512, and more iterations with a smaller batch size. It is also feasible to relax the valid-instance selection standard to reach a tradeoff between instance quality and learning efficiency. We will carefully check this and update a new version.
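
For illustration only (this is a generic sketch, not the exact criterion used in src/train.py), a relaxed filter could look like the following, where lowering the thresholds trades instance quality for a more frequently non-zero instance loss:

```python
# Generic illustration of a relaxed valid-instance filter over slot masks.
import torch

def valid_instance_mask(slot_masks, min_area=0.05, min_peak=0.5):
    """slot_masks: (B, K, H, W) soft masks in [0, 1] -> (B, K) boolean validity mask."""
    area = slot_masks.mean(dim=(-2, -1))   # fraction of the frame each slot covers
    peak = slot_masks.amax(dim=(-2, -1))   # how confidently the slot claims its pixels
    return (area > min_area) & (peak > min_peak)
```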

Thanks!

VarunBelagali98 commented 7 months ago

Hi Authors,

Thanks for the response. On the instance loss issue, I was running with a batch size of 16 on a single GPU. Currently I can fit a maximum batch size of 160 (4 GPUs, 40 each). If possible, could you suggest a suitable learning rate and number of iterations for this smaller batch size?

Thanks!

shvdiwnkozbw commented 7 months ago

Hi, for batch size 16 x 4 GPUs, I recommend an initial learning rate of 1e-5. You can also first try freezing the DINO-initialized ViT parameters and only tuning the slot attention and projection layers, to see whether you can obtain valid instances.
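
In PyTorch terms, that setup is roughly the following. This is a minimal sketch; `encoder` as the ViT attribute name and AdamW as the optimizer are assumptions, not the repo's actual layout.

```python
# Sketch: freeze the DINO-initialized backbone, optimize only the remaining modules.
import torch

def build_frozen_backbone_optimizer(model, lr=1e-5):
    # Freeze the DINO-initialized ViT backbone (attribute name assumed).
    for p in model.encoder.parameters():
        p.requires_grad = False
    # Optimize only the remaining trainable parts (slot attention, projection layers).
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```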

shvdiwnkozbw commented 7 months ago

We have released an updated version for more efficient training, and also attached the model weights trained with the new strategy.

arunos728 commented 7 months ago

Thanks for your detailed answer. I'll re-train the model and evaluate it as soon as possible. BTW, there is something odd about the updated label propagation results. I've tested both the ImageNet-pretrained ViT and the uploaded pretrained SMTC model. The pretrained SMTC model reaches 65.2 J&F, but the ImageNet-pretrained ViT reaches 65.9 J&F, which outperforms the SMTC model. Could you explain why this happens?