Open arunos728 opened 7 months ago
I am also facing an issue when training. The instance loss is mostly 0, and tracking the valid_instance variable (https://github.com/shvdiwnkozbw/SMTC/blob/main/src/train.py#L253) shows it is also mostly 0 throughout training. Did you face a similar issue? Is this expected?
Thanks in advance !
Regarding the questions, could you share the training batch size, number of GPUs, and learning rate? A large learning rate with a small global batch size could result in a performance drop on label propagation vs. the original DINO. However, 29.1 J&F is not expected; please ensure the DINO weights are correctly loaded at the beginning of training.
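A quick way to confirm the DINO weights actually land in the encoder is to load with `strict=False` and inspect the key report. This is only a sketch: the model class and the checkpoint path/nesting (`"teacher"`) are assumptions, so substitute the real SMTC encoder and DINO checkpoint.

```python
import torch
from torch import nn

def load_and_check(model: nn.Module, ckpt_path: str):
    """Load a checkpoint and report which parameters did (not) match."""
    state = torch.load(ckpt_path, map_location="cpu")
    # Some DINO checkpoints nest the weights under a "teacher" key (assumption).
    if isinstance(state, dict) and "teacher" in state:
        state = state["teacher"]
    result = model.load_state_dict(state, strict=False)
    # Both lists being empty means every parameter matched by name and shape.
    print("missing keys:", result.missing_keys)
    print("unexpected keys:", result.unexpected_keys)
    return result
```

If `missing_keys` contains most of the backbone, the initialization silently fell back to random weights, which would explain a large J&F drop.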
Besides, in terms of the instance loss, we currently set a strict standard to filter valid instances, ensuring high-quality instance samples during training. Hence, it requires long training before the model satisfies this standard: typically around 15k iterations with a global batch size of 512, and more iterations with a smaller batch size. It is also feasible to relax the valid-instance selection standard to reach a tradeoff between instance quality and learning efficiency. We will carefully check this and update a new version.
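For readers hitting the same all-zero valid_instance counts: the actual selection logic lives in src/train.py, but the shape of the relaxation suggested above can be sketched as a single configurable threshold. The function name and the area criterion here are illustrative assumptions, not the repo's exact rule.

```python
import torch

def count_valid_instances(mask_area: torch.Tensor, min_area: float) -> int:
    """Count instance masks whose normalized area exceeds min_area.

    Lowering min_area admits more (lower-quality) instances, trading
    instance purity for more training signal early in optimization.
    """
    return int((mask_area > min_area).sum().item())
```

With a small batch size, monitoring this count while sweeping the threshold is a cheap way to find a setting where the instance loss becomes nonzero sooner.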
Thanks!
Hi Authors,
Thanks for the response. On the instance loss issue, I was running with a batch size of 16 on a single GPU. Currently I can fit a maximum batch size of 160 (4 GPUs, 40 each). If possible, could you please suggest a suitable learning rate and number of iterations for a small batch size?
Thanks !
Hi, for batch size 16 x 4 GPUs, I recommend using an initial learning rate of 1e-5. You can also first try freezing the DINO-initialized ViT parameters and tuning only the slot attention and projection layers, to see whether you can obtain valid instances.
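The freezing strategy above can be sketched in a few lines of PyTorch. The attribute prefixes (`slot_attention`, `projection`) are placeholders; match them to the actual module names in the SMTC model.

```python
from torch import nn

def freeze_backbone(model: nn.Module,
                    trainable_prefixes=("slot_attention", "projection")):
    """Freeze all parameters except those under the given name prefixes.

    Returns the list of trainable parameters to hand to the optimizer.
    """
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    return [p for p in model.parameters() if p.requires_grad]
```

Passing only the returned list to the optimizer keeps the frozen ViT weights exactly at their DINO initialization while the slot attention adapts.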
We have released a new version for more efficient training, and also attached the model weights trained with the new strategy.
Thanks for your detailed answer. I'll re-train the model and evaluate it as soon as possible. BTW, there is something odd about the updated label propagation result. I've tested both the ImageNet-pretrained ViT and the uploaded pretrained SMTC model. The pretrained SMTC model reaches 65.2 J&F, but the ImageNet-pretrained ViT reaches 65.9 J&F, outperforming the SMTC model. Could you explain why this happens?
Hello, thanks for sharing the work.
I was trying to reproduce the label propagation result (DAVIS 2017) with pretrained DINO (ViT-S/16).
I've verified that the ImageNet-pretrained DINO reaches 61.8 J&F, as reported in the paper, using the CRW code.
However, when I fine-tuned the proposed architecture on YouTube-VOS (start_twostage.sh, 10k iterations) and evaluated on DAVIS using the encoder weights,
the performance was 29.1 J&F, which is far below the reported one (67.6 J&F).
I've checked that the proposed losses (differ loss, bidirectional consistency loss) behaved as expected during training.
I guess something is missing during training, but I couldn't find it.
Could you upload the training log and the pretrained model so I can check the result?