marmot-xy / CMBS

Cross-modal background suppression for audio-visual event localization

Why can't I achieve the accuracy in the paper using your code? #5

Open libaolong4473 opened 1 month ago

libaolong4473 commented 1 month ago

I am using the same environment and parameters as you, with a GeForce RTX 3080 Ti GPU and the existing VGG-like audio and VGG visual features. Why is my best accuracy only 78.184 on the supervised task and 72.3 on the weakly supervised task? Is it because of my GPU? We look forward to your reply, thank you.

marmot-xy commented 1 month ago

The AVE dataset is relatively small, so results on it inevitably vary somewhat from run to run because of the random initialization of the model. You can try a few different random seeds or adjust the parameters, which may give higher performance. Alternatively, you can test directly with the checkpoint we provide, which should give results similar to those reported in our paper.
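
For anyone who wants to try the random-seed suggestion in a systematic way, here is a minimal sketch. It is not taken from this repository: `seed_everything`, `sweep_seeds`, and `fake_run` are hypothetical names, and `fake_run` is only a stand-in for whatever function trains the model and returns its best test accuracy. The NumPy and PyTorch seeding is attempted only if those packages are installed, on the assumption that the training code draws its randomness from them.

```python
import random
import statistics
from typing import Callable, Dict, Iterable


def seed_everything(seed: int) -> None:
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def sweep_seeds(run: Callable[[int], float], seeds: Iterable[int]) -> Dict[str, object]:
    """Call run(seed) once per seed and summarize the spread of the scores."""
    scores = {}
    for seed in seeds:
        seed_everything(seed)
        scores[seed] = run(seed)
    values = list(scores.values())
    return {
        "scores": scores,
        "mean": statistics.mean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "best_seed": max(scores, key=scores.get),
    }


def fake_run(seed: int) -> float:
    """Hypothetical placeholder: replace with real training plus evaluation."""
    return 75.0 + 3.0 * random.random()


if __name__ == "__main__":
    summary = sweep_seeds(fake_run, seeds=[0, 1, 2, 3, 4])
    print(summary)
```

Reporting the mean and standard deviation over several seeds, rather than a single run, makes it easier to judge whether a gap to the paper's numbers lies within ordinary run-to-run variance.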

libaolong4473 commented 1 month ago

Thank you for your reply!