Could you clarify the actual number of fine-tuning epochs used for the MSR-VTT dataset? The paper says 5 epochs are used (which is the common setting), but the config file here says 100.
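There are 20 sentences for each video in MSR-VTT. Because our implementation samples only one sentence per video in each iteration, we set epoch=100 in the config: each video is then paired with 100 sampled sentences in total, which matches the actual epoch=5 over all 20 sentences (5 × 20 = 100).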
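For anyone checking the arithmetic, here is a minimal sketch of the conversion. It assumes that one config epoch visits each video once with a single sampled sentence; `effective_epochs` and the constant names are hypothetical and are not taken from the repository.

```python
# Hypothetical helper, not part of the repository: converts the epoch count
# in the config into full passes over all video-sentence pairs.
SENTENCES_PER_VIDEO = 20  # MSR-VTT captions per video, as stated above
CONFIG_EPOCHS = 100       # value set in the config file


def effective_epochs(config_epochs, sentences_per_video, sampled_per_epoch=1):
    """Return the number of full passes over every video-sentence pair.

    Assumes each config epoch visits every video once and draws
    `sampled_per_epoch` sentences for it.
    """
    return config_epochs * sampled_per_epoch / sentences_per_video


print(effective_epochs(CONFIG_EPOCHS, SENTENCES_PER_VIDEO))  # 5.0
```

Under the same assumption, a dataset with a different number of captions per video would need the config value rescaled accordingly.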