Closed overwhelmedxh closed 4 months ago
Hi, sorry for the late reply; I was caught up with a deadline. Regarding your issue, I suspect the root cause is the batch size. For DGL on the MSRVTT dataset, we trained our model either on 4 A6000 GPUs with a batch size of 128 per card (lr 2e-3) or on 8 A100 GPUs with a batch size of 64 per card (lr 5e-3). Both setups give a total batch size of 512. If your total batch size is smaller than 512, you should adjust the learning rate accordingly. However, we haven't found a suitable learning rate for other batch sizes, so performance may be worse than what we report.
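As a rough illustration of what "adjust the learning rate accordingly" could look like, the sketch below applies the linear scaling rule, a common heuristic that scales the learning rate in proportion to the total batch size. This is an assumption, not something confirmed for DGL in this thread, and the function names are hypothetical rather than part of the DGL code. Note that the two reference setups above use different learning rates at the same total batch size of 512, so treat the result only as a starting point for a search.

```python
# Hypothetical helpers; not part of the DGL repository.

def total_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Effective batch size across all GPUs in data-parallel training."""
    return per_gpu_batch * num_gpus


def linearly_scaled_lr(ref_lr: float, ref_total_batch: int, new_total_batch: int) -> float:
    """Linear scaling rule (assumed heuristic): lr grows/shrinks with total batch size."""
    return ref_lr * new_total_batch / ref_total_batch


# Reference setup from the reply above: 4 GPUs x 128 per card, lr 2e-3.
ref_total = total_batch_size(per_gpu_batch=128, num_gpus=4)

# Example smaller setup (illustrative numbers): 4 GPUs x 32 per card.
new_total = total_batch_size(per_gpu_batch=32, num_gpus=4)

suggested_lr = linearly_scaled_lr(ref_lr=2e-3, ref_total_batch=ref_total, new_total_batch=new_total)
print(f"total batch {new_total}: try lr around {suggested_lr:.1e}")
```

If the smaller batch is forced by GPU memory, gradient accumulation is another option for reaching a total batch size of 512 without changing the learning rate, though whether the DGL training script supports it is not stated here.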
Thank you for sharing your code! The following error occurs when I train the DGL model with the provided code. Can you help me?