huangb23 / VTimeLLM

[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
https://arxiv.org/pdf/2311.18445.pdf

About Activitynet eval process #29

Closed lixuefenfen closed 2 months ago

lixuefenfen commented 5 months ago

Thank you very much for your excellent work. While reproducing your ActivityNet results on the video grounding task, I noticed that the evaluation often throws exceptions, as shown in the attached screenshots; it seems that some videos are not available. Did you encounter similar problems when evaluating on ActivityNet, and do you know what causes them? In addition, on ActivityNet my reproduced results differ noticeably from the published ones: R1@0.3: 40.58, R1@0.5: 26.97, R1@0.7: 13.55, mIoU: 28.75. I am sure I did not change any parameters in your source code. Can you think of any reason for the drop in results? I would be grateful for any help. (Two screenshots failed to upload.)
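For readers of this thread, here is a minimal sketch of how R1@m and mIoU are typically computed for temporal grounding. It is not necessarily identical to the repo's eval script; `predictions` and `ground_truths` are hypothetical parallel lists of (start, end) segments in seconds.

```python
# Minimal sketch of temporal-grounding metrics (R1@m, mIoU).
# Assumes `predictions` and `ground_truths` are parallel lists of
# (start, end) segments in seconds; the repo's eval script may differ.

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truths, thresholds=(0.3, 0.5, 0.7)):
    ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    recall = {m: 100.0 * sum(iou >= m for iou in ious) / len(ious) for m in thresholds}
    miou = 100.0 * sum(ious) / len(ious)
    return recall, miou

# Example:
# recall, miou = evaluate([(2.0, 7.5)], [(1.0, 8.0)])
# -> {0.3: 100.0, 0.5: 100.0, 0.7: 100.0}, mIoU ~78.6
```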

huangb23 commented 5 months ago

It looks like your image did not upload successfully. Indeed, some videos are not available. You can try downloading the missing videos directly from the official website. Also, could you please let us know if you used the checkpoint we provided for testing or if you trained the model yourself?
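If it helps, here is a minimal sketch for listing which annotated videos are missing locally, so you know exactly what to re-download. The annotation file name, JSON layout, and video directory below are placeholders; adjust them to match the files actually used by the eval script.

```python
# Sketch: list ActivityNet video IDs that appear in the annotations but are
# missing from the local video directory. Paths and the JSON layout are
# assumptions; adapt them to the annotation file used for evaluation.
import json
from pathlib import Path

ANNOTATION_FILE = "activitynet_val.json"        # hypothetical annotation file
VIDEO_DIR = Path("/data/activitynet/videos")    # hypothetical video directory

with open(ANNOTATION_FILE) as f:
    annotations = json.load(f)

# Video IDs referenced by the annotations (keys like "v_xxxxxxxx").
needed = set(annotations.keys())

# IDs of videos present locally, regardless of container format.
present = {p.stem for p in VIDEO_DIR.iterdir()
           if p.suffix in {".mp4", ".mkv", ".webm"}}

missing = sorted(needed - present)
print(f"{len(missing)} of {len(needed)} videos are missing:")
for vid in missing:
    print(vid)
```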

lixuefenfen commented 5 months ago

Hello, thank you for your reply. I am sure that I downloaded the dataset from the official website (http://activity-net.org/download.html); there are 19,994 videos in total. Could you tell me where I can download any videos that are missing? As for reproduction: the numbers above are from retraining on your published dataset. I recently also evaluated your released checkpoint and got mIoU: 29.46, R@0.3: 41.71, R@0.5: 28.13, R@0.7: 13.58. Overall this still does not reach the published results, though it is slightly higher than my own training results. Could the gap be caused by the incomplete test videos, or by my GPU? I used a single 24 GB RTX 3090 Ti.

huangb23 commented 5 months ago

I believe this discrepancy is within an acceptable range, and the contributing factors might include the following:

  1. Our codebase is based on LLaVA, which underwent a code refactor in November 2023. Consequently, we also modified this part of the code during our refactor in January 2024, and we retrained and uploaded the checkpoint using the new code. However, we only conducted simple tests on the new checkpoint's language capabilities and did not perform temporal grounding and dense captioning tests, which might cause discrepancies with the original code's training results.

  2. The model appears to be sensitive to hyperparameters. As mentioned in issue #26, changes in batch size can significantly affect the training results. We did not notice this during our initial training, and we made slight adjustments to the batch size and gradient_accumulation_steps based on different GPUs, which might have impacted the final results.

  3. We did not fix the random seed, and different CUDA versions, PyTorch versions, and GPUs may apply different computational optimizations. The resulting floating-point differences can also lead to variations in training outcomes (see the seed-fixing sketch after this list). As you mentioned, the checkpoint results we provided are slightly higher than those you trained, and I believe this falls within the margin of error.
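For anyone who wants to reduce run-to-run variation when retraining, a minimal sketch of the usual PyTorch seed fixing is below. This is not part of the released training code, and bitwise reproducibility is still not guaranteed across different GPUs or CUDA/PyTorch versions.

```python
# Sketch: fix the common sources of randomness before training.
# Even with this, results may differ across GPU models and CUDA/PyTorch versions.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic cuDNN kernel selection.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```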