wayne3771 closed this issue 6 months ago
We use the provided training scripts, and the checkpoints we released were trained with these same scripts. Could you please specify which datasets you are seeing the low accuracy on?
I tested on the ActivityNet Captions dataset for the temporal grounding task and got a low mIoU (about half of your reported results), while the checkpoints you provided do achieve the high scores. The only cause I can think of is that some features are missing in the stage-2 training (however, that should be negligible compared to the large amount of training data).
Additionally, I conducted further testing and found that the stage-3 training actually degrades the model's performance in my experiments.
There are indeed ~5% missing features, which is consistent with how our checkpoint was trained. I'm currently not sure what might be causing the low accuracy. Maybe you can try chatting with it on an arbitrary video to assess whether it has been trained into a satisfactory model.
I finally found that the mistake resulted from the batch size. Since I trained on multiple GPUs, I forgot to adjust `per_device_train_batch_size`, which made the effective training batch size larger than intended. Now I can reproduce the reported results. Anyway, thanks for your reply.
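For anyone hitting the same issue, here is a minimal sketch of the arithmetic, assuming a HuggingFace-style `Trainer` launched with DDP (the GPU count and batch sizes below are placeholder values, not the repo's actual config):

```python
# Sketch: how the effective (global) batch size is derived in a
# multi-GPU DDP run with HuggingFace-style training arguments.
# All values here are placeholders for illustration.

num_gpus = 4                      # world size, e.g. from torchrun --nproc_per_node
per_device_train_batch_size = 32  # batch size seen by EACH GPU
gradient_accumulation_steps = 1   # micro-batches accumulated per optimizer step

# The global batch size scales with the number of GPUs:
effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(effective_batch_size)  # 128 here, vs. 32 on a single GPU

# To reproduce a single-GPU recipe on N GPUs, shrink the per-device
# batch size (or the accumulation steps) so the product stays constant:
per_device_train_batch_size = per_device_train_batch_size // num_gpus  # 8
```

In short, if the reference recipe was written for one GPU, keep `per_device_train_batch_size * num_gpus * gradient_accumulation_steps` equal to the original batch size.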
Do you have any training tips? I have run three training experiments and only achieved about half of the official accuracy.