Do detailed captions improve performance?

I have a few questions about the video captions.

I noticed that the captions in the training video CSV are quite short. Will performance improve if we use detailed captions at test time?
If we also use detailed captions during training, will that improve model performance?
The current captions focus more on describing individual frames than on the video dynamics. Would captions that describe the dynamics improve results? If so, do you have any suggestions for generating such captions?

Thanks a lot!

snap-research / Panda-70M