This repository contains the code for a video captioning system inspired by Sequence to Sequence -- Video to Text. This system takes as input a video and generates a caption in English describing the video.
Since there are multiple captions for a single video, did you use all of the captions for training, or did you pick one random caption from the set in every epoch?
And second: during training, did you feed in the whole caption at once, or did you feed the video features plus the first word, predict the next word, and so on?
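To make the two questions concrete, here is a minimal sketch of both options. Everything here (the `captions_for_video` dict, the function names, the token conventions) is hypothetical and not taken from this repo:

```python
import random

# Hypothetical multi-caption dataset: each video has several reference captions.
captions_for_video = {
    "vid1": ["a man is cooking", "someone cooks food in a kitchen"],
}

def sample_caption(video_id, epoch_seed=None):
    # Option A: sample one random caption per video in each epoch
    # (seeding by epoch makes the choice reproducible per epoch).
    rng = random.Random(epoch_seed)
    return rng.choice(captions_for_video[video_id])

def teacher_forcing_pairs(caption):
    # Option B (teacher forcing): at each step the model conditions on the
    # video features plus the ground-truth word prefix, and must predict
    # the next word; this builds the (prefix, next-word) training pairs.
    tokens = ["<bos>"] + caption.split() + ["<eos>"]
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]
```

For example, `teacher_forcing_pairs("a man")` yields the pairs `(["<bos>"], "a")`, `(["<bos>", "a"], "man")`, and `(["<bos>", "a", "man"], "<eos>")`.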
Hey! Nice implementation.