Hi,
In the GVE paper, the LRCN is modified so that class embeddings are passed at every time step to the second LSTM. I see that you are appending the one-hot class labels to the image features. This differs from the paper, which uses class embeddings computed from the average hidden state of a language model trained on the image features. You should correct this.
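For concreteness, here is a minimal sketch of the two input constructions being contrasted. All sizes and names are illustrative assumptions, not taken from this repo; the dense vector stands in for a learned class embedding and is random here.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the repo)
num_classes, feat_dim, embed_dim = 200, 1000, 512

rng = np.random.default_rng(0)
image_feat = rng.standard_normal(feat_dim)

# Variant in this implementation: one-hot class label appended to the
# image features, fed to the second LSTM at every time step.
label = 7
one_hot = np.zeros(num_classes)
one_hot[label] = 1.0
step_input_onehot = np.concatenate([image_feat, one_hot])
assert step_input_onehot.shape == (feat_dim + num_classes,)

# Variant described in the GVE paper: a dense class embedding (the average
# hidden state of a language model) appended instead. A random vector
# stands in for that embedding here.
class_embedding = rng.standard_normal(embed_dim)
step_input_embed = np.concatenate([image_feat, class_embedding])
assert step_input_embed.shape == (feat_dim + embed_dim,)
```

Either vector would then be combined with the word embedding at each decoding step before entering the second LSTM; only the class representation differs.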
It works well enough with one-hot class embeddings. If you want this implementation to learn/support LSTM class embeddings, feel free to create a pull request.