Hello,
Thank you for creating a great repository. I'm new to `x-transformers` and I'm a bit confused about the provided sample usage for image captioning:
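(Reproduced below from the README as best I remember it; the exact dimensions and hyperparameters may differ.)

```python
import torch
from x_transformers import ViTransformerWrapper, TransformerWrapper, Encoder, Decoder

# vision transformer encoder: turns the image into a sequence of patch embeddings
encoder = ViTransformerWrapper(
    image_size = 256,
    patch_size = 32,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)

# autoregressive decoder that cross-attends to the image embeddings
decoder = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        cross_attend = True
    )
)

img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

encoded = encoder(img, return_embeddings = True)
decoder(caption, context = encoded)  # logits of shape (1, 1024, 20000)
```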
I suppose the code is for model training, where pairs of `[img, caption]` are available.
Why do we feed `caption` (our target predictions) into the decoder? Shouldn't the decoder only take `encoded` as input, and produce predictions for `caption`?

How should I use the trained model for inference, when only `img` is available (and `caption` is unknown/hidden)?
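Naively, I imagine a greedy decoding loop like the sketch below (just my guess, reusing `encoder`, `decoder`, and `img` from the snippet above; `BOS_TOKEN` and `MAX_LEN` are placeholders I invented, not part of the library). Is something like this the intended approach?

```python
import torch

# placeholders I made up (not part of x-transformers)
BOS_TOKEN = 0
MAX_LEN = 64

encoded = encoder(img, return_embeddings = True)  # encode the image once

# start from a single BOS token and grow the caption one token at a time
tokens = torch.full((1, 1), BOS_TOKEN, dtype = torch.long)
for _ in range(MAX_LEN):
    logits = decoder(tokens, context = encoded)                  # (1, t, 20000)
    next_token = logits[:, -1].argmax(dim = -1, keepdim = True)  # greedy pick
    tokens = torch.cat((tokens, next_token), dim = -1)           # feed back in
```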
Thanks in advance!