Implement the content of the predict_model.py file.
In this file, a stateful model should be loaded and used for prediction.
The current implementation requires that an image be given as input to the model together with a sequence of length MAX_CAPTION_LENGTH, where the first element of the sequence is 1 (corresponding to the <start> token) and all the others are zeros (for masking).
Then we take the model output at the first timestep, corresponding to the softmax probability distribution over the vocabulary, and sample one word from this distribution (note that the 0-th neuron corresponds to the token <start>, which has index 1 in the vocabulary, and so on). Sampling can be performed:
- by taking a random element according to the probability distribution
- by taking the argmax of the probability distribution
- with a beam search.
These three methods should all be implemented.
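A minimal sketch of the three sampling strategies. The function names and the assumption that `probs` is a 1-D softmax output for a single timestep are illustrative, not part of the existing code:

```python
import numpy as np

def sample_random(probs, rng=None):
    """Draw one word index at random according to the distribution."""
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

def sample_argmax(probs):
    """Take the most probable word index."""
    return int(np.argmax(probs))

def beam_search_step(beams, probs_per_beam, beam_width):
    """One expansion step of beam search: extend each (sequence, log_prob)
    beam with its top candidate words and keep the beam_width best overall."""
    candidates = []
    for (seq, logp), probs in zip(beams, probs_per_beam):
        top = np.argsort(probs)[-beam_width:]          # best words for this beam
        for w in top:
            candidates.append((seq + [int(w)], logp + np.log(probs[w] + 1e-12)))
    candidates.sort(key=lambda c: c[1], reverse=True)  # highest log-probability first
    return candidates[:beam_width]
```

Note that with a stateful model, beam search additionally requires keeping a copy of the recurrent state per beam, since each hypothesis continues from a different history.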
When we sample a word, we create the input for the next timestep: the same image (which will be encoded by the model but not used - adjust this) and a new caption sequence whose first element is the index of the sampled word and all other elements are 0. Because the model is stateful, it keeps its state from the previous timestep and can continue predicting the sequence without problems. We sample until MAX_CAPTION_LENGTH is reached or until <end> is produced by the model.
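The decoding loop described above could look roughly like this. The constants, the vocabulary indices for <start> and <end>, and the `model.predict` input/output shapes are assumptions based on the surrounding project, not confirmed details:

```python
import numpy as np

MAX_CAPTION_LENGTH = 20   # assumption: same constant used during training
START_INDEX = 1           # index of <start> in the vocabulary (per the text above)
END_INDEX = 2             # assumption: index of <end> in the vocabulary

def predict_caption(model, image, sample_fn):
    """Decode one caption with a stateful model, feeding back one word
    per timestep. `sample_fn` maps a probability vector to a neuron index."""
    model.reset_states()                      # start from a clean recurrent state
    caption = np.zeros((1, MAX_CAPTION_LENGTH))
    caption[0, 0] = START_INDEX               # first input token is <start>
    words = []
    for _ in range(MAX_CAPTION_LENGTH):
        # output at the first timestep: softmax over the vocabulary
        probs = model.predict([image, caption])[0, 0]
        # neuron 0 corresponds to vocabulary index 1, so shift by one
        word = sample_fn(probs) + 1
        if word == END_INDEX:                 # stop as soon as <end> is produced
            break
        words.append(word)
        caption = np.zeros((1, MAX_CAPTION_LENGTH))
        caption[0, 0] = word                  # feed the sampled word back in
    return words
```

The stateful model carries the recurrent state across `predict` calls, which is why each step only needs the newly sampled word in position 0 of the caption input.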