karpathy / neuraltalk2

Efficient Image Captioning code in Torch, runs on GPU

Why can a random embedding of the image feature still train the language model? #161

Closed vanpersie32 closed 7 years ago

vanpersie32 commented 7 years ago

I noticed the layer right after the VGG network: it is a linear layer that performs an embedding of the image feature. https://github.com/karpathy/neuraltalk2/blob/master/misc/net_utils.lua#L38
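For reference, that layer is just a single linear map from the CNN feature to the language model's input size. Here is a rough NumPy sketch of what it computes; the dimensions (4096-dim VGG fc7 feature, 512-dim encoding) are assumptions for illustration, not values taken from the repo:

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
cnn_feat_dim = 4096   # size of the VGG fc7 feature (assumed)
encoding_size = 512   # input size of the language model (assumed)

rng = np.random.default_rng(0)

# Randomly initialized weights and zero bias, i.e. the state of the
# linear embedding layer before any training updates reach it.
W = 0.01 * rng.standard_normal((cnn_feat_dim, encoding_size))
b = np.zeros(encoding_size)

# Stand-in for one VGG feature vector for one image.
vgg_feat = rng.standard_normal((1, cnn_feat_dim))

# The embedded feature: this is the vector the language model consumes.
embedded = vgg_feat @ W + b
print(embedded.shape)
```

Note that even with random weights this map is a fixed, deterministic projection for a given image, which is relevant to the question below.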

In the training stage this layer is not trained or finetuned, which means the output of the CNN embedding is random. https://github.com/karpathy/neuraltalk2/blob/master/train.lua#L39

So the input to the language model is random. I would expect that the model cannot be trained to a good result this way, but in fact after 100,000 iterations it reaches a CIDEr score of 0.8, which is very weird!