karpathy / neuraltalk2

Efficient Image Captioning code in Torch, runs on GPU

Why can a random embedding of the image feature still train the language model? #161

Closed vanpersie32 closed 7 years ago

vanpersie32 commented 7 years ago

I noticed the layer right after the VGG network: it is a linear layer that performs an embedding of the image feature. https://github.com/karpathy/neuraltalk2/blob/master/misc/net_utils.lua#L38
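For reference, that layer is just a single linear map from the CNN feature to the language model's input size. Here is a rough NumPy sketch of what it computes; the dimensions (4096-dim VGG fc7 feature, 512-dim encoding) are assumptions for illustration, not values taken from the repo:

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
cnn_feat_dim = 4096   # size of the VGG fc7 feature (assumed)
encoding_size = 512   # input size of the language model (assumed)

rng = np.random.default_rng(0)

# Randomly initialized weights and zero bias, i.e. the state of the
# linear embedding layer before any training updates reach it.
W = 0.01 * rng.standard_normal((cnn_feat_dim, encoding_size))
b = np.zeros(encoding_size)

# Stand-in for one VGG feature vector for one image.
vgg_feat = rng.standard_normal((1, cnn_feat_dim))

# The embedded feature: this is the vector the language model consumes.
embedded = vgg_feat @ W + b
print(embedded.shape)
```

Note that even with random weights this map is a fixed, deterministic projection for a given image, which is relevant to the question below.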

In the training stage this layer is not trained or finetuned, which means the output of the CNN embedding is random. https://github.com/karpathy/neuraltalk2/blob/master/train.lua#L39

So the input to the language model is random. I would expect that the model cannot be trained to a good result this way, but in fact after 100,000 iterations it reaches a CIDEr score of 0.8, which is very weird!