sudongwang-upc opened this issue 2 months ago
So, in decoder-only models the "predict" part plays no special role during training: every position in the sequence contributes to the loss. This holds for all decoder-only models, such as GPT.
In the code, the context and predict lengths are kept separate only to support inference, where a fixed-length context is used to produce a forecast of a specific prediction length.
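To make this concrete, here is a minimal sketch (not the repository's actual code; the toy model and variable names are illustrative) of why the context/predict split does not matter for training: each position predicts the distribution of the next point, and the negative log-likelihood is averaged over all targets, context and predict alike.

```python
import torch

torch.manual_seed(0)

context_len, predict_len = 8, 4
seq = torch.randn(1, context_len + predict_len)  # one toy time series

# Stand-in "decoder-only model": a linear layer mapping each observed point
# to the mean and scale of a Gaussian over the NEXT point. A real model
# would use causal self-attention; the loss structure is the same.
proj = torch.nn.Linear(1, 2)

inputs = seq[:, :-1].unsqueeze(-1)   # positions 0 .. T-2
targets = seq[:, 1:]                 # next-point targets, positions 1 .. T-1

params = proj(inputs)
mean = params[..., 0]
scale = torch.nn.functional.softplus(params[..., 1])  # keep scale positive

dist = torch.distributions.Normal(mean, scale)
# Training loss: NLL over ALL next-point targets, not just the last
# `predict_len` of them -- the context/predict split is irrelevant here.
loss = -dist.log_prob(targets).mean()
loss.backward()  # gradients flow from every position in the sequence
print(loss.item())
```

At inference time, by contrast, the split matters: the model is conditioned on a `context_len`-long window and rolled forward autoregressively for `predict_len` steps.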
Thank you very much for your reply. I got it!
During training, the loss is computed from the probability density (log-likelihood) of all targets in both the context and predict windows, rather than just the predict window. Why?