According to BERT's architecture, the loss is calculated as the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood. Does this implementation include the next sentence prediction loss when calculating the total loss? Does the use of the [SEP] token have any effect on the training loss?
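For context, here is a toy sketch of how I understand the combined pre-training loss from the paper: the mean negative log-likelihood over the masked positions plus the negative log-likelihood of the next-sentence label. All numbers and function names below are made up for illustration; this is not this repository's code.

```python
import math

def masked_lm_loss(log_probs, label_ids):
    # mean negative log-likelihood over the masked positions
    return -sum(lp[l] for lp, l in zip(log_probs, label_ids)) / len(label_ids)

def next_sentence_loss(log_probs, label):
    # negative log-likelihood of the binary IsNext / NotNext label
    return -log_probs[label]

# toy log-probabilities (assumed values, illustration only)
mlm_log_probs = [[math.log(0.7), math.log(0.3)],
                 [math.log(0.4), math.log(0.6)]]
mlm_labels = [0, 1]
nsp_log_probs = [math.log(0.9), math.log(0.1)]
nsp_label = 0  # IsNext

# total pre-training loss = masked LM loss + next sentence prediction loss
total_loss = masked_lm_loss(mlm_log_probs, mlm_labels) \
           + next_sentence_loss(nsp_log_probs, nsp_label)
print(round(total_loss, 4))
```

My question is whether this second term is actually present in this implementation's loss.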