Hi @karpathy, thanks for that great repo!
It might be worth noting in your code that, while you train by minimizing the cross-entropy (CE) loss, Bengio actually maximized the log-likelihood. I know the two are equivalent in this case: with one-hot vectors as ground truth, the CE reduces to the negative log-likelihood of the correct token. That is not the case in general (for example, with soft targets), so a short comment might help. Thanks!
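To make the point concrete, here is a minimal sketch in plain Python (not taken from the repo; the logits, the soft target, and the helper names `softmax`, `cross_entropy`, and `neg_log_likelihood` are all made up for illustration). It compares the CE against the negative log-likelihood of the correct class, first with a one-hot target and then with a soft one:

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    # H(target, probs) = -sum_i target_i * log(probs_i).
    # Terms with target_i == 0 contribute nothing and are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def neg_log_likelihood(label, probs):
    # Negative log-probability the model assigns to the correct class.
    return -math.log(probs[label])


# Hypothetical logits for a 3-class problem; class 0 is the correct one.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
nll = neg_log_likelihood(0, probs)

# One-hot ground truth: only the correct-class term survives in the sum,
# so the CE equals the negative log-likelihood.
one_hot = [1.0, 0.0, 0.0]
ce_hard = cross_entropy(one_hot, probs)

# Soft ground truth (an assumed example, e.g. label smoothing): the other
# classes now contribute, so the CE no longer equals the NLL of class 0.
soft = [0.9, 0.05, 0.05]
ce_soft = cross_entropy(soft, probs)

print("NLL of correct class:", nll)
print("CE with one-hot target:", ce_hard)
print("CE with soft target:", ce_soft)
```

With the one-hot target the two printed values match, so minimizing the CE and maximizing the log-likelihood are the same objective; with the soft target they differ.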