I'm trying to reproduce the results in the paper, and it's not clear how many training iterations or epochs were done for each dataset. The default number of steps appears to be 1,300,001, but this is way too high. Could you clarify the right number for text8, 1BW, and OpenWebText?