Open mrcabbage972 opened 1 year ago
Further investigation: train.py has a --do-eval option that also computes perplexity. After running both the base model and the arxiv model through it on the arxiv dataset, I see the same discrepancy as in the dedicated perplexity script. This rules out my concern that the gap was merely an artifact of a different data/tokenization pipeline in the perplexity script vs. the training script.
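For reference, the two scripts should agree as long as they average the same token-level losses, since perplexity is just the exponential of the mean negative log-likelihood per token. A minimal standalone sketch (not the repo's actual evaluation code) of that definition:

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token).

    nll_per_token: iterable of per-token NLL values (natural log),
    e.g. the cross-entropy losses collected over an eval set.
    """
    nll_per_token = list(nll_per_token)
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Sanity check: a model that assigns uniform probability 1/10 to every
# token has per-token NLL ln(10), so its perplexity is 10.
uniform_losses = [math.log(10)] * 4
print(round(perplexity(uniform_losses), 6))  # ≈ 10.0
```

Two eval pipelines can still disagree on this number if they tokenize differently or weight documents differently when averaging, which is exactly what the --do-eval cross-check above rules out.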
Our analysis in #53 has shown that the expert models we had previously trained actually have a higher perplexity than the base model.
Here are some issues that may have caused this:
The expert models were trained with an old version of the trainer, so we don't know which wandb run they belong to or what the Pile/domain data losses were during training. Re-doing the training of one of the experts should help clarify this.