Ablation: Pretrain first on OPT data _then_ on scientific texts?

rodrigonogueira4 commented 1 year ago

First of all, great work!

Did you try pretraining Galactica using the original OPT checkpoint as a starting point? Since both models have similar architectures and Galactica's dataset is "only" 110B tokens, I imagine that starting from a model that was pretrained on more data would bring some gains.

RJT1990 commented 1 year ago

Thanks, first author here. We considered this but didn't have time to try it. The reasons we down-weighted it were:

How far can we go with scientific text alone? This was a contrarian take we took for this work. Note we beat OPT on non-scientific tasks like BIG-Bench tasks (where we'd expect OPT to beat us!). So is fine-tuning from a general model even needed?
A scientific tokenizer is likely more efficient. The language properties of scientific text differ significantly from those of general text. While I'm sure fine-tuning would work just fine, given the nature of scientific text and modalities, we opted for a specialized model.

rodrigonogueira4 commented 1 year ago

Hi Ross, great, thanks for your reply!

RJT1990 commented 1 year ago

nw, happy to answer any other questions about the paper - let me know!

paperswithcode / galai

Ablation: Pretrain first on OPT data _then_ on scientific texts? #45