rodrigonogueira4 closed this 1 year ago
Thanks, first author here. We considered this but didn't have time. The reasons we down-weighted it were:
Hi Ross, great, thanks for your reply!
nw, happy to answer any other questions about the paper - let me know!
First of all, great work!
Did you try pretraining Galactica using the original OPT checkpoint as a starting point? Since both models have similar architectures and Galactica's dataset is "only" 110B tokens, I imagine that starting from a model that was pretrained on more data would bring some gains.