Open rick2047 opened 1 year ago
I was going through the README and noticed here that this model performs better than the 7B LLaMA on many benchmarks, even though it's trained on a fifth of the tokens (200B vs 1T). Does anyone understand how this happened?
Probably GIGO (garbage in, garbage out): the two models are trained on different datasets, so token count alone isn't the whole story.