Experiments with Larger Vocabularies for Llama 2 Models?

sail-sg / scaling-with-vocab

[NeurIPS-2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623

71 stars 4 forks source link

Thank you for this interesting study on vocabulary scaling laws.

I'm curious if you ran any experiments comparing the performance of Llama 2 models with larger vocabularies as predicted by your approaches - specifically Llama 2 7B with a 57K vocabulary, Llama 2 13B with a 79K vocabulary, and Llama 2 70B with a 216K vocabulary.

If so, how did the results compare to the original Llama 2 models with 32K vocabularies? If not, do you have plans to conduct such experiments in future work? Is it bottlenecked by GPU memory wall?

It is not shown on paper but I think if it is memory problem I can help on this issue.

It would be valuable to see empirical validation of your predictions on these widely-used model scales. Thank you!

sail-sg / scaling-with-vocab

Experiments with Larger Vocabularies for Llama 2 Models? #2