Hey! :) I wanted to ask: what goes into making your own 50M (or x-param) model from the LLaMA architecture? Completely disregarding pretraining, just random weights, what do I need to look out for to make inference work for such a model? I am specifically interested in GPTQ-quantizing it and running it with, for example, ExLlama.
So, what hyperparameters should I choose and what do I need to watch out for? Thank you for your time.
I think you just need to change the hidden size, the number of attention heads, and the number of layers to obtain a smaller model. Inference should just work with the transformers llama implementation. You can take a look at our 3B configuration here.
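For instance, a minimal sketch with the Hugging Face `transformers` LLaMA classes might look like this (the sizes below are placeholders for illustration, not the OpenLLaMA 3B values):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Placeholder sizes, chosen only for illustration -- not taken from any released config.
config = LlamaConfig(
    vocab_size=32000,             # LLaMA tokenizer vocabulary size
    hidden_size=512,
    intermediate_size=1536,       # roughly 8/3 * hidden_size, rounded up to a multiple of 256
    num_hidden_layers=8,
    num_attention_heads=8,        # hidden_size must be divisible by the head count
    max_position_embeddings=2048,
)

# Randomly initialized weights, no pretraining -- enough to exercise the inference path.
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```

Since the weights are random the outputs will be gibberish, but the forward pass, saving/loading, and downstream tooling can all be tested this way.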
@young-geng Thank you very much! But what hyperparameters are preferred? Is there a rule of thumb for attention heads, hidden size, etc.?
There's usually no agreed-upon configuration for such a small model, so you have a lot of freedom in defining it yourself. Maybe you can get some inspiration from Table A9 of the Chinchilla paper.
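If it helps, here is a rough back-of-the-envelope parameter count for a LLaMA-style decoder (my own approximation; it ignores RMSNorm weights and assumes an untied lm_head) that can guide the choice of sizes:

```python
def approx_llama_params(vocab_size, hidden_size, intermediate_size, num_layers):
    """Back-of-the-envelope parameter count for a LLaMA-style decoder."""
    embeddings = 2 * vocab_size * hidden_size        # token embeddings + lm_head
    attention = 4 * hidden_size * hidden_size        # q, k, v, o projections
    mlp = 3 * hidden_size * intermediate_size        # gate, up, down projections (SwiGLU)
    return embeddings + num_layers * (attention + mlp)

print(approx_llama_params(32000, 512, 1536, 8))  # ~60_000_000
```

With the placeholder sizes from the earlier sketch this comes out to roughly 60M parameters, so shrinking the hidden size or layer count a bit gets you near a 50M target.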
Thank you very much! :)