princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License
533 stars 39 forks

How much compute will this take? #22

Closed fakerybakery closed 8 months ago

fakerybakery commented 9 months ago

Hi, if I want to make a 1B/3B model from Mistral, do you know approximately how many dollars I'll have to spend on compute, and whether I can do it on a consumer GPU? Thanks!

xiamengzhou commented 9 months ago

We used approximately 1845 A100 GPU hours and 3310 A100 GPU hours to get the 1.3B and 2.7B models, respectively. However, the actual cost also depends heavily on your setup and cluster speed.
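The GPU-hour figures above translate to dollars by a simple multiplication against an hourly rental rate. A minimal sketch, where the per-hour prices are illustrative assumptions (A100 rental rates vary widely by provider), not quotes from the authors:

```python
# GPU-hour figures quoted above for the two Sheared-LLaMA sizes.
A100_HOURS = {"1.3B": 1845, "2.7B": 3310}

def estimate_cost(gpu_hours: float, price_per_hour: float) -> float:
    """Total dollar cost of a run at a flat hourly A100 rate."""
    return gpu_hours * price_per_hour

for size, hours in A100_HOURS.items():
    for rate in (1.0, 2.0):  # assumed $/A100-hour, for illustration only
        print(f"{size}: {hours} h x ${rate:.2f}/h = ${estimate_cost(hours, rate):,.0f}")
```

At an assumed $2/A100-hour, for example, the 1.3B run works out to roughly $3,700 and the 2.7B run to roughly $6,600; actual prices depend on your provider and cluster efficiency.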

fakerybakery commented 9 months ago

Also, are you planning to release a sheared Mistral version?

xiamengzhou commented 9 months ago

We use an in-house cluster at Princeton! I think an A100 should cost more than $0.5 per hour, though.

xiamengzhou commented 9 months ago

Also, are you planning to release a sheared Mistral version?

We intend to add support for the Mistral and Pythia models in the upcoming weeks. We are short on compute, so I am not sure if we will end up delivering these models before the next, stronger 7B model comes out.

SinanAkkoyun commented 9 months ago

Hi, if I have full control over all finetuning data, does it make the most sense to first shear the base model and then finetune on top? Or is it better to finetune in advance (or a mixture of both)? This is completely disregarding cost, purely about performance and overfitting.

xiamengzhou commented 9 months ago

Hi! Yeah, I think it makes the most sense to prune the base model first and then finetune, as it's largely believed that the abilities of language models are acquired during pre-training. This is the cleanest way to execute.

However, I am not too sure what the performance will be like when mixing pre-training and fine-tuning data for pruning -- it might help the pruning process find a submodel that better follows instructions.

SinanAkkoyun commented 9 months ago

Alright, tysm!