microsoft / TransformerCompression

For releasing code related to compression methods for transformers, accompanying our publications
MIT License

Mistral Support #81

Open fakerybakery opened 5 months ago

fakerybakery commented 5 months ago

Hi, thanks for releasing this work! Are there any plans to release a Mistral version? Thanks!

nailimixaM commented 5 months ago

> Hi, thanks for releasing this work! Are there any plans to release a Mistral version? Thanks!

Hi! Yes, Mistral 7B is on our radar, but we don't have an implementation for it yet. Our adapter classes should make it straightforward to add any HF model; would you be up for contributing?
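
Roughly, an adapter only has to tell the slicing code where a model keeps the modules that get rotated and sliced. A purely hypothetical sketch for Mistral (class and method names here are illustrative, not our actual adapter API; Mistral uses the same HF module layout as Llama):

```python
# Hypothetical sketch only: the class and method names are illustrative, not the
# repo's real adapter API. It shows the kind of mapping an adapter provides.
from transformers import AutoModelForCausalLM


class MistralAdapterSketch:
    """Expose where Mistral (Llama-style layout) keeps the modules to rotate/slice."""

    def __init__(self, model):
        self.model = model

    def get_layers(self):
        return list(self.model.model.layers)

    def get_attention_inputs(self, layer):
        # Linear layers that read the hidden state: their input dim gets rotated/sliced.
        return [layer.self_attn.q_proj, layer.self_attn.k_proj, layer.self_attn.v_proj]

    def get_attention_output(self, layer):
        # Linear layer that writes back to the hidden state: its output dim gets rotated/sliced.
        return layer.self_attn.o_proj

    def get_mlp_inputs(self, layer):
        return [layer.mlp.gate_proj, layer.mlp.up_proj]

    def get_mlp_output(self, layer):
        return layer.mlp.down_proj


model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
adapter = MistralAdapterSketch(model)
print(len(adapter.get_layers()))  # 32 decoder blocks
```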

kno10 commented 5 months ago

Mixtral in particular (with an x: the mixture-of-experts version) could benefit a lot from this. At ~47B parameters it is slightly too large to fit in 80 GB in bfloat16. Slicing it even slightly, so that it fits on a single 80 GB GPU, would effectively halve the cost of operating it, and likely reduce latency too?
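
Rough weights-only arithmetic behind that (a quick sketch; KV cache and activations come on top):

```python
# Back-of-the-envelope: Mixtral-8x7B weights in bfloat16 vs. an 80 GiB GPU.
params = 46.7e9          # ~47B total parameters
bytes_per_param = 2      # bfloat16
weights_gib = params * bytes_per_param / 2**30
print(f"dense:      {weights_gib:.0f} GiB")          # ~87 GiB, doesn't fit on one 80 GiB card
print(f"10% sliced: {weights_gib * 0.90:.0f} GiB")   # ~78 GiB, would fit
```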

As Mistral and Mixtral are Apache-licensed, you could share the smaller sliced versions.

nailimixaM commented 5 months ago

> Mixtral in particular (with an x: the mixture-of-experts version) could benefit a lot from this. At ~47B parameters it is slightly too large to fit in 80 GB in bfloat16. Slicing it even slightly, so that it fits on a single 80 GB GPU, would effectively halve the cost of operating it, and likely reduce latency too?
>
> As Mistral and Mixtral are Apache-licensed, you could share the smaller sliced versions.

Great suggestion. For MoEs we need to modify the method slightly to account for the different architecture; they won't work out of the box with our current adapters. The computational invariance that SliceGPT relies on still applies, though, so they should be sliceable.
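
To illustrate the invariance idea (a minimal sketch only, not our implementation): an orthogonal basis Q computed from the hidden states entering a block can be folded into the weights and then truncated, and for an MoE the same Q has to be applied to every expert's MLP and to the router, not just a single MLP:

```python
import torch


def rotate_and_slice_in(weight: torch.Tensor, Q: torch.Tensor, d_new: int) -> torch.Tensor:
    # (out, in) weight that reads the hidden state: rotate its input dim, keep d_new columns.
    return (weight @ Q)[:, :d_new]


def rotate_and_slice_out(weight: torch.Tensor, Q: torch.Tensor, d_new: int) -> torch.Tensor:
    # (out, in) weight that writes the hidden state: rotate its output dim, keep d_new rows.
    return (Q.T @ weight)[:d_new, :]


d_model, d_ffn, n_experts, d_new = 8, 16, 4, 6
Q = torch.linalg.qr(torch.randn(d_model, d_model)).Q  # stand-in for the PCA basis SliceGPT computes
router = torch.randn(n_experts, d_model)               # the MoE gate also reads the hidden state
experts = [
    {"up": torch.randn(d_ffn, d_model),    # reads the hidden state
     "down": torch.randn(d_model, d_ffn)}  # writes the hidden state (Mixtral experts also have a third, gated projection)
    for _ in range(n_experts)
]

sliced_router = rotate_and_slice_in(router, Q, d_new)
sliced_experts = [
    {"up": rotate_and_slice_in(e["up"], Q, d_new),
     "down": rotate_and_slice_out(e["down"], Q, d_new)}
    for e in experts
]
```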

noah-kim-theori commented 5 months ago

I wrote a Mixtral implementation of SliceGPT. Here is my forked repository: https://github.com/noah-kim-theori/TransformerCompression (see experiments/run_mixtral_slice.py). Feel free to use it.

nailimixaM commented 5 months ago

> I wrote a Mixtral implementation of SliceGPT. Here is my forked repository: https://github.com/noah-kim-theori/TransformerCompression (see experiments/run_mixtral_slice.py). Feel free to use it.

Amazing, nice work @noah-kim-theori! Could you share some perplexity and zero-shot accuracies of a sliced Mixtral at, e.g., 25% slicing vs. dense? Running run_slicegpt_perplexity.py and run_zero_shot_tasks.py with default values would be great; that should show that SliceGPT is working as expected. Assuming that works, we'd welcome a PR adding Mixtral to the repo 👍