microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

[unilm/e5] About the full checkpoint of Mistral-E5 #1430

Open Iambestfeed opened 5 months ago

Iambestfeed commented 5 months ago

I had a passing idea: is it possible to apply quantization to embedding models based on Mistral?

However, one problem is that the current checkpoint lacks the lm_head part, so I'm wondering: would it make any difference to use the full checkpoint for quantization instead of just the embedding part?

intfloat commented 5 months ago

@Iambestfeed Why would you want to include lm_head for quantization? Embedding models do not need it anyway.

Iambestfeed commented 5 months ago

@intfloat I'm looking at quantization algorithms like AWQ and GPTQ, and they seem to work by minimizing a loss based on the model's output, so I was hoping to have lm_head available so I can test whether these algorithms can be used.
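
For concreteness, this is roughly the kind of output-based objective I have in mind, shown as a toy sketch on a single linear layer with naive round-to-nearest 4-bit quantization (not the actual AWQ/GPTQ procedure; the layer shapes and calibration data here are made up):

```python
import torch

def rtn_quantize_4bit(weight: torch.Tensor) -> torch.Tensor:
    """Naive per-output-channel round-to-nearest (RTN) 4-bit quantization."""
    qmax = 7  # symmetric int4 range is [-8, 7]
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(weight / scale), -8, qmax)
    return q * scale  # return the dequantized weights

torch.manual_seed(0)
x = torch.randn(256, 1024)          # toy "calibration" activations X
w = torch.randn(1024, 1024) * 0.02  # toy linear-layer weights W (out_features x in_features)

w_q = rtn_quantize_4bit(w)

# Reconstruction error of the layer output on calibration data: || X W^T - X W_q^T ||^2.
err = torch.mean((x @ w.T - x @ w_q.T) ** 2)
print(f"layer-output MSE after naive 4-bit RTN: {err.item():.6e}")
```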

intfloat commented 5 months ago

What about minimizing the MSE loss between the embedding vectors before and after quantization? Using lm_head makes little sense for embeddings, since it is not fine-tuned together with the other parameters.
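
Roughly something like this — a minimal sketch, assuming the `intfloat/e5-mistral-7b-instruct` checkpoint, right padding with last-token pooling, and 4-bit loading via bitsandbytes (all assumptions; adapt to your actual setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

name = "intfloat/e5-mistral-7b-instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "right"  # so last-token pooling below indexes the right position

teacher = AutoModel.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
quantized = AutoModel.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

def encode(model, texts):
    batch = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    idx = batch["attention_mask"].sum(dim=1) - 1  # position of the last non-pad token
    emb = hidden[torch.arange(hidden.size(0), device=hidden.device), idx]
    return F.normalize(emb.float(), dim=-1)

texts = [
    "query: how to quantize an embedding model?",
    "passage: GPTQ and AWQ are post-training weight quantization methods.",
]
mse = F.mse_loss(encode(quantized, texts), encode(teacher, texts))
print(f"embedding MSE, fp16 vs 4-bit: {mse.item():.6f}")
```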

Iambestfeed commented 5 months ago

@intfloat Hmm, fine-tuning with an MSE loss sounds like a practical idea. Do you think I should implement it with the fp16 model as the teacher and the 4-bit model as the student?
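
Something like the following is what I'm imagining — a minimal sketch assuming a QLoRA-style setup (LoRA adapters trained on top of the 4-bit student, since the quantized base weights are frozen); the checkpoint name, target modules, pooling, and training data are placeholders:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

name = "intfloat/e5-mistral-7b-instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "right"

teacher = AutoModel.from_pretrained(name, torch_dtype=torch.float16, device_map="auto").eval()

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
student = AutoModel.from_pretrained(name, quantization_config=bnb, device_map="auto")
# The frozen 4-bit weights are not trainable, so attach small LoRA adapters instead.
student = get_peft_model(student, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))

def embed(model, batch):
    hidden = model(**batch).last_hidden_state
    idx = batch["attention_mask"].sum(dim=1) - 1  # last non-pad token (right padding)
    emb = hidden[torch.arange(hidden.size(0), device=hidden.device), idx]
    return F.normalize(emb.float(), dim=-1)

optimizer = torch.optim.AdamW([p for p in student.parameters() if p.requires_grad], lr=1e-4)

corpus = ["query: what is quantization?", "passage: 4-bit weights reduce memory."]  # placeholder data
for step in range(100):  # placeholder number of steps
    batch = tok(corpus, padding=True, truncation=True, max_length=512, return_tensors="pt").to(teacher.device)
    with torch.no_grad():
        target = embed(teacher, batch)
    loss = F.mse_loss(embed(student, batch), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```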

intfloat commented 5 months ago

If you want to optimize GPU memory usage and speed up inference, that certainly makes sense.

Iambestfeed commented 5 months ago

@intfloat I actually don't have many GPUs for training experiments. Do you have any other tips for optimizing GPU memory usage and speeding up inference? (Ideally zero-shot, i.e., requiring no training.)