Open · Iambestfeed opened this issue 5 months ago
@Iambestfeed Why would you want to include `lm_head` for quantization? Embedding models do not need it anyway.
@intfloat I'm looking at quantization algorithms like AWQ and GPTQ, and they seem to work by minimizing a loss based on the model's output, so I was hoping to have `lm_head` so I can test whether these algorithms can be applied.
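For context, a minimal sketch of how GPTQ is typically driven through the transformers/optimum integration (auto-gptq and optimum installed); the model id and the calibration dataset here are assumptions, not confirmed in this thread. Calibration runs the model as a causal LM, which is why the missing `lm_head` matters:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "intfloat/e5-mistral-7b-instruct"  # assumption: the checkpoint under discussion
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ calibrates on causal-LM outputs over a calibration corpus ("c4" here as an example).
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Loading with a quantization_config triggers the calibration pass. If the checkpoint
# ships no lm_head weights, the head is newly initialized, so the calibration signal
# is not meaningful for the embedding model.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
```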
What about minimizing the MSE loss between the embedding vectors before and after quantization? Using `lm_head` makes little sense for embeddings; it was not fine-tuned along with the other parameters.
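To make that concrete, a rough sketch of measuring that MSE: embed the same texts with the fp16 model and a 4-bit copy loaded via bitsandbytes, then compare the vectors. The model id is an assumption, and mean pooling is used here as a simplification (the real pooling may differ):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

model_id = "intfloat/e5-mistral-7b-instruct"  # assumption: the checkpoint under discussion
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Mistral tokenizers often ship without a pad token

fp16_model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
int4_model = AutoModel.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)

def embed(model, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)  # mean pooling over valid tokens

texts = ["query: how do I quantize an embedding model?"]
mse = F.mse_loss(embed(int4_model, texts).float(), embed(fp16_model, texts).float())
print(f"MSE between fp16 and 4-bit embeddings: {mse.item():.6f}")
```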
@intfloat Hmm, fine-tuning with an MSE loss sounds like a practical idea. Do you think I should implement it with the fp16 model as the teacher and the 4-bit model as the student?
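A minimal sketch of that teacher/student idea, assuming a QLoRA-style setup: the 4-bit student gets trainable LoRA adapters (the quantized base weights themselves stay frozen) and is fit to the fp16 teacher's embeddings with MSE. It reuses `tokenizer`, `fp16_model`, `int4_model`, and `embed()` from the snippet above; the `target_modules` names and hyperparameters are assumptions:

```python
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit weights are not directly trainable, so attach LoRA adapters to the student.
student = prepare_model_for_kbit_training(int4_model)
student = get_peft_model(
    student,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj", "v_proj"]),
)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(int4_model.device)
    with torch.no_grad():
        target = embed(fp16_model, texts)            # teacher embeddings, no gradient
    hidden = student(**batch).last_hidden_state      # student forward keeps gradients
    mask = batch["attention_mask"].unsqueeze(-1)
    pred = (hidden * mask).sum(1) / mask.sum(1)      # same mean pooling as embed()
    loss = F.mse_loss(pred.float(), target.float())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```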
If you want to optimize GPU memory usage and speed up inference, that surely makes sense.
@intfloat I actually don't have many GPUs for training experiments. Do you have any other tips for optimizing GPU memory usage and speeding up inference? (I'm hoping for zero-shot approaches that don't require training.)
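Not speaking for the thread participants, but the usual training-free options are weight-only quantization at load time plus a faster attention kernel. A sketch, with the model id and the flags treated as illustrative for a recent transformers version:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Zero-shot (no training) memory/speed savings: NF4 4-bit weights via bitsandbytes,
# bf16 compute, and FlashAttention-2 if the flash-attn package is installed.
model = AutoModel.from_pretrained(
    "intfloat/e5-mistral-7b-instruct",   # assumption: the checkpoint under discussion
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```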
I had a passing idea about whether it is possible to apply quantization to embedding models based on Mistral. However, one problem is that the current checkpoint lacks the `lm_head` part, so I'm wondering whether it would make any difference to quantize the full checkpoint instead of just the embedding part.
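A small sketch of what this question boils down to, assuming the checkpoint is intfloat/e5-mistral-7b-instruct: loading it as a plain encoder versus as a causal LM. In the latter case the missing `lm_head` is newly initialized (transformers warns about it), and that random head is what output-based quantizers would end up calibrating against:

```python
from transformers import AutoModel, AutoModelForCausalLM

model_id = "intfloat/e5-mistral-7b-instruct"   # assumption: the Mistral-based embedding checkpoint

encoder = AutoModel.from_pretrained(model_id)        # what the released weights actually cover
lm = AutoModelForCausalLM.from_pretrained(model_id)  # lm_head is absent, so it is newly initialized

print(hasattr(encoder, "lm_head"))   # False: the encoder has no output head
print(hasattr(lm, "lm_head"))        # True, but its weights are random, not trained
```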