huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Saving and loading quantized models doesn't work? #136

Closed tanishqkumar closed 3 months ago

tanishqkumar commented 7 months ago

I'm interested in profiling how well various architectures do after quantizing to various WxAx configurations, and I'm using lm-eval to do so. lm-eval needs a path where a model is saved, but it seems that if one calls AutoModelForCausalLM.from_pretrained(path) after calling model.save_pretrained(path) on a model that was quantized using quanto, the quantized layers do not persist. This is problematic for lm-eval, which reads saved models from a given path: when it reads a model that was quantized and then saved, it gets back a regular unquantized model via from_pretrained. Is there any way around this, or is there a plan to save quantization information in the save_pretrained and from_pretrained methods?

dacorvo commented 7 months ago

Did you quantize your model using transformers or quanto? The transformers integration saves the quantization config during serialization. If you quantized your model using quanto, the quantization is serialized, but only quantized models can reload it for now. As a workaround, you can quantize the new model first with dummy parameters (quantize(model)) before reloading the serialized one.
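
A minimal sketch of that workaround (model_path, state_dict_path and weights are placeholders, assuming the quantized state_dict was serialized with quanto's safe_save):

from transformers import AutoModelForCausalLM
from quanto import quantize, safe_load

# Recreate the architecture, quantize it with dummy parameters,
# then overwrite those parameters with the serialized quantized weights.
model_q = AutoModelForCausalLM.from_pretrained(model_path)
quantize(model_q)
model_q.load_state_dict(safe_load(state_dict_path))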

lsb commented 6 months ago

@tanishqkumar did you freeze() after you quantized? I could only save the quantized layers after I froze

SunMarc commented 6 months ago

To complete @dacorvo's answer, we plan to add support via save_pretrained and from_pretrained for the transformers integration. For now, the only way to save the model is by using quanto.

pratyushpal commented 6 months ago

What's the right way to save/load the models after quantizing? Is there an example we can refer to?

dacorvo commented 6 months ago

So first, do not forget to freeze your model to statically convert your weights. Then, you can look at: https://github.com/huggingface/quanto/blob/main/examples/vision/image-classification/mnist/quantize_mnist_model.py Or in the tests: https://github.com/huggingface/quanto/blob/b9ee78335a6f0f90363da5909b5b749a1beaa4ce/test/model/test_quantize_mlp.py#L107

pratyushpal commented 6 months ago

Thank you for your quick response! I'm working with text generation and following 'quantize_causal_lm_model.py' in the examples, quantizing weights only. There isn't an example there of how to save the model. I'm guessing the right way to save/load is using safe_save and safe_load.

I have something like:

model = AutoModelForCausalLM.from_pretrained(model_path)
quantize(model, weights=weights)
freeze(model)
safe_save(model.state_dict(), state_dict_path)

# loading the state dict:
model_q = AutoModelForCausalLM.from_pretrained(model_path) 
model_q.load_state_dict(safe_load(state_dict_path)) # torch.load gives an error here 

What's the right way of loading the state dict in this situation?

dacorvo commented 6 months ago

You need to quantize model_q, because otherwise the model does not know how to deal with quantized weights.

model = AutoModelForCausalLM.from_pretrained(model_path)
quantize(model, weights=weights)
freeze(model)
safe_save(model.state_dict(), state_dict_path)

# loading the state dict:
model_q = AutoModelForCausalLM.from_pretrained(model_path) 
quantize(model_q)  # parameters are unimportant because they will be overridden by the state_dict
model_q.load_state_dict(safe_load(state_dict_path))

calmitchell617 commented 6 months ago

Hello, like pratyushpal, I also found the quantize_causal_lm_model.py example and then attempted save_pretrained().

I see the help wanted tag. What kind of help is needed? Maybe I can chip in.

dacorvo commented 6 months ago

Sorry, I planned to use the label to identify issues that are actually support requests, but "help wanted" is also a call for contributions, so that was not such a great idea.

calmitchell617 commented 6 months ago

Ok, no problem.

For anyone else who comes here looking for an example, here is one that is working for me. @dacorvo, does it look OK to you?

Also, a follow-up question: it is taking quite a long time to load the quantized model for inference with this methodology. Would that be fixed by the upcoming integration with save_pretrained()?

To quantize and save a model:

from transformers import AutoModelForCausalLM
from quanto import freeze, qint8, quantize, safe_save
from pathlib import Path
from time import time
import shutil

model_id = 'codellama/CodeLlama-7b-Instruct-hf'
out_path = 'out/quantized'
overall_start = time()

# make sure out_path doesn't already exist
p = Path(out_path)
if p.is_dir():
    shutil.rmtree(p)  # Path.rmdir() would fail on a non-empty directory
elif p.is_file():
    p.unlink()

start = time()
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)
print(f'Finished loading model, time taken: {time() - start:.2f} seconds')

print('Quantizing model')
start = time()
quantize(model, weights=qint8)
print(f'Finished quantizing model, time taken: {time() - start:.2f} seconds')

print('Freezing model')
start = time()
freeze(model)
print(f'Finished freezing model, time taken: {time() - start:.2f} seconds')

print('Saving model')
start = time()
safe_save(model.state_dict(), out_path)
print(f'Finished saving model, time taken: {time() - start:.2f} seconds')

print(f'Total time taken: {time() - overall_start:.2f} seconds')

Output on my computer:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.03it/s]
Finished loading model, time taken: 1.22 seconds
Quantizing model
Finished quantizing model, time taken: 32.18 seconds
Freezing model
Finished freezing model, time taken: 6.93 seconds
Saving model
Finished saving model, time taken: 8.56 seconds
Total time taken: 49.80 seconds

To load and run inference on a quantized model:

from transformers import AutoTokenizer, AutoModelForCausalLM
from time import time
from quanto import quantize, safe_load

model_id = 'codellama/CodeLlama-7b-Instruct-hf'
model_location = "out/quantized"
overall_start = time()

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

start = time()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", low_cpu_mem_usage=True)
print(f'Finished loading model, time taken: {time() - start:.2f} seconds')

print('Quantizing model')
start = time()
quantize(model)
print(f'Finished quantizing model, time taken: {time() - start:.2f} seconds')

print('Loading state dict')
start = time()
model.load_state_dict(safe_load(model_location))
print(f'Finished loading state dict, time taken: {time() - start:.2f} seconds')

print('Moving model to cuda and setting to eval mode')
start = time()
model.to("cuda")
model.eval()
print(f'Finished moving model to cuda and setting to eval mode, time taken: {time() - start:.2f} seconds')

print(f'Total time taken for model loading and quantization: {time() - overall_start:.2f} seconds')

messages = [
    {"role": "system", "content": "You are a chatbot."},
    {"role": "user", "content": "What does it take to build a great LLM?"},
]
tokenized = tokenizer.apply_chat_template(
    messages,
    return_dict=True,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    padding=True,
)
# tokenized = tokenizer(input_text, return_tensors="pt", padding=True)
input_ids = tokenized.input_ids.to("cuda")
attention_mask = tokenized.attention_mask.to("cuda")

outputs = model.generate(input_ids, attention_mask=attention_mask, max_new_tokens=256)

print(tokenizer.decode(outputs[0]))

Output on my computer:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.83it/s]
Finished loading model, time taken: 1.20 seconds
Quantizing model
Finished quantizing model, time taken: 31.91 seconds
Loading state dict
Finished loading state dict, time taken: 6.23 seconds
Moving model to cuda and setting to eval mode
Finished moving model to cuda and setting to eval mode, time taken: 2.12 seconds
Total time taken for model loading and quantization: 41.69 seconds
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
<s> [INST] <<SYS>>
You are a chatbot.
<</SYS>>

What does it take to build a great LLM? [/INST]  Building a great LLM (Master of Laws) program requires a combination of academic rigor, practical experience, and a commitment to excellence in teaching and learning. Here are some key factors to consider:

1. Academic rigor: The LLM program should be rigorous and challenging, with a focus on advanced legal research and analysis. The curriculum should include a range of courses that cover relevant legal topics, including contracts, torts, intellectual property, and international law.
2. Practical experience: LLM students should have the opportunity to gain practical experience in their chosen area of law. This can be achieved through internships, clinics, or other hands-on learning experiences.
3. Teaching and learning: The LLM program should be designed to promote effective teaching and learning. This includes the use of innovative teaching methods, such as flipped classrooms and online learning, and the incorporation of technology to enhance student engagement and interaction.
4. Collaboration and networking: The LLM program should foster collaboration and networking among students, faculty, and alumni. This can be achieved through joint research projects, seminars, and other events that bring students

dacorvo commented 6 months ago

@calmitchell617 that's correct, thank you very much for this contribution. You may be able to reduce the model loading time by using the meta device. I admit it is a bit convoluted at the moment, but you can try what is done in this test: https://github.com/huggingface/quanto/blob/b9ee78335a6f0f90363da5909b5b749a1beaa4ce/test/model/test_quantize_mlp.py#L139

dacorvo commented 6 months ago

I think a helper taking a model and a quantized state_dict as parameters and returning the quantized model might be a good idea.
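
Something like this hypothetical helper (the name and details are just an illustration, following the reload pattern shown earlier in this thread):

from quanto import quantize

def load_quantized_state_dict(model, state_dict):
    # Quantize the freshly instantiated float model with dummy parameters,
    # then overwrite them with the serialized quantized state_dict.
    quantize(model)
    model.load_state_dict(state_dict)
    return model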

calmitchell617 commented 6 months ago

I'm a relative beginner, but would be happy to try building that function.

dacorvo commented 6 months ago

OK, let me write an issue to explain a bit more what I expect.

dacorvo commented 6 months ago

Here you go: https://github.com/huggingface/quanto/issues/162.

calmitchell617 commented 6 months ago

Great, I'll give it a shot!

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

mapmeld commented 4 months ago

Maybe it'd help if the README section that says "When freezing a model, its float weights are replaced by quantized integer weights." either added "To preserve changes, you must freeze the model before running save_pretrained." or added a line to the code sample:

freeze(model)
model.save_pretrained(model_path)

Currently, the only line in the repo with save_pretrained is in the external folder, which doesn't use quanto or freeze.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

dacorvo commented 3 months ago

The recommended way to save a quanto model is through a state_dict that can later be reloaded using optimum.quanto.requantize.

dacorvo commented 3 months ago

A paragraph could be added to the README, for instance showing how to use safetensors to serialize the state_dict.
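
For example, something along these lines (a sketch only; model_id is a placeholder, and the exact requantize / quantization_map API should be checked against the current optimum.quanto version):

import json

from safetensors.torch import load_file, save_file
from transformers import AutoModelForCausalLM
from optimum.quanto import freeze, qint8, quantization_map, quantize, requantize

model_id = 'codellama/CodeLlama-7b-Instruct-hf'

# Quantize, freeze and serialize the state_dict with safetensors
model = AutoModelForCausalLM.from_pretrained(model_id)
quantize(model, weights=qint8)
freeze(model)
save_file(model.state_dict(), 'model.safetensors')
with open('quantization_map.json', 'w') as f:
    json.dump(quantization_map(model), f)

# Later: rebuild the float model, then requantize it from the serialized tensors
model = AutoModelForCausalLM.from_pretrained(model_id)
state_dict = load_file('model.safetensors')
with open('quantization_map.json') as f:
    qmap = json.load(f)
requantize(model, state_dict, quantization_map=qmap)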