pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

Question Regarding Code Execution in the Calibration Process. #6629

Open crinex opened 3 weeks ago

crinex commented 3 weeks ago

Dear @cccclai

I’m reviewing the code while following the guide (Export with SpinQuant) you provided for converting the Llama3.2-3B-Instruct model with Qualcomm SpinQuant. When I execute the _export_llama function in the export_llama_lib.py file, the pt2e_quantize(quantizers) function is called. Within this function, the pt2e_calibrate function is executed before the convert_pt2e function. Why is pt2e_calibrate performed before convert_pt2e here? Generally, wouldn't it make more sense to perform calibration after quantization?

Thank you

cccclai commented 3 weeks ago

Hi @crinex , thank you for checking out the Qualcomm tutorial!

Here is the process with code pointers

  1. During export_llama_lib.py, we will call .pt2e_quantize().
  2. Inside the .pt2e_quantize function, we do three things: step 1, run prepare_pt2e to insert observers; step 2, calibrate, which updates the params in the observers; step 3, convert the observers to actual quant/dequant operators (a sketch of these three steps is shown below).
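
Roughly, the flow looks like this. This is only a minimal sketch: it uses the generic XNNPACKQuantizer for illustration (the Qualcomm flow builds a QnnQuantizer instead), and model, example_inputs, and calibration_data are assumed to be defined elsewhere.

    import torch
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    from torch.ao.quantization.quantizer.xnnpack_quantizer import (
        XNNPACKQuantizer,
        get_symmetric_quantization_config,
    )

    # One way to get the pre-autograd graph module (model / example_inputs assumed defined).
    graph_module = torch.export.export_for_training(model, example_inputs).module()

    quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())

    # Step 1: insert observers into the graph.
    prepared = prepare_pt2e(graph_module, quantizer)

    # Step 2: calibration -- run representative inputs so the observers record ranges.
    for sample in calibration_data:
        prepared(*sample)

    # Step 3: swap observers for quant/dequant ops; the weights are quantized here.
    converted = convert_pt2e(prepared)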

Does it answer your question?

crinex commented 2 weeks ago

Hi @cccclai, thank you for the explanation. It helped me understand.

I have another question. We are converting the Llama-3.2-3B-Instruct model to qnn_8a8w.

While running export_llama to convert the model into a .pte file, I wanted to verify the quantization state and the changes made to the model.

So, I executed the prepare_pt2e, pt2e_calibrate, convert_pt2e, DuplicateDynamicQuantChainPass, and export_to_edge functions, and then printed the model's parameter count and size (MB).

After checking, it turned out that the model’s size and parameter count remained unchanged through each of these steps. Specifically, the parameter count is 3,606,752,256, and the size is 13,758.66796875 MB.

It doesn’t seem like the model was actually quantized. I’m curious to know exactly where quantization takes place and where the model size should decrease.

Ultimately, we want to resolve the error that occurs when generating sentences with qnn_8a8w.

Thank you for always helping so kindly.

JacobSzwejbka commented 2 weeks ago

cc @kimishpatel @jerryzh168 on when the weights are actually converted.

I think in practice this mostly happens in to_backend, since most/all quantized ops today execute in delegates for ET. Quantized embedding might be an exception: we have a pass that replaces the pattern with a quantized op and packs the weight in the top-level graph.
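
For context, here is a rough sketch of where that delegation step sits, using the XNNPACK partitioner purely as an illustration (the Qualcomm flow uses its own partitioner and compile specs); converted_model and example_inputs are assumed to come from the convert_pt2e step discussed above.

    import torch
    from executorch.exir import to_edge
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

    # Export the converted (quantized) graph module and lower it to the edge dialect.
    exported = torch.export.export(converted_model, example_inputs)
    edge = to_edge(exported)

    # to_backend hands the quantized subgraphs to the delegate, which compiles/packs
    # the already-quantized weights into its own payload.
    edge = edge.to_backend(XnnpackPartitioner())

    # Serialize to a .pte file.
    et_program = edge.to_executorch()
    with open("model.pte", "wb") as f:
        f.write(et_program.buffer)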

jerryzh168 commented 1 week ago

we quantize the weights in convert_pt2e: https://github.com/pytorch/pytorch/blob/891ba2ec8a3e2e71137fab4a8e91940a19c8272b/torch/ao/quantization/quantize_pt2e.py#L241

kimishpatel commented 1 week ago

As @jerryzh168 said, convert_pt2e should produce quantized weights. If you serialize the model at that point, you should see the impact on the file size. I don't know how you are measuring the size after the listed steps.

crinex commented 1 week ago

Dear @kimishpatel @jerryzh168 @JacobSzwejbka

I performed the functions prepare_pt2e, self.pt2e_calibrate, and convert_pt2e within the pt2e_quantize() function mentioned above, then set a debugging breakpoint. I measured the model size using the following code:

    def get_model_size(model):
        num_params = sum(p.numel() for p in model.parameters())
        param_size_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
        param_size_megabytes = param_size_bytes / (1024 ** 2)
        return num_params, param_size_megabytes

For the model argument, I kept passing the value m, defined as m = prepare_pt2e(self.pre_autograd_graph_module, composed_quantizer), and checked the results with get_model_size(m).

Is there any part where I might have made a mistake?

kimishpatel commented 1 week ago

@jerryzh168 do you know? I suspect it might be related to how const prop works. If you do torch.export.save, does that reflect in the model size?

jerryzh168 commented 1 week ago

Yeah, I think the quantized weights are not in model.parameters(); they become buffers: https://github.com/pytorch/pytorch/blob/891ba2ec8a3e2e71137fab4a8e91940a19c8272b/torch/ao/quantization/quantize_pt2e.py#L241

We typically just save the state_dict and check the file size: https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html#checking-model-size-and-accuracy-evaluation
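
A sketch of a size check along those lines, counting buffers as well as parameters (since the quantized weights end up as buffers after convert_pt2e) and also checking the serialized state_dict on disk; converted_model is assumed to be the output of convert_pt2e.

    import os
    import torch

    def get_model_size(model):
        # Count parameters *and* buffers; after convert_pt2e the quantized weights
        # are registered as buffers, so parameters() alone misses them.
        tensors = list(model.parameters()) + list(model.buffers())
        num_elems = sum(t.numel() for t in tensors)
        size_mb = sum(t.numel() * t.element_size() for t in tensors) / (1024 ** 2)
        return num_elems, size_mb

    print(get_model_size(converted_model))  # converted_model assumed from convert_pt2e

    # Or serialize the state_dict and look at the on-disk size, as in the tutorial.
    torch.save(converted_model.state_dict(), "quantized_state_dict.pt")
    print(os.path.getsize("quantized_state_dict.pt") / (1024 ** 2), "MB")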