huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Write a helper to reload a quantized state_dict #162

Closed dacorvo closed 4 months ago

dacorvo commented 5 months ago

Quantized weights, scales and metadata can be serialized into a state_dict that can later be reloaded and applied to a quantized model.

The process is a bit convoluted, as it requires the target model to be quantized first without any parameters (to make it quantization "aware").

The goal of this issue is to implement a helper in quantize.py with the following signature:

def requantize(model: torch.nn.Module, state_dict: Dict[str, Union[torch.Tensor, str]]):

The helper will simply quantize the model and reload the state_dict, assigning the tensors so that the correct dtypes are applied as well.
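
A minimal sketch of what such a helper could look like, assuming quantization happens in place and that load_state_dict(..., assign=True) is enough to adopt the serialized tensors with their quantized dtypes:

from typing import Dict, Union

import torch
from quanto import quantize

def requantize(model: torch.nn.Module, state_dict: Dict[str, Union[torch.Tensor, str]]):
    # Make the model quantization "aware": quantize() works in place,
    # swapping supported modules for their quantized counterparts
    quantize(model)
    # assign=True replaces the module tensors with the serialized ones,
    # preserving their quantized dtypes instead of copying values in place
    model.load_state_dict(state_dict, assign=True)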

The most important part of the issue is to write dedicated unit tests to check it works in every configuration.

Tests could, for instance, be added in a new test/model/test_requantize.py file.
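
For illustration only, a round-trip test in that file could look roughly like the following; the tiny MLP module, the qint8 choice, and the place requantize is imported from are assumptions made for this sketch:

import torch
from quanto import freeze, qint8, quantize, requantize  # requantize is the helper proposed here

class MLP(torch.nn.Module):
    # Tiny stand-in model, just for the round-trip test
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.fc1 = torch.nn.Linear(hidden, hidden)
        self.fc2 = torch.nn.Linear(hidden, hidden)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

def test_requantize_roundtrip():
    model = MLP()
    quantize(model, weights=qint8)
    freeze(model)
    state_dict = model.state_dict()

    # Reload the quantized state_dict into a fresh, non-quantized model
    reloaded = MLP()
    requantize(reloaded, state_dict)

    inputs = torch.randn(1, 32)
    assert torch.allclose(model(inputs), reloaded(inputs))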

ManoBharathi93 commented 5 months ago

Hi @dacorvo, my initial approach was:

  1. Quantize the model using the quantize function.
  2. Iterate over the state_dict and check each value: if it is a tensor, it is a weight, so I will use quantize_weight to quantize it; otherwise it is metadata. [How do I quantize the metadata?]
    state_dict[name] = qz(weight)    # here qz is quantize_weight()
    state_dict[name] = qx(metadata)  # here qx - how do I quantize the metadata?

What are scales in the state_dict? Are you referring to the bias?

What do you think of my approach ?

dacorvo commented 5 months ago

The goal of the issue is NOT to create a quantized state_dict: this is already handled. The goal here is simply to wrap the operations done in sequence when reloading quantized weights. See for instance: https://github.com/huggingface/quanto/blob/b9ee78335a6f0f90363da5909b5b749a1beaa4ce/test/model/test_quantize_mlp.py#L139

ManoBharathi93 commented 5 months ago

If I am not wrong, this is what we need to do in the requantize helper:

model = quantize(model)
model.load_state_dict(state_dict, assign=True)

dacorvo commented 5 months ago

Except that it is:

quantize(model)

because quantization happens in place

ManoBharathi93 commented 5 months ago

Thanks, I will think about the unit tests!

calmitchell617 commented 5 months ago

Looks like Mano is already working on this. I will keep an eye on the issue, in case his solution isn't accepted for whatever reason.

ManoBharathi93 commented 5 months ago

Cal, if I am not wrong, the test function can be further extended.

The process is a bit convoluted, as it requires the target model to be quantized first without any parameters (to make it quantization "aware").

@calmitchell617, as David said, we need to quantize the model first without any parameters, but in my test case the model is quantized with its parameters.

calmitchell617 commented 5 months ago

Ok, I will make a contribution soon.

I think it is important to consider (and test) the use case of requantizing a large Hugging Face Transformers model. It was trivial to requantize the MLP class in the existing tests, but it was more difficult to do so with a model loaded via from_pretrained(). The Transformers model loaded and worked fine, but it took some tinkering to get it to load in a memory-efficient way.

Here are two minimal scripts I wrote to quantize, then requantize a Transformers model, while attempting to minimize loading time and memory usage:

Code to create a quantized state dict from a HF Transformers model

from transformers import AutoModelForCausalLM
from quanto import quantize, freeze, qint4, safe_save

model = AutoModelForCausalLM.from_pretrained(
    'codellama/CodeLlama-7b-Instruct-hf',
    torch_dtype='auto',
)

quantize(model, weights=qint4)
freeze(model)
safe_save(model.state_dict(), 'llama-7b.sd')

Code to requantize the model (for inference)

Note the usage of the meta, cpu, and cuda devices, along with the to_empty() function.

from transformers import AutoModelForCausalLM
from torch import device as torch_device
from quanto import quantize, safe_load
from torch.cuda import memory_allocated

meta = torch_device('meta')
cpu = torch_device('cpu')
gpu = torch_device('cuda:0')

# Instantiate the model structure on the meta device, so no memory is
# allocated for the original full-precision weights
with meta:
    model = AutoModelForCausalLM.from_pretrained(
        'codellama/CodeLlama-7b-Instruct-hf',
        torch_dtype='auto',
    )
    quantize(model)

# Materialize empty tensors on CPU, fill them from the saved state_dict,
# then move the quantized model to the GPU
model.to_empty(device=cpu)
state_dict = safe_load('llama-7b.sd')
model.load_state_dict(state_dict)
model.to(gpu)

print(f'cuda memory used in GB: {memory_allocated(gpu) / 1e9}')

These scripts load the Llama 7B model into ~4.17 GB of VRAM, without any appreciable CPU RAM being used. This is important, because a low memory footprint is frequently the reason people look to Quanto in the first place.

@dacorvo, would you please look at the second script above? If you think my methodology looks OK, I will turn it into a function and adapt it to your testing scheme.

dacorvo commented 5 months ago

@calmitchell617 yes that looks correct. Since the initial instantiation of the model happens outside of quanto (here using transformers), I am just wondering how you can enforce the whole sequence.

calmitchell617 commented 5 months ago

That's a valid concern. I will think about that when writing the function and test accordingly.
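
For what it's worth, one way to enforce that sequence would be to wrap it into a single function; the name load_requantized below is hypothetical, and the body simply restates the second script above:

import torch
from transformers import AutoModelForCausalLM
from quanto import quantize, safe_load

def load_requantized(model_id: str, state_dict_path: str, device: str = 'cuda:0'):
    # Instantiate the model structure on the meta device, so no memory is
    # allocated for the weights that will be replaced anyway
    with torch.device('meta'):
        model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype='auto')
        quantize(model)
    # Materialize empty tensors on CPU, then fill them from the saved state_dict
    model.to_empty(device='cpu')
    model.load_state_dict(safe_load(state_dict_path))
    return model.to(device)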

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 4 months ago

This issue was closed because it has been stalled for 5 days with no activity.