mit-han-lab / hardware-aware-transformers

[ACL'20] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
https://hat.mit.edu

Quantization on HAT. #3

Closed sugeeth14 closed 4 years ago

sugeeth14 commented 4 years ago

Hi, first of all, thanks for the library. I ran a few experiments and trained a SubTransformer with a latency constraint of 75 ms. The paper mentions that HAT is quantization friendly, and now I want to quantize the SubTransformer I trained. Could you please share how you quantized it? It would be helpful. Also, is there any comparison of latency with and without quantization? Can we expect a latency reduction here? Thanks.

Hanrui-Wang commented 4 years ago

Hi Sugeeth,

Thanks for your question. We perform one-shot k-means quantization as follows: load the model in PyTorch, run k-means quantization on each weight matrix, and then store the model to perform inference testing.

The core part is like:

import numpy as np
import torch
from sklearn.cluster import KMeans

bits = 8                 # e.g. 8-bit quantization: 2**bits centroids per weight matrix
quantized_model = {}     # holds the de-quantized weights for inference testing

# `model` is the state dict of the trained SubTransformer
for key, value in model.items():
    # skip embeddings, layer norms, and biases
    if 'embed' not in key and 'norm' not in key and 'bias' not in key:
        shape = value.shape
        val = value.cpu().numpy().astype('float32').reshape(-1, 1)
        min_val, max_val = val.min(), val.max()

        # initialize the 2**bits centroids uniformly between min and max
        inits = np.linspace(min_val, max_val, num=2**bits).reshape(-1, 1)

        kmeans = KMeans(n_clusters=2**bits, init=inits, n_init=1,
                        random_state=0, verbose=0, max_iter=300).fit(val)

        centers = kmeans.cluster_centers_
        predicts = kmeans.predict(val)
        # map every weight to its centroid (de-quantize for accuracy testing)
        quantized = centers[predicts].reshape(-1)
        quantized_tensor = torch.tensor(quantized, device='cuda:0',
                                        dtype=torch.float16).reshape(shape)
        quantized_model[key] = quantized_tensor

We don't expect a latency reduction on CPU/GPU/Raspberry Pi, but we think quantization can bring latency reductions on dedicated hardware accelerators.

Thanks!

sugeeth14 commented 4 years ago

Hi, I loaded the SubTransformer model I trained in PyTorch and tried to quantize it, but the final quantized model is the same size as the original model. This is what I tried:

import torch
import argparse
from sklearn.cluster import KMeans
import numpy as np

def main(args):
    model = torch.load(args.model) #The subtransformer model is loaded

    #  model has the following keys 
    #  >>> model.keys()
    #  dict_keys(['args', 'model', 'optimizer_history', 'extra_state', 'last_optimizer_state'])

    quantized_model = model
    for key, value in model['model'].items():
        if not 'embed' in key and not 'norm' in key and not 'bias' in key:
            shape = value.shape
            val = value.cpu().numpy().astype('float32').reshape(-1, 1)
            max_val = max(val)
            min_val = min(val)
            bits = 4

            inits = np.linspace(min_val[0], max_val[0], num=2**bits).reshape(-1, 1)

            # Try/except because in a few cases I was getting a ValueError that n_samples must be >= n_clusters.
            try:

                kmeans = KMeans(n_clusters=2**bits, init=inits, random_state=0, verbose=0, max_iter=300, n_jobs=-1).fit(val)

                centers = kmeans.cluster_centers_
                predicts = kmeans.predict(val)
                quantized = [centers[predicts[i]][0] for i in range(len(predicts))]
                quantized_tensor = torch.tensor(np.array(quantized), device='cuda:0', dtype=torch.int8).reshape(shape)
                quantized_model['model'][key] = quantized_tensor
            except:
                pass
    torch.save(quantized_model, args.q_model)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-model', help='Subtransformer model to quantize')
    parser.add_argument('-q_model', help='Path to save quantized model')
    args = parser.parse_args()
    main(args)

I also tried bits=2 and dtype=torch.int8, but there was no difference. Am I missing something here? How do I achieve the 4-bit and 8-bit quantization, and are the bits mentioned here the same thing? Please help. Thanks.

Hanrui-Wang commented 4 years ago

Hi Sugeeth,

Let me clarify: the code I shared is used to test the accuracy of the quantized model, not to measure its size, because it performs the de-quantization before storing the model. To reduce the model size on the file system, storing only 'centers' and 'predicts' is enough, and please make sure to store the cluster indices ('predicts') in an integer data type instead of a floating-point type. In this way, changing 'bits' will change the actual model file size.

centers = kmeans.cluster_centers_
predicts = kmeans.predict(val)

In this case, the model de-quantization needs to be done in the testing part.
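For reference, here is a minimal sketch of that idea: it saves only the centroids (float) and the integer cluster indices per weight matrix, and rebuilds the float tensors before testing. The helper names (quantize_state_dict, dequantize_state_dict) and the uint8 index type are assumptions for illustration, not part of the HAT codebase.

import numpy as np
import torch
from sklearn.cluster import KMeans

def quantize_state_dict(state_dict, bits=4):
    # Hypothetical helper: keep only centroids + indices per weight matrix.
    # Assumes bits <= 8 so the indices fit in uint8.
    packed = {}
    for key, value in state_dict.items():
        if 'embed' in key or 'norm' in key or 'bias' in key:
            packed[key] = value  # keep these tensors in full precision
            continue
        val = value.cpu().numpy().astype('float32').reshape(-1, 1)
        inits = np.linspace(val.min(), val.max(), num=2**bits).reshape(-1, 1)
        kmeans = KMeans(n_clusters=2**bits, init=inits, n_init=1,
                        random_state=0, max_iter=300).fit(val)
        packed[key] = {
            'centers': kmeans.cluster_centers_.astype(np.float32),  # 2**bits floats
            'predicts': kmeans.predict(val).astype(np.uint8),       # one index per weight
            'shape': tuple(value.shape),
        }
    return packed

def dequantize_state_dict(packed):
    # Rebuild float tensors from centroids + indices before inference testing.
    state_dict = {}
    for key, entry in packed.items():
        if not isinstance(entry, dict):
            state_dict[key] = entry
            continue
        weights = entry['centers'][entry['predicts']].reshape(entry['shape'])
        state_dict[key] = torch.tensor(weights, dtype=torch.float16)
    return state_dict

Saved with torch.save, the packed dictionary then shrinks with the chosen bit width, since each weight costs one small integer index plus a tiny shared codebook per matrix.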


Hanrui-Wang commented 4 years ago

I'm closing this for now; feel free to reopen it if you have any further questions.