Hi Sugeeth,
Thanks for your question. We perform one-shot k-means quantization as follows: load the model in PyTorch, run k-means quantization on each weight matrix, and then store the model to perform inference testing.
The core part looks like this:
```python
import numpy as np
import torch
from sklearn.cluster import KMeans

for key, value in model.items():
    # Skip embedding, layer-norm and bias parameters
    if not 'embed' in key and not 'norm' in key and not 'bias' in key:
        shape = value.shape
        val = value.cpu().numpy().astype('float32').reshape(-1, 1)
        max_val = max(val)
        min_val = min(val)
        # Initialize 2**bits cluster centers evenly spaced between the min and max weight
        inits = np.linspace(min_val[0], max_val[0], num=2**bits).reshape(-1, 1)
        # Note: n_jobs was removed from KMeans in scikit-learn 1.0; drop it on newer versions
        kmeans = KMeans(n_clusters=2**bits, init=inits, random_state=0, verbose=0,
                        max_iter=args.iter, n_jobs=-1).fit(val)
        centers = kmeans.cluster_centers_
        predicts = kmeans.predict(val)
        # Dequantize: replace each weight with its cluster center
        quantized = [centers[predicts[i]][0] for i in range(len(predicts))]
        quantized_tensor = torch.tensor(np.array(quantized), device='cuda:0',
                                        dtype=torch.float16).reshape(shape)
        quantized_model[key] = quantized_tensor
```
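For reference, a vectorized sketch of the same dequantization step (it assumes the `centers`, `predicts`, and `shape` variables from the snippet above and is equivalent to the list comprehension):

```python
# Look up every weight's cluster center at once instead of looping per element
quantized_tensor = torch.tensor(centers[predicts].reshape(shape),
                                device='cuda:0', dtype=torch.float16)
```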
We don't expect a latency reduction on CPU/GPU/Raspberry Pi, but we think that on dedicated hardware accelerators the quantization can bring latency reductions.
Thanks!
Hi, I loaded the subtransformer model I trained in PyTorch and tried to quantize it, but the final quantized model is still the same size as the original model. This is what I tried:
```python
import torch
import argparse
from sklearn.cluster import KMeans
import numpy as np


def main(args):
    model = torch.load(args.model)  # The subtransformer model is loaded
    # model has the following keys
    # >>> model.keys()
    # dict_keys(['args', 'model', 'optimizer_history', 'extra_state', 'last_optimizer_state'])
    quantized_model = model
    for key, value in model['model'].items():
        if not 'embed' in key and not 'norm' in key and not 'bias' in key:
            shape = value.shape
            val = value.cpu().numpy().astype('float32').reshape(-1, 1)
            max_val = max(val)
            min_val = min(val)
            bits = 4
            inits = np.linspace(min_val[0], max_val[0], num=2**bits).reshape(-1, 1)
            # Try/except because for a few tensors I was getting a ValueError
            # saying n_samples should be >= n_clusters.
            try:
                kmeans = KMeans(n_clusters=2**bits, init=inits, random_state=0, verbose=0,
                                max_iter=300, n_jobs=-1).fit(val)
                centers = kmeans.cluster_centers_
                predicts = kmeans.predict(val)
                quantized = [centers[predicts[i]][0] for i in range(len(predicts))]
                quantized_tensor = torch.tensor(np.array(quantized), device='cuda:0',
                                                dtype=torch.int8).reshape(shape)
                quantized_model['model'][key] = quantized_tensor
            except ValueError:
                pass
    torch.save(quantized_model, args.q_model)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-model', help='Subtransformer model to quantize')
    parser.add_argument('-q_model', help='Path to save quantized model')
    args = parser.parse_args()
    main(args)
```
I also tried bits=2 and dtype=torch.int8, but there was no difference. Am I missing something here? How do I achieve the 4-bit and the 8-bit quantization, and is the bits variable here the same thing? Please help. Thanks.
Hi Sugeeth,
Let me clarify: the code I shared is used to test the accuracy of the quantized model, not to measure its size, because the code performs the dequantization before storing the model. To reduce the model size in the file system, storing only 'centers' and 'predicts' is enough, and please make sure to store them in an integer data type instead of a floating-point type. In this way, changing 'bits' will change the real model file size.
```python
centers = kmeans.cluster_centers_
predicts = kmeans.predict(val)
```
In this case, the model dequantization needs to be done in the testing part.
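For example, here is a minimal sketch of that idea (not the exact script; the helper names and the per-tensor dictionary layout are my own illustration): keep the float codebook plus the integer cluster indices for each weight tensor, and rebuild the float weights at load/test time. With bits=8 each index takes one byte (about 4x smaller than float32), and with bits=4 two indices could additionally be packed per byte (about 8x smaller), plus a negligible 2**bits-entry codebook per tensor.

```python
import numpy as np
import torch


def compress_tensor(centers, predicts, shape, bits):
    """Keep only the codebook and the per-weight cluster indices."""
    assert bits <= 8  # indices fit in uint8; 4-bit packing left out for clarity
    return {
        'centers': centers.astype(np.float32).reshape(-1),  # 2**bits codebook values
        'indices': predicts.astype(np.uint8),                # one small int per weight
        'shape': tuple(shape),
        'bits': bits,
    }


def decompress_tensor(entry, device='cpu', dtype=torch.float16):
    """Dequantize at load time: look up each index in the codebook."""
    weights = entry['centers'][entry['indices']].reshape(entry['shape'])
    return torch.tensor(weights, device=device, dtype=dtype)
```

In the quantization loop above, this would replace `quantized_model['model'][key] = quantized_tensor` with `quantized_model['model'][key] = compress_tensor(centers, predicts, shape, bits)`, and the evaluation script would call `decompress_tensor` on each entry before loading the state dict.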
I'm closing it for now, feel free to reopen it for any further questions.
Hi, firstly, thanks for the library. I tried a few experiments and trained a subtransformer with a latency constraint of 75 ms. It is mentioned in the paper that HAT is quantization friendly, and now I want to quantize the subtransformer I trained. Can you please share how you quantized the subtransformer? It would be helpful. Also, is there any comparison with respect to latency when quantized? Can we expect a latency reduction here? Thanks.