phelps-matthew / FeatherMap

Implementation of "Structured Multi-Hashing for Model Compression" (CVPR 2020)
MIT License

Quantization Aware Training + FeatherMap #12

Open varun19299 opened 3 years ago

varun19299 commented 3 years ago

If I wanted to use Quantization Aware Training (QAT) in conjunction with structured hashing, should I quantize before or after FeatherMap?

i.e., quantizing before wrapping (which intuitively seems correct to me):

import torch

# model has the appropriate torch.quantization.QuantStub() / DeQuantStub() modules
model_fp32 = ResNet50()

model_fp32.train()
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_fp32_prepared = torch.quantization.prepare_qat(model_fp32)

# feather map stuff
model_fp32_prepared = FeatherNet(model_fp32_prepared, compress=0.10)

# train loop

# eval stuff
model_fp32_prepared.eval()
model_int8 = torch.quantization.convert(model_fp32_prepared)
model_int8.deploy()

To be specific, I'm interested in the low-rank decomposition being INT8.

phelps-matthew commented 3 years ago

Yep, your intuition is correct - ideally one would aim to quantize during training (or emulate it). I would surmise that post-training quantization would perform poorly on a model that's been compressed via structured multi-hashing (SMH), due to the nonlinear mapping of weights to the reduced weight matrices (V1 and V2).

In theory, there shouldn't be any issue doing both SMH and quantization - however, I haven't dug into the implementation internals of QAT to confirm whether it's compatible with FeatherMap out of the box. Let me know how it goes!
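
To illustrate the kind of mapping I mean, here's a rough sketch (illustrative only, not the repo's actual implementation - sizes and the slicing scheme are made up): the realized layer weights are generated from the product of two small matrices, so quantizing them after training perturbs values the optimization never saw.

import torch

# SMH-style weight generation from two small trainable matrices
n, r = 64, 4                                # global matrix side length, inner rank
V1 = torch.randn(n, r, requires_grad=True)  # trainable compressed parameters
V2 = torch.randn(r, n, requires_grad=True)

W_global = V1 @ V2                          # n*n virtual weights from only 2*n*r parameters

# a layer's weight tensor is filled by indexing into the flattened global matrix
layer_shape = (32, 16)
num_w = layer_shape[0] * layer_shape[1]
W_layer = W_global.flatten()[:num_w].view(layer_shape)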

phelps-matthew commented 3 years ago

On further thought, I think one should apply QAT after wrapping in FeatherMap. FeatherNet as a layer will just expose self.V1 and self.V2 as the weights to be updated, which can then be quantized (or have quantization emulated) and trained. E.g., something like

base_model = ResNet50()
f_model = FeatherNet(base_model, compress=0.10)
# now apply quantization awareness to f_model and thus V1 and V2
# train
# evaluate and convert
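
Fleshing that out a bit (untested sketch - I haven't verified that prepare_qat handles the FeatherNet wrapper cleanly):

import torch

base_model = ResNet50()                          # with QuantStub/DeQuantStub inserted
f_model = FeatherNet(base_model, compress=0.10)

# apply quantization awareness to f_model, and therefore to V1 and V2
f_model.train()
f_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
f_model_prepared = torch.quantization.prepare_qat(f_model)

# ... training loop on f_model_prepared ...

# evaluate with fake-quantization still active, then convert to a real int8 model
f_model_prepared.eval()
f_model_int8 = torch.quantization.convert(f_model_prepared)
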
varun19299 commented 3 years ago

Thank you for your reply!

How about evaluation: what order would it follow?

varun19299 commented 3 years ago

I'll also try comparing this method to iterative pruning ("To prune, or not to prune", Zhu & Gupta 2017) and some dynamic sparse training techniques (RigL, Evci et al. 2020).

phelps-matthew commented 3 years ago

> Thank you for your reply!
>
> How about evaluation: what order would it follow?

For accuracy and other metric evaluation you can make use of the GPU if you keep it in f_model.eval() mode. However, if you want to benchmark inference time, then you'd want to use f_model.deploy(). Presumably one would only need to actually go to reduced precision when deploying - if QAT can continue to emulate quantization during evaluation, I'd do that.
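
Roughly (sketch only, continuing the naming from the snippet above):

# metric evaluation: keep fake-quantization active and run on GPU
f_model_prepared.eval()
# ... validation loop for accuracy / other metrics ...

# inference benchmarking: convert to int8 and switch to deploy mode
f_model_int8 = torch.quantization.convert(f_model_prepared)
f_model_int8.deploy()   # assuming deploy() is still reachable after conversion
# ... time forward passes ...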

phelps-matthew commented 3 years ago

> I'll also try comparing this method to iterative pruning ("To prune, or not to prune", Zhu & Gupta 2017) and some dynamic sparse training techniques (RigL, Evci et al. 2020).

Awesome. One of the cool things about FeatherMap is the ability to compound it with other compression methods. I'm very curious to see what kind of performance you might get compared to 'unstacked' compression methods.

varun19299 commented 3 years ago

> For accuracy and other metric evaluation you can make use of the GPU if you keep it in f_model.eval() mode. However, if you want to benchmark inference time, then you'd want to use f_model.deploy(). Presumably one would only need to actually go to reduced precision when deploying - if QAT can continue to emulate quantization during evaluation, I'd do that.

I'm actually just interested in compressing the weights V1 and V2, so I don't need to worry about eval? (model.state_dict() should have V1 and V2?)

phelps-matthew commented 3 years ago

Yes, the state_dict will save V1 and V2 as the weights, as well as the batchnorm parameters.
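
For example (sketch - the exact key names depend on how the wrapper registers its parameters, so filter by name):

import torch

state = f_model.state_dict()

# V1 / V2 appear alongside the batchnorm parameters
v_keys = [k for k in state if "V1" in k or "V2" in k]
print(v_keys)

# save only the compressed weights
torch.save({k: state[k] for k in v_keys}, "feathermap_v1_v2.pt")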