pytorch / glow

Compiler for Neural Network hardware accelerators
Apache License 2.0

Delay the 'Quantize(Constant)->Constant' optim until after post-lowering. #2165

Open tlepley-cadence opened 5 years ago

tlepley-cadence commented 5 years ago

Currently, Constants are statically quantized early in the compilation flow: the Quantize(Constant)->Constant optimization basically happens in the glow::optimize() function.

There are multiple issues with this, potentially resulting in accuracy loss, because the scale/offset of a Constant may still have to change during the Glow compilation flow. By keeping Quantize nodes in the IR for longer, scale/offset can be changed freely without any accuracy loss for constants.

In particular, the backend may need to change these scales/offsets at post-lowering, for example:

To illustrate with an example: the convolution bias is currently quantized with a scale equal to (scale of input) * (scale of weights). We noticed that after post-lowering this equality was not necessarily true anymore, due to some lowering & optimization combinations. The bias then needs to be requantized, which can be done without loss only if the Quantize node is still in the IR.
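Below is a minimal standalone sketch (plain C++, not Glow code) of the kind of accuracy loss at stake. The numbers are invented for illustration; the point is that requantizing from already-quantized bias data accumulates rounding error, while quantizing from the original float bias with the final scale does not.

#include <cmath>
#include <cstdint>
#include <cstdio>

// Quantize a float bias into int32 with the given scale (offset assumed 0).
static int32_t quantizeBias(float value, float scale) {
  return static_cast<int32_t>(std::round(value / scale));
}

int main() {
  const float floatBias = 0.127f;

  // Scale chosen early in the flow: (scale of input) * (scale of weights).
  const float earlyScale = 0.03f * 0.01f; // ~3e-4

  // Hypothetical scale required after post-lowering, once lowering and
  // optimizations have changed the input/weights quantization parameters.
  const float finalScale = 1.3e-4f;

  // Lossless: quantize the original float bias with the final scale.
  int32_t fromFloat = quantizeBias(floatBias, finalScale); // 977

  // Lossy: the float data was dropped early, so we must dequantize the
  // early result and requantize it, picking up extra rounding error.
  int32_t early = quantizeBias(floatBias, earlyScale);
  int32_t requantized = quantizeBias(early * earlyScale, finalScale); // 976

  std::printf("from float: %d, requantized: %d\n", fromFloat, requantized);
  return 0;
}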

See: #2166

@beicy

beicy commented 5 years ago

If we delay this "Quantize(Constant)->Constant" optimization until after post-lowering, some backends may be affected. For example, after double-checking with @opti-mix, some backends require the weights/bias of a conv node to be a Constant. If we leave the "Quantize(Constant)->Constant" optimization until after that point, the weights/bias would still be a QuantizeNode, which would cause a verification failure later.

To address the accuracy loss issue, @opti-mix proposes the following approach:

void ExecutionEngine::optimizeFunction(CompilationMode mode, Function *F) {

  // Verify the function pre-optimization/lowering.
  assert(F->verify() && "Function must be valid");

  // TODO: add a hook into the backend to allow it to do some special
  // optimization based on the original graph. In this case, we can allow
  // backends to adjust the scale/offset if they want.
  backend_->preOptimization(F, mode); // <-- added

  // Optimize the graph.
  ::glow::optimize(F, mode);

...
}

This way, we don't have to separate the quantize optimization from our optimization function, and our current backends won't be affected.
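For illustration, a rough sketch of what a backend could do inside such a hook. Note that preOptimization is the hypothetical hook from the snippet above and adjustQuantizationParams is an invented helper; neither is part of the existing Backend interface.

// Sketch only: a backend's implementation of the proposed (hypothetical)
// preOptimization hook, run while Quantize(Constant) pairs still exist.
void MyBackend::preOptimization(Function *F, CompilationMode mode) {
  (void)mode;
  for (auto &node : F->getNodes()) {
    auto *Q = llvm::dyn_cast<QuantizeNode>(&node);
    if (!Q || !llvm::isa<Constant>(Q->getInput().getNode())) {
      continue;
    }
    // Backend-specific adjustment of the quantized output type, e.g. forcing
    // biasScale == inputScale * weightsScale for its convolution kernels.
    // adjustQuantizationParams(Q); // hypothetical backend helper
  }
}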

jfix71 commented 5 years ago

The source of the issue is that when we quantize a float Constant, we quantize the underlying data immediately and also change the output type to that of the Quantize node we are merging in.

I wonder if we could instead keep the underlying data as float and just change the output type. Then at the end of compilation we would have a pass across the Constants to quantize the underlying data using the final scale/offset. In cases where we access data during optimizations (e.g. deduplicating Constants, merging batch normalization into Convs, ...) we could change the accessors there to quantize the floating-point data on the fly.
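A rough sketch of what such an end-of-compilation pass could look like. The accessor and helper names below are approximations (quantizeTensorTo in particular is hypothetical), not the eventual implementation.

// Sketch only: a late pass that materializes quantized payloads for Constants
// whose declared output type is quantized but whose payload is still float.
void quantizeConstantPayloads(Module &M) {
  for (Constant *C : M.getConstants()) {
    TypeRef declaredTy = C->getOutput().getType();
    Tensor &payload = C->getPayloadMutable();
    // Skip non-quantized Constants and payloads already in the final type.
    if (!declaredTy->isQuantizedType() ||
        payload.getElementType() == declaredTy->getElementType()) {
      continue;
    }
    // Quantize the float payload using the final scale/offset that the
    // graph optimizer settled on for this Constant's output type.
    Tensor quantized = quantizeTensorTo(payload, declaredTy); // hypothetical
    payload = std::move(quantized);
  }
}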

tlepley-cadence commented 5 years ago

@beicy I don't think this would work.

This is the reason why, in PR #2166, I've kept the Quantize nodes on Constants until after post-lowering, in order to make sure we get the final scales/offsets before performing the actual conversion of constants from float to integer.

tlepley-cadence commented 5 years ago

@jfix71 I like the idea in principle, but in terms of implementation, I have the feeling that it could open a can of worms. As I see it, the idea would be to keep the payload of a Constant as float while its type would be a quantized type. Accessing the constant payload would then require the Glow infrastructure to quantize (when reading) and dequantize (when writing) on the fly.

I see some complexity and potential issues, and I may be missing some others. For instance:

In the end, there are currently not that many optimizations in Glow involving Constants. This is why I took the approach of modifying the optimizations, rather than trying to change the infrastructure to get something more generic immediately. I have the feeling that it is a pragmatic approach to solve the quantization accuracy problem in the short term. The deeper infrastructure change could be done in a second step.

jfix71 commented 5 years ago

@tlepley Yeah, changing the Tensor is definitely not ideal. I was thinking of something like a ConstantHandle subclass of Handle, which would compare the ElemTys of the backing Tensor and the ConstantHandle and do a conversion (e.g. float->quantized) only if necessary (so pre-quantized Constants wouldn't need to be converted to float). Though this approach could be a bit error-prone, as you mention, since accessing a Constant's payload Tensor via getUnsafePtr() or normal Handles would be disallowed during graph optimizations.
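To make the idea concrete, a simplified sketch of what such an accessor could look like for int8. This is illustrative only, assuming Glow's Tensor/Handle types; the class and its interface are not the actual Glow Handle API.

#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch only: a ConstantHandle-style accessor that converts on the fly when
// the backing payload's element type differs from the requested one.
class ConstantHandleInt8 {
  const Tensor &backing_; // payload, possibly still float during optimizations
  float scale_;           // final quantization params taken from the
  int32_t offset_;        // Constant's (quantized) output type

public:
  ConstantHandleInt8(const Tensor &backing, float scale, int32_t offset)
      : backing_(backing), scale_(scale), offset_(offset) {}

  int8_t raw(size_t i) const {
    // Pre-quantized Constants: no conversion needed, read directly.
    if (backing_.getElementType() == ElemKind::Int8QTy) {
      return backing_.getHandle<int8_t>().raw(i);
    }
    // Float-backed Constants: quantize the element on the fly.
    float q = std::round(backing_.getHandle<float>().raw(i) / scale_) + offset_;
    return static_cast<int8_t>(std::max(-128.0f, std::min(127.0f, q)));
  }
};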

Another option could be to keep the float Tensors around in case they are needed (e.g. in a map from Constants to their original float Tensors, or as a backup second Tensor in the Constant itself, etc.), and then free them after graph optimizations are complete. However this would mean the memory footprint during graph optimizations stays at peak for longer, as we would keep both float and quantized tensors until the end of optimizations. (Note peak memory usage is still about the same, as we currently have all float and quantized Tensors existing in parallel until DCE reaps the unused float Tensors.)
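A small sketch of this "keep the float tensors around" variant; the class and method names are invented for illustration and not proposed Glow API.

#include <unordered_map>

// Sketch only: side storage for the original float payloads, kept alive for
// the duration of graph optimizations and released afterwards.
class FloatPayloadBackup {
  std::unordered_map<const Constant *, Tensor> originals_;

public:
  // Record the float payload just before a Constant is quantized in place.
  void remember(const Constant *C, Tensor &&floatPayload) {
    originals_.emplace(C, std::move(floatPayload));
  }

  // Later optimizations that need to requantize C with a new scale/offset
  // can still reach the exact float data here.
  const Tensor *lookup(const Constant *C) const {
    auto it = originals_.find(C);
    return it == originals_.end() ? nullptr : &it->second;
  }

  // Called once graph optimizations finish, so peak memory does not stay
  // elevated longer than necessary.
  void release() { originals_.clear(); }
};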

I don't believe either of these solutions would be super big infrastructure changes, so I personally prefer them to adding new logic to the graph optimizer.

beicy commented 5 years ago

@tlepley-cadence "This is the reason why in PR #2166, I've kept the Quantize nodes of Constant until after post-lowering" -- But as I mentioned, for some backends the weights/bias must be a Constant before post-lowering; otherwise there will be an assert error. The idea of the proposal is to make sure the Quantize(Constant)->Constant optimization happens before post-lowering.

ayermolo commented 5 years ago

Isn't that kind of a dangerous assumption? It relies on the fact that a specific optimization is run. Omitting an optimization pass shouldn't result in incorrect behavior or asserts.

beicy commented 5 years ago

But this optimization is under our control, and backends can benefit from it.

ayermolo commented 5 years ago

Not sure what you mean by under control. Of course backends will benefit from it, but they shouldn't break with an assert error if it's omitted.

beicy commented 5 years ago

In general, optimizations should be independent. However, due to performance and some other restrictions, we need to make some assumptions.

nickgg commented 5 years ago

@jfix71 @tlepley-cadence is this still an issue?

jfix71 commented 5 years ago

Yes, I should hopefully be able to put up a fix for this sometime soon.

Related to/unblocks part of https://github.com/pytorch/glow/issues/2481#issuecomment-470343590

tlepley-cadence commented 5 years ago

Sorry, I missed the message. Yes, I confirm this is still an issue.