tlepley-cadence opened this issue 5 years ago
If we delay this "Quantize(Constant)->Constant" optimization until after post-lowering, some backends may be affected. For example, after double-checking with @opti-mix, some backends require the weights/bias of a conv node to be Constants. But if we leave this "Quantize(Constant)->Constant" opt until after that point, the weights/bias would be QuantizeNodes, which would cause a verification issue later.
To address the accuracy loss issue, @opti-mix proposes the following approach:
```cpp
void ExecutionEngine::optimizeFunction(CompilationMode mode, Function *F) {
  // Verify the function pre-optimization/lowering.
  assert(F->verify() && "Function must be valid");

  // ***TODO*** Add a function hooked to the backend to allow the backend to
  // do some special optimization based on the original graph. In this case,
  // we can allow backends to adjust the scale/offset if they want.
  backend_->preOptimization(F, mode); // <-- Added

  // Optimize the graph.
  ::glow::optimize(F, mode);
  ...
}
```
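To make the shape of that hook concrete, here is a minimal self-contained sketch. The types are toy stand-ins and `preOptimization` is only the proposed hook, not an existing Glow API. The default implementation is a no-op, so current backends are unaffected; a backend that needs particular scale/offsets can override it before `glow::optimize()` folds `Quantize(Constant)` into a Constant:

```cpp
#include <cassert>
#include <cmath>

// Toy stand-ins for Glow types; names are illustrative, not the real API.
struct Function {
  float biasScale = 0.1f; // pretend quantization parameter on the graph
};
enum class CompilationMode { Infer, Train };

// Hypothetical hook: the default implementation does nothing, so existing
// backends keep their current behavior.
struct Backend {
  virtual ~Backend() = default;
  virtual void preOptimization(Function *F, CompilationMode) { (void)F; }
};

struct MyBackend : Backend {
  // Example adjustment: snap the scale to a power of two, which some
  // fixed-point backends prefer.
  void preOptimization(Function *F, CompilationMode) override {
    F->biasScale = std::exp2(std::round(std::log2(F->biasScale)));
  }
};
```

The point of the default no-op is exactly the one made above: backends that don't implement the hook see no change in the pipeline.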
So, we don't have to separate the quantize opt from our optimization function, and our current backends won't be affected.
The source of the issue is that when we quantize a float Constant, we quantize the underlying data immediately and also change the output type to that of the Quantize node we are merging in.
I wonder if we could instead keep the underlying data as float, and just change the output type. Then at the end of compilation we would have a pass across the Constants to quantize the underlying data using the final scale/offset. In cases where we access the data during optimizations (e.g. deduplicating Constants, merging batch normalization into Convs, ...), we could change the accessors there to quantize the floating-point data on the fly.
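A minimal sketch of what that end-of-compilation pass could do per Constant, assuming standard affine int8 quantization (the function name and container types are made up for illustration; the real pass would operate on `Tensor` payloads):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Deferred conversion sketch: the Constant payload stays float during
// optimization; only at the end of compilation is it materialized as int8
// using the *final* scale/offset.
std::vector<int8_t> quantizePayload(const std::vector<float> &data,
                                    float scale, int32_t offset) {
  std::vector<int8_t> out;
  out.reserve(data.size());
  for (float v : data) {
    // Standard affine quantization: q = clip(round(v / scale) + offset).
    int32_t q = static_cast<int32_t>(std::lround(v / scale)) + offset;
    q = std::min(127, std::max(-128, q));
    out.push_back(static_cast<int8_t>(q));
  }
  return out;
}
```

Because this runs once, after all scale/offset changes have settled, the data is rounded exactly one time.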
@beicy I don't think this would work.
Quantization inserts Quantize/Dequantize nodes (basically one Quantize per node input and one Dequantize per node output) and some RescaleQuantized nodes. I think it's important to call glow::optimize() immediately after quantization, because otherwise the IR is not in a convenient state for being processed by the backend. Constant scale/offsets can still change after that: for instance, RescaleQuantized nodes can be created when lowering to a generic or target-specific node that has different scale/offset constraints between operands. These new RescaleQuantized nodes then get merged with other nodes, which can impact the Constant operands of these nodes. This is the reason why in PR #2166, I've kept the Quantize nodes of Constants until after post-lowering, in order to make sure we get the final scales/offsets before performing the actual conversion of constants from float to integer.
@jfix71 I like the idea in principle, but in terms of implementation, I have the feeling that it could open a can of worms. As I see it, the idea would be to keep the payload of a Constant as float while its type would be a quantized type. Accessing the constant payload would require the Glow infrastructure to quantize (when reading) and dequantize (when writing) on the fly.

I see some complexity and potential issues, and I may be missing others. For instance: the payload of a Constant node is a Tensor object. So it's not just about Constant, it's about the Tensor interface, which is a more global change. The existing direct access API (with getUnsafePtr()) is not compatible with an automatic translation. Instead, we would need something like OpenCL map/unmap, or a specific smart-pointer mechanism that would bring its own risk of programming mistakes (because the actual change of the tensor values would be deferred).

In the end, there are currently not that many optimizations in Glow that touch Constants. This is why I took the approach of modifying optimizations, rather than trying to change the infrastructure to get something more generic immediately. I have the feeling that it is a pragmatic approach to solve the quantization accuracy problem in the short term. The deeper infrastructure change could be done in a second step.
@tlepley Yeah, changing the Tensor is definitely not ideal. I was thinking of something like a ConstantHandle subclass of Handle, which would compare the ElemTys of the backing Tensor and the ConstantHandle and do a conversion (e.g. float->quantized) only if necessary (so pre-quantized Constants wouldn't need to be converted to float). Though this approach could be a bit error prone, as you mention, since accessing a Constant's payload Tensor via getUnsafePtr() or normal Handles would be disallowed during graph optimizations.
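For illustration, here is a stripped-down sketch of such an on-read-converting handle, using a plain `std::vector` in place of the real `Tensor`/`Handle` classes (all names here are hypothetical, not the Glow API):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the ConstantHandle idea: the backing storage stays float during
// optimization, but the handle's element type is already quantized, so reads
// convert on the fly. A pre-quantized payload would be returned as-is; only
// the float-backed case needs conversion.
class QuantizedReadHandle {
  const std::vector<float> &backing_; // float payload kept during optimization
  float scale_;
  int32_t offset_;

public:
  QuantizedReadHandle(const std::vector<float> &backing, float scale,
                      int32_t offset)
      : backing_(backing), scale_(scale), offset_(offset) {}

  // Read element i as the quantized value the final Constant would hold.
  int8_t at(size_t i) const {
    long q = std::lround(backing_[i] / scale_) + offset_;
    if (q > 127) q = 127;
    if (q < -128) q = -128;
    return static_cast<int8_t>(q);
  }
};
```

A write path would need the inverse (dequantize on store), which is where the deferred-update risks mentioned earlier come in.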
Another option could be to keep the float Tensors around in case they are needed (e.g. in a map from Constants to their original float Tensors, or as a backup second Tensor in the Constant itself, etc.), and then free them after graph optimizations are complete. However this would mean the memory footprint during graph optimizations stays at peak for longer, as we would keep both float and quantized tensors until the end of optimizations. (Note peak memory usage is still about the same, as we currently have all float and quantized Tensors existing in parallel until DCE reaps the unused float Tensors.)
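A sketch of that side-map variant, with illustrative stand-in types rather than the Glow API:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Stand-in for a quantized Constant; only the payload matters here.
struct Constant {
  std::vector<int8_t> quantized;
};

// Keeps the original float tensors alive during graph optimizations so any
// scale/offset change can requantize losslessly from float; releaseAll()
// frees them once optimizations complete, ending the period where both the
// float and quantized copies coexist.
class FloatBackupMap {
  std::map<const Constant *, std::vector<float>> backups_;

public:
  void remember(const Constant *c, std::vector<float> floats) {
    backups_[c] = std::move(floats);
  }
  bool hasBackup(const Constant *c) const { return backups_.count(c) != 0; }
  const std::vector<float> &getBackup(const Constant *c) const {
    return backups_.at(c);
  }
  // Called once graph optimizations are done.
  void releaseAll() { backups_.clear(); }
};
```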
I don't believe either of these solutions would be super big infrastructure changes, so I personally prefer them to adding new logic to the graph optimizer.
@tlepley-cadence "This is the reason why in PR #2166, I've kept the Quantize nodes of Constant until after post-lowering" -- But as I mentioned, for some backends the weights/bias must be Constants before post-lowering; otherwise there will be an assert error. The proposal's idea is to make sure the Quantize(Constant)->Constant optimization happens before post-lowering.
Isn't that kind of a dangerous assumption? It relies on the fact that a specific optimization is run. Omitting an optimization pass shouldn't result in incorrect behavior or asserts.
But this optimization is under control, and backends can gain benefit from it.
Not sure what you mean by under control. Of course backends will benefit from it, but they shouldn't break with an assert error if it's omitted.
In general, the optimizations should be independent. However, due to the performance and some other restrictions, we need to make some assumptions.
@jfix71 @tlepley-cadence is this still an issue?
Yes, I should hopefully be able to put up a fix for this sometime soon.
Related to/unblocks part of https://github.com/pytorch/glow/issues/2481#issuecomment-470343590
Sorry, I missed the message. Yes, I confirm this is still an issue.
Currently, Constants are statically quantized early in the compilation flow. Basically, the Quantize(Constant)->Constant optim happens in the glow::optimize() function. There are multiple issues with this, potentially resulting in accuracy loss, because the scale/offset of Constants may still have to change during the Glow compilation flow. By keeping Quantize nodes longer in the IR, scale/offset can be changed freely without resulting in any accuracy loss for constants. In particular, the backend may change scale/offsets at post-lowering.

Just to illustrate with an example: the convolution bias is currently quantized to have a scale equal to (scale of input) * (scale of weights). We noticed that after post-lowering this equality was not necessarily true anymore, due to some lowering & optimization combinations. The bias then needs to be requantized, which can be done without loss only if the Quantize node is still in the IR. See: #2166
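To see the accuracy point numerically, here is a small sketch with made-up values: requantizing a bias that was already rounded at the old scale rounds twice, while quantizing from the original float at the new scale (possible as long as the Quantize node is still in the IR) rounds only once:

```cpp
#include <cmath>
#include <cstdint>

// Round a float value to its integer representation at a given scale.
int32_t quantize(float v, float scale) {
  return static_cast<int32_t>(std::lround(v / scale));
}

// Requantize an already-quantized value to a new scale. This introduces a
// second rounding step on top of the one done by quantize().
int32_t requantizeFromInt(int32_t q, float oldScale, float newScale) {
  return static_cast<int32_t>(std::lround(q * oldScale / newScale));
}
```

With a float bias of 1.0, an initial scale of 0.3 (our pretend s_in * s_w), and a post-lowering scale of 0.07, the int->int path yields 13 (representing 0.91), while quantizing directly from float yields 14 (representing 0.98): the double rounding loses precision.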
@beicy