tlepley-cadence opened this issue 5 years ago
If we delay this "Quantize(Constant)->Constant" optimization until after post-lowering, some backends may be affected. For example, after double-checking with @opti-mix, some backends require the weights/bias of a conv node to be Constants. But if we leave this "Quantize(Constant)->Constant" opt until after that point, the weights/bias would be QuantizeNodes, which would cause a verification issue later.
To address the accuracy loss issue, @opti-mix proposes the following approach:
```cpp
void ExecutionEngine::optimizeFunction(CompilationMode mode, Function *F) {
  // Verify the function pre-optimization/lowering.
  assert(F->verify() && "Function must be valid");

  // ***TODO*** Add a function hooked to the backend to allow the backend to
  // do some special optimization based on the original graph. In this case,
  // we can allow backends to adjust the scale/offset if they want.
  backend_->preOptimization(F, mode); // <-- Added

  // Optimize the graph.
  ::glow::optimize(F, mode);
  ...
}
```
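To make the shape of that hook concrete, here is a minimal self-contained sketch. The types are toy stand-ins and `preOptimization` is only the proposed hook, not an existing Glow API. The default implementation is a no-op, so current backends are unaffected; a backend that needs particular scale/offsets can override it before `glow::optimize()` folds `Quantize(Constant)` into a Constant:

```cpp
#include <cassert>
#include <cmath>

// Toy stand-ins for Glow types; names are illustrative, not the real API.
struct Function {
  float biasScale = 0.1f; // pretend quantization parameter on the graph
};
enum class CompilationMode { Infer, Train };

// Hypothetical hook: the default implementation does nothing, so existing
// backends keep their current behavior.
struct Backend {
  virtual ~Backend() = default;
  virtual void preOptimization(Function *F, CompilationMode) { (void)F; }
};

struct MyBackend : Backend {
  // Example adjustment: snap the scale to a power of two, which some
  // fixed-point backends prefer.
  void preOptimization(Function *F, CompilationMode) override {
    F->biasScale = std::exp2(std::round(std::log2(F->biasScale)));
  }
};
```

The point of the default no-op is exactly the one made above: backends that don't implement the hook see no change in the pipeline.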
So, we don't have to separate the quantize opt from our optimization function, and our current backends won't be affected.
The source of the issue is that when we quantize a float Constant, we quantize the underlying data immediately and also change the output type to that of the Quantize node we are merging in.
I wonder if we could instead keep the underlying data as float, and just change the output type. Then at the end of compilation we would have a pass across the Constants to quantize the underlying data using the final scale/offset. In cases where we access the data during optimizations (e.g. deduplicating Constants, merging batch normalization into Convs, ...), we could change the accessors there to quantize the floating-point data on the fly.
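A minimal sketch of what that end-of-compilation pass could do per Constant, assuming standard affine int8 quantization (the function name and container types are made up for illustration; the real pass would operate on `Tensor` payloads):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Deferred conversion sketch: the Constant payload stays float during
// optimization; only at the end of compilation is it materialized as int8
// using the *final* scale/offset.
std::vector<int8_t> quantizePayload(const std::vector<float> &data,
                                    float scale, int32_t offset) {
  std::vector<int8_t> out;
  out.reserve(data.size());
  for (float v : data) {
    // Standard affine quantization: q = clip(round(v / scale) + offset).
    int32_t q = static_cast<int32_t>(std::lround(v / scale)) + offset;
    q = std::min(127, std::max(-128, q));
    out.push_back(static_cast<int8_t>(q));
  }
  return out;
}
```

Because this runs once, after all scale/offset changes have settled, the data is rounded exactly one time.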
@beicy I don't think this would work.
Quantization inserts Quantize/Dequantize nodes (basically one Quantize per node input and one Dequantize per node output) and some RescaleQuantized nodes. I think it's important to call glow::optimize() immediately after quantization, because otherwise the IR is not in a convenient state for being processed by the backend. Constant scale/offsets can still change after that: for instance, RescaleQuantized nodes can be created when lowering to a generic or target-specific node that has different scale/offset constraints between operands. These new RescaleQuantized nodes then get merged with other nodes, which can impact the Constant operands of these nodes. This is the reason why in PR #2166, I've kept the Quantize nodes of Constants until after post-lowering, in order to make sure we get the final scales/offsets before performing the actual conversion of constants from float to integer.
@jfix71 I like the idea in principle, but in terms of implementation, I have the feeling that it could open a can of worms. As I see it, the idea would be to keep the payload of a Constant as float while its type would be a quantized type. Accessing the constant payload would require the Glow infrastructure to quantize (when reading) and dequantize (when writing) on the fly.

I see some complexity and potential issues, and I may be missing others. For instance: the payload of a Constant node is a Tensor object. So it's not just about Constant, it's about the Tensor interface, which is a more global change. The existing direct access API (with getUnsafePtr()) is not compatible with an automatic translation. Instead, we would need something like OpenCL map/unmap, or a specific smart-pointer mechanism that would bring its own risk of programming mistakes (because the actual change of the tensor values would be deferred).

In the end, there are currently not that many optimizations in Glow that touch Constants. This is why I took the approach of modifying optimizations, rather than trying to change the infrastructure to get something more generic immediately. I have the feeling that it is a pragmatic approach to solve the quantization accuracy problem in the short term. The deeper infrastructure change could be done in a second step.
@tlepley Yeah, changing the Tensor is definitely not ideal. I was thinking of something like a ConstantHandle subclass of Handle, which would compare the ElemTys of the backing Tensor and the ConstantHandle and do a conversion (e.g. float->quantized) only if necessary (so pre-quantized Constants wouldn't need to be converted to float). Though this approach could be a bit error prone, as you mention, since accessing a Constant's payload Tensor via getUnsafePtr() or normal Handles would be disallowed during graph optimizations.
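For illustration, here is a stripped-down sketch of such an on-read-converting handle, using a plain `std::vector` in place of the real `Tensor`/`Handle` classes (all names here are hypothetical, not the Glow API):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the ConstantHandle idea: the backing storage stays float during
// optimization, but the handle's element type is already quantized, so reads
// convert on the fly. A pre-quantized payload would be returned as-is; only
// the float-backed case needs conversion.
class QuantizedReadHandle {
  const std::vector<float> &backing_; // float payload kept during optimization
  float scale_;
  int32_t offset_;

public:
  QuantizedReadHandle(const std::vector<float> &backing, float scale,
                      int32_t offset)
      : backing_(backing), scale_(scale), offset_(offset) {}

  // Read element i as the quantized value the final Constant would hold.
  int8_t at(size_t i) const {
    long q = std::lround(backing_[i] / scale_) + offset_;
    if (q > 127) q = 127;
    if (q < -128) q = -128;
    return static_cast<int8_t>(q);
  }
};
```

A write path would need the inverse (dequantize on store), which is where the deferred-update risks mentioned earlier come in.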
Another option could be to keep the float Tensors around in case they are needed (e.g. in a map from Constants to their original float Tensors, or as a backup second Tensor in the Constant itself, etc.), and then free them after graph optimizations are complete. However this would mean the memory footprint during graph optimizations stays at peak for longer, as we would keep both float and quantized tensors until the end of optimizations. (Note peak memory usage is still about the same, as we currently have all float and quantized Tensors existing in parallel until DCE reaps the unused float Tensors.)
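A sketch of that side-map variant, with illustrative stand-in types rather than the Glow API:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Stand-in for a quantized Constant; only the payload matters here.
struct Constant {
  std::vector<int8_t> quantized;
};

// Keeps the original float tensors alive during graph optimizations so any
// scale/offset change can requantize losslessly from float; releaseAll()
// frees them once optimizations complete, ending the period where both the
// float and quantized copies coexist.
class FloatBackupMap {
  std::map<const Constant *, std::vector<float>> backups_;

public:
  void remember(const Constant *c, std::vector<float> floats) {
    backups_[c] = std::move(floats);
  }
  bool hasBackup(const Constant *c) const { return backups_.count(c) != 0; }
  const std::vector<float> &getBackup(const Constant *c) const {
    return backups_.at(c);
  }
  // Called once graph optimizations are done.
  void releaseAll() { backups_.clear(); }
};
```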
I don't believe either of these solutions would be super big infrastructure changes, so I personally prefer them to adding new logic to the graph optimizer.
@tlepley-cadence "This is the reason why in PR #2166, I've kept the Quantize nodes of Constant until after post-lowering" -- But as I mentioned, for some backends the weights/bias must be Constants before post-lowering; otherwise there will be an assert error. The proposal's idea is to make sure the Quantize(Constant)->Constant optimization happens before post-lowering.
Isn't that kind of a dangerous assumption? It relies on the fact that a specific optimization is run. Omitting an optimization pass shouldn't result in incorrect behavior or asserts.
But this optimization is under control, and backends can gain benefit from it.
Not sure what you mean by under control. Of course backends will benefit from it, but they shouldn't break with an assert error if it's omitted.
In general, the optimizations should be independent. However, due to the performance and some other restrictions, we need to make some assumptions.
@jfix71 @tlepley-cadence is this still an issue?
Yes, I should hopefully be able to put up a fix for this sometime soon.
Related to/unblocks part of https://github.com/pytorch/glow/issues/2481#issuecomment-470343590
Sorry, I missed the message. Yes, I confirm this is still an issue.
Currently, Constants are statically quantized early in the compilation flow. Basically, the Quantize(Constant)->Constant optim happens in the glow::optimize() function. There are multiple issues with this, potentially resulting in accuracy loss, because the scale/offset of Constants may still have to change during the Glow compilation flow. By keeping Quantize nodes longer in the IR, scale/offset can be changed freely without resulting in any accuracy loss for constants. In particular, the backend may change scale/offsets at post-lowering.

Just to illustrate with an example: the convolution bias is currently quantized to have a scale equal to (scale of input) * (scale of weights). We noticed that after post-lowering this equality was not necessarily true anymore, due to some lowering & optimization combinations. The bias then needs to be requantized, which can be done without loss only if the Quantize node is still in the IR. See: #2166
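To see the accuracy point numerically, here is a small sketch with made-up values: requantizing a bias that was already rounded at the old scale rounds twice, while quantizing from the original float at the new scale (possible as long as the Quantize node is still in the IR) rounds only once:

```cpp
#include <cmath>
#include <cstdint>

// Round a float value to its integer representation at a given scale.
int32_t quantize(float v, float scale) {
  return static_cast<int32_t>(std::lround(v / scale));
}

// Requantize an already-quantized value to a new scale. This introduces a
// second rounding step on top of the one done by quantize().
int32_t requantizeFromInt(int32_t q, float oldScale, float newScale) {
  return static_cast<int32_t>(std::lround(q * oldScale / newScale));
}
```

With a float bias of 1.0, an initial scale of 0.3 (our pretend s_in * s_w), and a post-lowering scale of 0.07, the int->int path yields 13 (representing 0.91), while quantizing directly from float yields 14 (representing 0.98): the double rounding loses precision.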
@beicy