sony / ai-research-code


Question about Mixed Precision DNNs #60

Closed PuNeal closed 1 year ago

PuNeal commented 2 years ago

hello, I have a question about the implementation of CASE U3: why is the formula for calculating the bitwidth different between weights and activations? https://github.com/sony/ai-research-code/blob/master/mixed-precision-dnns/train_resnet.py#L174 https://github.com/sony/ai-research-code/blob/master/mixed-precision-dnns/train_resnet.py#L264

Hoping for a reply. Thanks!

FabienCardinaux commented 1 year ago

Hello, thank you for your interest. The two lines that you mention actually refer to two different setups: fixed-point quantization parametrized by d and xmax for L174, and pow2 quantization parametrized by x_min and x_max for L264.
The comparison should be between L174 and L252, which use the same setup. The only difference is the +1 in the weight computation: it corresponds to the sign bit, which is needed to represent the sign of the weights. Activations, on the other hand, do not require a sign bit when a ReLU activation function is used.
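
For reference, here is a rough sketch of that relationship (not copied from train_resnet.py; the exact expressions are at the linked lines): with a fixed-point quantizer of step size d and clipping level xmax, the bitwidth is essentially the number of bits needed to cover xmax/d quantization steps, plus one sign bit for the weights.

    import numpy as np

    # Illustrative only: bitwidth of a fixed-point quantizer with step size d
    # and clipping level xmax. Weights (signed=True) need one extra sign bit;
    # ReLU activations are non-negative and can omit it.
    def fixedpoint_bitwidth(xmax, d, signed):
        bits = np.ceil(np.log2(xmax / d + 1.0))  # bits to cover [0, xmax]
        return bits + 1 if signed else bits

For example, with xmax = 1.0 and d = 1.0 / 127, this gives 8 bits for a weight tensor and 7 bits for the corresponding activation.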

Hope this answers your question.

PuNeal commented 1 year ago

Got it, thank you!

PuNeal commented 1 year ago

Hi, I'm confused about the backward pass during training, shown in the code below:

    xmax = clip_scalar(xmax, xmax_min, xmax_max)

    # compute min/max value that we can represent
    if sign:
        xmin = -xmax
    else:
        xmin = nn.Variable((1,), need_grad=False)
        xmin.d = 0.

    # broadcast variables to correct size
    d = broadcast_scalar(d, shape=x.shape)
    xmin = broadcast_scalar(xmin, shape=x.shape)
    xmax = broadcast_scalar(xmax, shape=x.shape)

    # apply fixed-point quantization
    return d * F.round(F.clip_by_value(x, xmin, xmax) / d)

xmax is a learnable parameter and is used to clamp x. How is its gradient computed through F.clip_by_value? Or is the gradient of xmax only produced by the compression penalty on the weights/activations? Thank you.

StefanUhlich-sony commented 1 year ago

Hello @PuNeal, the gradients are backpropagated through both: F.clip_by_value inside the quantizer and the weight/activation size penalty in the loss function.
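
To illustrate the second path with a hypothetical sketch (the names and the exact penalty below are assumptions, not the repo's code): if the loss contains a term proportional to a tensor's size in bits, and the bit count is written as a differentiable function of xmax and d, then gradients reach xmax and d directly from that term.

    import numpy as np
    import nnabla.functions as F

    # Hypothetical sketch: a differentiable size penalty for one tensor,
    # expressed through the continuous bit count of its quantizer.
    def size_penalty(xmax, d, num_elements, signed=True):
        bits = F.log(xmax / d + 1.0) / np.log(2.0)  # continuous bit count
        if signed:
            bits = bits + 1.0                       # sign bit for weights
        return num_elements * bits                  # scaled and added to the loss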

For F.clip_by_value, the gradients are backpropagated according to the value of x, as the function is defined here: https://github.com/sony/nnabla/blob/7e9e97023ca89bf2056d7b7310c15a050ca438b6/python/src/nnabla/functions.py#L695-L729
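
To make this concrete, here is a small self-contained check (illustrative values, not from the repo) of how F.clip_by_value routes gradients either to x or to the learnable bound, depending on whether x falls inside the clipping range:

    import numpy as np
    import nnabla as nn
    import nnabla.functions as F

    x = nn.Variable((4,), need_grad=True)
    xmax = nn.Variable((4,), need_grad=True)
    x.d = np.array([-2.0, 0.5, 1.5, 3.0])
    xmax.d = np.full(4, 1.0)
    xmin = -xmax                      # lower bound tied to the learnable xmax

    y = F.clip_by_value(x, xmin, xmax)
    y.forward()
    x.grad.zero()
    xmax.grad.zero()
    y.backward()

    print(x.g)     # expected [ 0.  1.  0.  0.]  -> gradient reaches x only inside the range
    print(xmax.g)  # expected [-1.  0.  1.  1.]  -> gradient reaches the bound where x is clipped

So in addition to the size penalty, xmax also receives gradients from the task loss whenever the corresponding weights or activations are actually clipped.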

PuNeal commented 1 year ago

@TE-StefanUhlich Thanks for your reply.