How can I achieve stochastic rounding noise during inference time?

Hi, I want to measure the quantization noise at inference time when the rounding mode is set to 'stochastic'. However, I am having a hard time producing such stochastic noise: the outputs of my network and of its individual layers all seem to be constant across repeated forward passes. Any idea what I can do?

Here is a minimal code example:

import torch
from aimet_torch.quantsim import QuantizationSimModel
from aimet_common.defs import QuantScheme

model = torch.nn.Sequential(torch.nn.Linear(10, 4), torch.nn.ReLU(), torch.nn.Linear(4, 1))
dummy_input = torch.rand(10)
quantsim = QuantizationSimModel(model=model,
                                quant_scheme=QuantScheme.training_range_learning_with_tf_enhanced_init,
                                dummy_input=dummy_input, rounding_mode='stochastic',
                                default_output_bw=4, default_param_bw=4, in_place=False)


def evaluate(model, args):
    # Just some meaningless eval function so that encodings can be computed
    total = 0
    for _ in range(10):
        total += model(torch.rand(10))
    return total - 1


quantsim.compute_encodings(forward_pass_callback=evaluate,
                           forward_pass_callback_args={})

# Run the same input through the quantized model several times
quantsim.model.eval()
outputs = torch.stack([quantsim.model(dummy_input) for _ in range(10)])
print(outputs)  # I would have wanted these to be non-constant

I did notice that with a different quantization scheme (e.g. post_training_tf) and without calling quantsim.model.eval(), I get the non-constant outputs I want. However, I am actually interested in getting this stochastic behavior for a model that has been trained using QAT.
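
For reference, the kind of noise in question can be sketched in plain PyTorch. This is only an illustration of stochastic rounding in general, not AIMET's implementation: the helper name `stochastic_round_quantize`, the fixed scale, and the signed 4-bit grid are all assumptions made for the example.

```python
import torch


def stochastic_round_quantize(x, scale, bitwidth=4):
    """Hypothetical helper: fake-quantize x onto a signed integer grid
    using stochastic rounding (not taken from AIMET)."""
    qmin = -(2 ** (bitwidth - 1))
    qmax = 2 ** (bitwidth - 1) - 1
    scaled = x / scale
    lower = torch.floor(scaled)
    # Round up with probability equal to the fractional part,
    # so that the rounded value equals `scaled` in expectation.
    round_up = torch.rand_like(scaled) < (scaled - lower)
    q = torch.clamp(lower + round_up.to(scaled.dtype), qmin, qmax)
    return q * scale


torch.manual_seed(0)
x = torch.full((10000,), 0.3)       # the same value, quantized many times
samples = stochastic_round_quantize(x, scale=1.0)
noise = samples - x                 # per-sample quantization noise

print(samples.unique())             # both neighbouring grid points show up
print(noise.mean(), noise.std())    # mean is close to zero, std is not
```

Repeated calls on the same input give different outputs here, which is the behavior I would like to observe from the quantsim model after QAT.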

Kind regards,

Winfried
