openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: Wrong Softplus AC for values smaller than -10 #23673

Open cold-blue opened 5 months ago

cold-blue commented 5 months ago

OpenVINO Version

2024.0.0-14509-34caeefd078-releases/2024/0

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

CPU

Framework

PyTorch

Model used

N/A

Issue description

The outputs of a recurrent model run through OpenVINO diverge from PyTorch after 10+ iterations due to cumulative errors. The relative differences can be reproduced with the short reproducer model below. The problem might be related to OV Softplus being implemented differently from Torch Softplus. In my model, the softplus output is multiplied by a large number and the result is passed to the next iteration step, so the error is amplified and accumulates.
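For context, here is a minimal sketch of why a sub-1e-6 absolute error in softplus matters in this setting. The scale factor and step count are hypothetical, not taken from the model; the two softplus values are the accurate result for -29.1 and the CPU value observed in the logs below.

import numpy as np

true_softplus = 2.3e-13   # softplus(-29.1) computed accurately
cpu_softplus = -1.9e-6    # value observed from the OpenVINO CPU plugin (see logs below)
scale = 1e4               # hypothetical large coefficient the softplus output is multiplied by

state_true, state_cpu = 0.0, 0.0
for step in range(10):
    state_true += scale * true_softplus   # stays around 2.3e-8 after 10 steps
    state_cpu += scale * cpu_softplus     # drifts to around -0.19 after 10 steps

print(state_true, state_cpu)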

Step-by-step reproduction

The following script reproduces the issue:

import numpy as np
import torch
import openvino as ov


class TestModel(torch.nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)

    def forward(self, x):
        out = torch.nn.functional.softplus(x)
        # Mathematically equivalent formulations that were also tried;
        # the last uncommented assignment is the one actually returned.
        # out = torch.where(x > 20, x, torch.log(1. + (1. / torch.exp(-x))))
        # out = torch.log(1 + torch.exp(x))
        out = torch.log1p(torch.exp(x))
        # out = torch.log((1 / torch.exp(-x)) + 1)
        return out


input_data = np.array([[-29.100761, -25.968552, -25.956476, -25.350357, -23.834906,
                        -23.60192, -20.378632, -20.233223, -20.217833, -20.202318, -10.]])

torch_model = TestModel()
ov_model = ov.convert_model(torch_model, example_input=[input_data])
ov_compiled_model = ov.compile_model(ov_model, device_name="CPU")

torch_output = torch_model(torch.tensor(input_data)).numpy()
ov_output = ov_compiled_model([input_data])[0]
diffs = np.abs(ov_output - torch_output) / np.abs(torch_output)

print("torch out: ", torch_output)
print("ov out:    ", ov_output)
print("relative diffs: ", diffs)

Relevant log output

torch out:  [[2.29985300e-13 5.27231274e-12 5.33636717e-12 9.78317542e-12 4.45278625e-11 5.62103619e-11 1.41147299e-09 1.63238614e-09 1.65770288e-09 1.68362269e-09]]

ov out:     [[-1.90734863e-06 -1.90734863e-06 -1.90734863e-06 -1.90734863e-06 -1.90734863e-06 -1.90734863e-06 -1.90734863e-06 -1.90734863e-06 -1.90734863e-06 -1.90734863e-06]]

relative diffs:  [[8.29335119e+06 3.61767975e+05 3.57425550e+05 1.94963121e+05 4.28359471e+04 3.39333315e+04 1.35231784e+03 1.16944206e+03 1.15159741e+03 1.13388366e+03]]


mvafin commented 5 months ago

It looks like Softplus in our opset and in torch behave differently, but the difference is below the 1e-4 threshold we usually use. @cold-blue do you have a model where this matters?

cold-blue commented 5 months ago

> It looks like Softplus in our opset and in torch behave differently, but the difference is below the 1e-4 threshold we usually use. @cold-blue do you have a model where this matters?

I am trying to convert the Mamba LLM into an OpenVINO model and ran into this problem. [image: screenshot of the Mamba selective-scan update] As you can see, after softplus, dt is multiplied by A (large values, around 10^4), and the updated ssm_state is then passed to the next iteration, so the error accumulates. This model is important for LLM deployment on client devices, so please help me solve this. Thank you very much!
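For readers without the screenshot, the update described above has roughly the following shape. This is a simplified sketch with placeholder names and shapes, not the actual Mamba implementation:

import torch

def ssm_step(ssm_state, x_t, dt_raw, A, B):
    # Simplified selective-scan style state update (placeholder names, not the real Mamba code).
    dt = torch.nn.functional.softplus(dt_raw)   # any softplus inaccuracy starts here
    dA = torch.exp(dt * A)                      # dt is multiplied by the large A, amplifying the error
    ssm_state = dA * ssm_state + dt * B * x_t   # the perturbed state feeds the next iteration
    return ssm_state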

cold-blue commented 5 months ago

After further debugging, this error can be worked around by replacing softplus with "torch.log(torch.exp(x).unsqueeze(0) + 1).squeeze(0)". My data range is (-29, 23), and the error may not be caused by the specific numbers I provided but by other, as yet unidentified, ones.
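As a drop-in form, the workaround amounts to the sketch below. The function name is mine; the extra unsqueeze/squeeze presumably just keeps the converter from lowering the expression to the OV SoftPlus op:

import torch

def softplus_workaround(x: torch.Tensor) -> torch.Tensor:
    # log(1 + exp(x)), written so that (presumably) the PyTorch frontend does not map it to SoftPlus.
    # Note: exp(x) can overflow for large positive x; fine for the (-29, 23) range here.
    return torch.log(torch.exp(x).unsqueeze(0) + 1).squeeze(0)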

mvafin commented 4 months ago

@cold-blue It would be better if the operation returned 0. The GPU seems to behave as expected: it returns 0 for softplus. I will assign the CPU team to look into this issue.

wenjiew commented 4 months ago

@mlukasze Could someone on your team take a look at whether this is something we can address?

mlukasze commented 4 months ago

yep, we will take a look