microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Quant Tool] Flaky test due to Pad reflect bug #22798

Closed: adrianlizarraga closed this 1 week ago

adrianlizarraga commented 1 week ago

Description

Fixes a unit test that failed intermittently because of an existing bug in Pad (reflect mode). When the number of padded values is >= the inner dimension size, the ORT Pad implementation accesses invalid memory. This PR changes the test so that the number of padding values is less than the inner dimension size, which avoids triggering the bug.
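
As an illustration of the trigger condition described above, here is a minimal sketch of a check a test could use to stay clear of the bug. The helper names `inner_axis_pads` and `may_trigger_reflect_bug` are hypothetical and are not part of ORT or of this PR. The sketch assumes the ONNX pads layout `[x1_begin, ..., xn_begin, x1_end, ..., xn_end]` with no `axes` input, and it assumes the condition applies to each side of the inner axis separately, which the description does not state explicitly.

```python
def inner_axis_pads(pads):
    """Return (begin, end) pad counts for the innermost axis.

    Assumes the ONNX layout [x1_begin, ..., xn_begin, x1_end, ..., xn_end].
    """
    rank = len(pads) // 2
    return pads[rank - 1], pads[2 * rank - 1]


def may_trigger_reflect_bug(shape, pads):
    """Hypothetical check: True if either inner-axis pad count is >= the inner dim size."""
    begin, end = inner_axis_pads(pads)
    inner_dim = shape[-1]
    # Assumption: each side is compared on its own, not the sum of both sides.
    return begin >= inner_dim or end >= inner_dim
```

For example, a 3x2 input with `pads = [0, 2, 0, 0]` pads the inner axis by 2 at the beginning, which equals the inner dimension size, so the check returns True. With `pads = [0, 1, 0, 0]` it returns False.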

Motivation and Context

See related issues:
https://github.com/microsoft/onnxruntime/issues/8265
https://github.com/microsoft/onnxruntime/issues/11828
https://github.com/microsoft/onnxruntime/issues/20801

Here is a valgrind trace obtained on a Linux machine (with sess_options.enable_cpu_mem_arena = False):

==864228== Invalid read of size 4
==864228==    at 0x2716272A: void onnxruntime::PadInnermostAxis<unsigned int>(unsigned int*, unsigned int*, long, unsigned long) (pad.cc:370)
==864228==    by 0x2715D213: onnxruntime::common::Status onnxruntime::PadImpl<unsigned int>(onnxruntime::OpKernelContext*, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, onnxruntime::Mode const&, unsigned int) (pad.cc:551)
==864228==    by 0x2715B2BB: onnxruntime::Pad::Compute(onnxruntime::OpKernelContext*) const (pad.cc:725)
==864228==    by 0x276FF6A7: onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) (sequential_executor.cc:484)
==864228==    by 0x276F4A04: onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (execution_steps.cc:73)
...

The trace above was obtained with the basic Pad(reflect) example from the ONNX Pad operator spec page:

data = [
    [1.0, 1.2],
    [2.3, 3.4],
    [4.5, 5.7],
]

pads = [0, 2, 0, 0]

mode = 'reflect'

# Expected output per the ONNX spec
expected_output = [
    [1.0, 1.2, 1.0, 1.2],
    [2.3, 3.4, 2.3, 3.4],
    [4.5, 5.7, 4.5, 5.7],
]

# Buggy output from onnxruntime: the first element of each row in the inner dimension is invalid/uninitialized.
# The invalid value may be 0.0, inf, nan, etc.; inf is shown here as one example.
inf = float("inf")
ort_output = [
    [inf, 1.2, 1.0, 1.2],
    [inf, 3.4, 2.3, 3.4],
    [inf, 5.7, 4.5, 5.7],
]
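
To make the expected values above easier to verify by hand, here is a small pure-Python sketch of reflect padding along the innermost axis. It is not the ORT implementation: `reflect_index` and `pad_reflect_innermost` are hypothetical helper names. The sketch assumes that reflection mirrors around the edge elements without repeating them, and that this continues periodically when the pad count reaches or exceeds the inner dimension size.

```python
def reflect_index(i, n):
    """Map a possibly out-of-range index i onto [0, n) by mirroring without repeating the edges."""
    if n == 1:
        # Assumption: on a single-element axis every index maps to that element.
        return 0
    period = 2 * (n - 1)
    i %= period  # Python's modulo is non-negative for a positive period
    return i if i < n else period - i


def pad_reflect_innermost(rows, begin, end):
    """Reflect-pad each row of a 2-D list along its last axis."""
    padded = []
    for row in rows:
        n = len(row)
        padded.append([row[reflect_index(i, n)] for i in range(-begin, n + end)])
    return padded


data = [[1.0, 1.2], [2.3, 3.4], [4.5, 5.7]]
print(pad_reflect_innermost(data, begin=2, end=0))
# prints [[1.0, 1.2, 1.0, 1.2], [2.3, 3.4, 2.3, 3.4], [4.5, 5.7, 4.5, 5.7]]
```

Every output element is read through an index that is first mapped back into the valid range, so no position is left unwritten or read out of bounds, even when the pad count equals the inner dimension size as it does here.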