tenstorrent / tt-mlir

Tenstorrent MLIR compiler
https://tenstorrent.github.io/tt-mlir/
Apache License 2.0

ttrt should return non-zero exit status when binary fails #465

Open kmabeeTT opened 3 weeks ago

kmabeeTT commented 3 weeks ago

Small request: I'm on a few-days-old commit (150e466, from 8/18), and it looks like "ttrt run" returns a passing (0) exit status even when the "inputs and outputs do not match in binary" error is encountered. That makes it hard to run a test in a loop to detect failures. Could the error be propagated so that a non-zero exit code is returned to the caller?

Example:

% ./build/bin/ttmlir-opt --ttir-load-system-desc="path=n300.ttsys" --ttir-implicit-device --ttir-allocate --convert-ttir-to-ttmetal --ttmetal-serialize-to-binary="output=out.ttm" test/ttmlir/Dialect/TTMetal/to_layout.mlir
% ttrt run --identity --atol 1e-02 out.ttm

<snip>

2024-08-21 18:03:13,329 - DEBUG - evaluating program=0 for binary=out.ttm
2024-08-21 18:03:13,330 - DEBUG - generating inputs/outputs for loop=1/1 for binary=out.ttm
2024-08-21 18:03:13,330 - DEBUG - starting loop=1/1 for binary=out.ttm
2024-08-21 18:03:13,330 - DEBUG - finished loop=1/1 for binary=out.ttm
2024-08-21 18:03:13,330 - DEBUG - checking identity with rtol=1e-05 and atol=0.01
2024-08-21 18:03:13,330 - ERROR - Failed: inputs and outputs do not match in binary
2024-08-21 18:03:13,330 - ERROR - tensor([[-1.1258, -1.1524, -0.2506,  ...,  1.1648,  0.9234,  1.3873],
        [-0.8834, -0.4189, -0.8048,  ...,  0.1447,  1.9029,  0.3904],
        [-0.0394, -0.8015, -0.4955,  ...,  0.5541, -0.1817, -0.2345],
        ...,
        [ 0.3735,  2.6150,  0.1530,  ...,  1.0498, -0.2760, -2.1163],
        [-1.2005,  1.4457,  0.1172,  ..., -0.5705, -0.8428, -1.2050],
        [-1.6555,  0.7469,  1.6022,  ..., -0.7476, -1.0687, -0.1856]])
2024-08-21 18:03:13,331 - DEBUG - input tensors for program=0
2024-08-21 18:03:13,332 - DEBUG - tensor([[-1.1258, -1.1524, -0.2506,  ...,  1.1648,  0.9234,  1.3873],
        [-0.8834, -0.4189, -0.8048,  ...,  0.1447,  1.9029,  0.3904],
        [-0.0394, -0.8015, -0.4955,  ...,  0.5541, -0.1817, -0.2345],
        ...,
        [ 0.3735,  2.6150,  0.1530,  ...,  1.0498, -0.2760, -2.1163],
        [-1.2005,  1.4457,  0.1172,  ..., -0.5705, -0.8428, -1.2050],
        [-1.6555,  0.7469,  1.6022,  ..., -0.7476, -1.0687, -0.1856]])

2024-08-21 18:03:13,332 - DEBUG - output tensors for program=0
2024-08-21 18:03:13,332 - DEBUG - tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

2024-08-21 18:03:13,332 - DEBUG - finished executing ttmetal binaries
2024-08-21 18:03:13,332 - DEBUG - finished executing run API
2024-08-21 18:03:13,332 - DEBUG - postprocessing run API
2024-08-21 18:03:13,332 - DEBUG - finished postprocessing run API
2024-08-21 18:03:13,332 - DEBUG - finished run API

Yet checking the exit code afterwards shows zero:

(venv)
[1262] kmabee:yyz-lab-90-special-kmabee-for-reservation-2763199 /localdev/kmabee/mlir2 > $?
-bash: 0: command not found <==== KCM This is exit code zero
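
For illustration, the requested behavior might look roughly like the sketch below. This is not ttrt's actual code: `identity_check` and `run` are hypothetical names, the tensors are replaced by flat lists of floats, and the tolerance formula is an assumption modeled on the rtol/atol values printed in the log above.

```python
import sys


def identity_check(inputs, outputs, rtol=1e-05, atol=1e-02):
    """Return True when every output element is within tolerance of its input.

    Hypothetical helper; the formula |out - in| <= atol + rtol * |in|
    is an assumption, not taken from ttrt.
    """
    if len(inputs) != len(outputs):
        return False
    return all(
        abs(out - inp) <= atol + rtol * abs(inp)
        for inp, out in zip(inputs, outputs)
    )


def run(inputs, outputs):
    """Hypothetical stand-in for the run command: returns an exit status.

    Returns 0 on a match and 1 on a mismatch, instead of only logging
    the error and falling through to a passing status.
    """
    if not identity_check(inputs, outputs):
        print("Failed: inputs and outputs do not match in binary", file=sys.stderr)
        return 1
    return 0
```

A command-line entry point would then finish with something like `sys.exit(run(inputs, outputs))`, so the shell sees a non-zero status whenever the check fails.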

You can hack the checking code in api.py to always throw this error if you have a hard time reproducing it for whatever reason.

I got really unlucky with the run above, where the outputs were all zero, but even in other runs where the outputs are non-zero they still mismatch, and "ttrt run" still returns with a 0 exit code.
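
The loop use case would be straightforward once the exit status is meaningful. A minimal sketch, assuming the command under test reports failure through its exit code: `first_failure` is a hypothetical helper, and `DEMO_CMD` is a stand-in command that always fails, used here in place of a real ttrt invocation.

```python
import subprocess
import sys

# Stand-in command that always exits with status 3 (assumption for the demo;
# a real loop would pass the actual command line to test instead).
DEMO_CMD = [sys.executable, "-c", "raise SystemExit(3)"]


def first_failure(cmd, iterations):
    """Run cmd up to `iterations` times and stop at the first failure.

    Returns (iteration, returncode) for the first non-zero exit status,
    or None if every run exited with 0. Iterations are counted from 1.
    """
    for iteration in range(1, iterations + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return (iteration, result.returncode)
    return None
```

With the current behavior, where the status is always 0, such a loop never stops, and the only alternative is to scrape the log output for the error line.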