intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[tracking] check skipped tests for Agama 914 #936

Closed pbchekin closed 3 months ago

pbchekin commented 7 months ago
test_dot[1-64-128-128-4-True-True-none-tf32-int8-int8-1_0]
test_dot[1-64-128-128-4-True-True-none-tf32-int8-int8-1_1]
test_dot[1-64-128-128-4-False-True-none-tf32-int8-int8-1_0]
test_dot[1-64-128-128-4-False-True-none-tf32-int8-int8-1_1]

Errors:

FAILED language/test_core.py::test_dot[1-64-128-128-4-True-True-none-tf32-int8-int8-1_0] - AssertionError: 
Not equal to tolerance rtol=0.01, atol=0.001

Mismatched elements: 1912 / 8192 (23.3%)
Max absolute difference: 24077
Max relative difference: 1280.42857143
 x: array([[  -8771,   97832,   94483, ...,   35111,   22460,   70087],
       [-124117, -115193,    1862, ...,   -2921,  -65821,  -26059],
       [ -55616,   23155,   64353, ...,  100224,   18555,   42140],...
 y: array([[ -19063,   92262,   85278, ...,   35111,   22460,   70087],
       [-121003, -114413,     854, ...,   -2921,  -65821,  -26059],
       [ -56355,   25225,   74041, ...,  100224,   18555,   42140],...
FAILED language/test_core.py::test_dot[1-64-128-128-4-True-True-none-tf32-int8-int8-1_1] - AssertionError: 
Not equal to tolerance rtol=0.01, atol=0.001

Mismatched elements: 1912 / 8192 (23.3%)
Max absolute difference: 24077
Max relative difference: 1280.42857143
 x: array([[  -8771,   97832,   94483, ...,   35111,   22460,   70087],
       [-124117, -115193,    1862, ...,   -2921,  -65821,  -26059],
       [ -55616,   23155,   64353, ...,  100224,   18555,   42140],...
 y: array([[ -19063,   92262,   85278, ...,   35111,   22460,   70087],
       [-121003, -114413,     854, ...,   -2921,  -65821,  -26059],
       [ -56355,   25225,   74041, ...,  100224,   18555,   42140],...
FAILED language/test_core.py::test_dot[1-64-128-128-4-False-True-none-tf32-int8-int8-1_0] - AssertionError: 
Not equal to tolerance rtol=0.01, atol=0.001

Mismatched elements: 1961 / 8192 (23.9%)
Max absolute difference: 42061
Max relative difference: 626.55555556
 x: array([[ -70131,  -76608,   36342, ...,    -437,    2562,   37103],
       [ -10398,   -2496, -111479, ...,  114795,   89840,   25838],
       [  84214,  -54418, -112739, ...,  -14705,   70750,   20548],...
 y: array([[ -69119,  -78742,   62848, ...,    -437,    2562,   37103],
       [   -994,    -338,  -99921, ...,  114795,   89840,   25838],
       [  92654,  -52598, -101119, ...,  -14705,   70750,   20548],...
FAILED language/test_core.py::test_dot[1-64-128-128-4-False-True-none-tf32-int8-int8-1_1] - AssertionError: 
Not equal to tolerance rtol=0.01, atol=0.001

Mismatched elements: 1961 / 8192 (23.9%)
Max absolute difference: 42061
Max relative difference: 626.55555556
 x: array([[ -70131,  -76608,   36342, ...,    -437,    2562,   37103],
       [ -10398,   -2496, -111479, ...,  114795,   89840,   25838],
       [  84214,  -54418, -112739, ...,  -14705,   70750,   20548],...
 y: array([[ -69119,  -78742,   62848, ...,    -437,    2562,   37103],
       [   -994,    -338,  -99921, ...,  114795,   89840,   25838],
       [  92654,  -52598, -101119, ...,  -14705,   70750,   20548],...

See also https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/8757168039/job/24035243552#step:12:23117
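
For readers unfamiliar with the message format: "Not equal to tolerance rtol=0.01, atol=0.001" looks like the output of numpy.testing.assert_allclose, which treats an element as matching when |x - y| <= atol + rtol * |y|. The sketch below re-implements that rule in plain Python (an assumption about how the test compares results, not code taken from test_core.py) and applies it to the elements of the first row that were not elided in the first failure above. The helper names are made up for illustration.

```python
# Sketch of the element-wise tolerance rule, assuming
# numpy.testing.assert_allclose semantics:
#   an element matches when |x - y| <= atol + rtol * |y|
# Helper names are hypothetical, not from the Triton test suite.

def is_close(x, y, rtol=0.01, atol=0.001):
    """Return True if x is within the combined tolerance of y."""
    return abs(x - y) <= atol + rtol * abs(y)


def count_mismatches(xs, ys, rtol=0.01, atol=0.001):
    """Count element pairs that violate the tolerance rule."""
    return sum(not is_close(x, y, rtol, atol) for x, y in zip(xs, ys))


# First row of the x and y arrays from the first failure
# (only the elements that the log did not elide).
x_row = [-8771, 97832, 94483, 35111, 22460, 70087]
y_row = [-19063, 92262, 85278, 35111, 22460, 70087]

mismatches = count_mismatches(x_row, y_row)
print(f"{mismatches} of {len(x_row)} shown elements exceed the tolerance")
```

The differing elements are off by thousands, far beyond the roughly 1% allowed, so this does not look like a case of the tolerance being slightly too tight.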

AshburnLee commented 6 months ago

It seems these 4 cases are already in the default skip list. In my local test, these 4 cases were skipped:

# env: PVC with agama 821.35
source ./scripts/pytest-utils.sh
cd python/test/unit
TRITON_TEST_SUITE=language pytest -vvv -n 8 --device xpu language/ --ignore=language/test_line_info.py --ignore=language/test_subprocess.py

Maybe this has been solved somehow.

AshburnLee commented 6 months ago

Cases run by test-triton.sh have 5 result statuses: failed, passed, skipped, xfailed, and warnings.

Are we supposed to put both failed and skipped cases into the skip list, or just the skipped ones?

pbchekin commented 6 months ago

> Maybe this has been solved somehow.

This issue is to track 4 failing test cases with Agama 821.35. These test cases passed before, so it is a regression and it needs to be investigated.

pbchekin commented 6 months ago

> Are we supposed to put both failed and skipped cases into the skip list, or just the skipped ones?

There are two options to skip a test case:

  1. Update Python code with pytest.skip for the specific conditions.
  2. Add this test case to the skip list.

We probably want to use the latter method, because it allows skipping tests depending on the environment. For example, we can have a skip list for PVC with the rolling driver, PVC with the LTS driver, A770 with the rolling/LTS driver, and so on. We currently use both methods as a transitional step. The plan is to use the skip list for new failures (regressions) and to gradually replace pytest.skip calls with entries in the corresponding skip list.
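
A minimal sketch of the per-environment skip list idea follows. The text format, the environment keys ("pvc-rolling", "pvc-lts"), and the function names are all assumptions made for illustration; they are not the layout this repository actually uses.

```python
# Hypothetical sketch: one plain-text skip list per environment.
# The format, environment keys, and function names are assumptions,
# not the repository's real skip list layout.

def parse_skiplist(text):
    """Return the set of test IDs in text, ignoring blank lines and # comments."""
    ids = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


# One skip list per environment (hardware plus driver flavor).
SKIPLISTS = {
    "pvc-rolling": parse_skiplist("""
        # regression tracked in this issue (illustrative entries)
        language/test_core.py::test_dot[1-64-128-128-4-True-True-none-tf32-int8-int8-1_0]
        language/test_core.py::test_dot[1-64-128-128-4-True-True-none-tf32-int8-int8-1_1]
    """),
    "pvc-lts": parse_skiplist(""),
}


def should_skip(test_id, environment):
    """True if test_id is listed in the skip list for this environment."""
    return test_id in SKIPLISTS.get(environment, set())
```

One way to wire this in would be a pytest_collection_modifyitems hook in conftest.py that calls should_skip(item.nodeid, environment) and adds a pytest.mark.skip marker to matching items. The same test then stays enabled in environments whose list does not mention it, which is the advantage over a hard-coded pytest.skip.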

pbchekin commented 5 months ago

@AshburnLee please verify if these tests fail with the latest rolling.

AshburnLee commented 5 months ago

> @AshburnLee please verify if these tests fail with the latest rolling.

They still fail on the latest llvm-target branch with the current Rolling (821.35). As for the latest rolling: do I need to update the driver version, or is there a platform with the latest Rolling that I can borrow?

pbchekin commented 5 months ago

> They still fail on the latest llvm-target branch with the current Rolling (821.35).

Thanks. The latest rolling is 821.36, I think. It would be nice to check it as well. We want to keep this issue open until the driver or the tests are fixed.

AshburnLee commented 5 months ago

> We want to keep this issue open until the driver or the tests are fixed.

Oh, so we just track it, and there is no need to find the commit that causes those 4 failures? We can do that, but building Triton from very early commits takes effort and time.

pbchekin commented 5 months ago

> Oh, so we just track it, and there is no need to find the commit that causes those 4 failures? We can do that, but building Triton from very early commits takes effort and time.

Right, just tracking at the moment. I don't think a Triton commit caused the failures; they started to fail when we updated the GPU driver.

AshburnLee commented 5 months ago

4 cases still fail on latest llvm-target branch with the current Rolling(821.35). 6/5/2024

AshburnLee commented 5 months ago

4 cases still fail on latest llvm-target branch with the current Rolling (821.35). 6/12/2024
Plus an additional 101 FAILED cases in test_dot: RuntimeError: Triton Error [ZE]: 0x78000018

4 cases still fail on latest llvm-target branch with Rolling 881.19, No extra failed cases.

AshburnLee commented 5 months ago

4 cases still fail on latest llvm-target branch with Rolling 881.19. 4 cases passed on latest llvm-target branch with Rolling 821. 6/17/2024

AshburnLee commented 4 months ago

Got an error while running the tests on 821.35:
/lib/python3.10/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so: undefined symbol: _ZNK4sycl3_V16device32ext_oneapi_supports_cl_extensionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNS0_3ext6oneapi12experimental10cl_versionE

AshburnLee commented 4 months ago

4 cases PASSED on 914 (914.27)

pbchekin commented 4 months ago

Waiting for the new agama release