intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[UT] regression in test_subprocess.py with the PTDB 0.5.3 #800

Open pbchekin opened 7 months ago

pbchekin commented 7 months ago

12 test cases are failing:

2024-04-02T21:44:57.4181029Z =========================== short test summary info ============================
2024-04-02T21:44:57.4181419Z FAILED language/test_subprocess.py::test_print[device_print-int16] - assert False
2024-04-02T21:44:57.4182039Z FAILED language/test_subprocess.py::test_print[device_print-long] - assert False
2024-04-02T21:44:57.4182626Z FAILED language/test_subprocess.py::test_print[print-int32] - assert False
2024-04-02T21:44:57.4183238Z FAILED language/test_subprocess.py::test_print[device_print-float32] - assert False
2024-04-02T21:44:57.4183857Z FAILED language/test_subprocess.py::test_print[device_print-int8] - assert False
2024-04-02T21:44:57.4184439Z FAILED language/test_subprocess.py::test_print[device_print-int32] - assert False
2024-04-02T21:44:57.4185023Z FAILED language/test_subprocess.py::test_print[device_print-float16] - assert False
2024-04-02T21:44:57.4185590Z FAILED language/test_subprocess.py::test_print[device_print-float64] - assert False
2024-04-02T21:44:57.4186209Z FAILED language/test_subprocess.py::test_print[device_print-uint8] - assert False
2024-04-02T21:44:57.4186820Z FAILED language/test_subprocess.py::test_print[device_print_hex-int16] - assert False
2024-04-02T21:44:57.4187416Z FAILED language/test_subprocess.py::test_print[device_print_hex-int32] - assert False
2024-04-02T21:44:57.4188022Z FAILED language/test_subprocess.py::test_print[device_print_hex-int64] - assert False
2024-04-02T21:44:57.4188461Z ======================== 12 failed, 21 passed in 24.07s ========================
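
To see at a glance which kernel variants and dtypes are affected, the summary above can be grouped by parametrization ID. The following is a minimal sketch, not part of the test suite: the helper name `group_failures` is hypothetical, and it assumes each ID has the form `<variant>-<dtype>` as in the log lines above.

```python
import re
from collections import defaultdict

# Matches pytest short-summary lines such as
#   "<timestamp> FAILED language/test_subprocess.py::test_print[device_print-int16] - assert False"
# A leading CI timestamp and a trailing failure reason are simply ignored.
FAILED_RE = re.compile(r"FAILED\s+(?P<path>\S+?)::(?P<test>\w+)\[(?P<params>[^\]]+)\]")


def group_failures(log_text):
    """Return {kernel_variant: [dtype, ...]} for every FAILED line in log_text."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match is None:
            continue  # not a failure line (e.g. the "12 failed, 21 passed" footer)
        # Assumption: the ID is "<variant>-<dtype>", so split on the last dash.
        variant, _, dtype = match.group("params").rpartition("-")
        groups[variant].append(dtype)
    return dict(groups)


# A three-line excerpt of the log above, plus its footer.
sample = """\
2024-04-02T21:44:57.4181419Z FAILED language/test_subprocess.py::test_print[device_print-int16] - assert False
2024-04-02T21:44:57.4182626Z FAILED language/test_subprocess.py::test_print[print-int32] - assert False
2024-04-02T21:44:57.4186820Z FAILED language/test_subprocess.py::test_print[device_print_hex-int16] - assert False
2024-04-02T21:44:57.4188461Z ======================== 12 failed, 21 passed in 24.07s ========================
"""

for variant, dtypes in group_failures(sample).items():
    print(variant, dtypes)
```

Passing the full CI log instead of `sample` would group all 12 failures the same way.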
whitneywhtsang commented 7 months ago

Continues to fail with Agama 821.32.

quintinwang5 commented 7 months ago

Blocked by a bug in the new driver. A JIRA ticket has already been filed.

quintinwang5 commented 7 months ago

This does not appear to be a driver bug, because the driver team cannot reproduce it with oneAPI 2024.0. I confirmed that in the same environment (which should be 821.30), 2024.0 works but 2024.1 fails, so this may be a compiler regression. I will file a new JIRA ticket with the compiler team.

vlad-penkin commented 5 months ago

This issue needs to be rechecked after the June Rolling Driver release.

AshburnLee commented 5 months ago

Continues to fail with Agama 821.35 and 881.19 (6/12/2024).

AshburnLee commented 5 months ago

Continues to fail with Agama 821 (6/17/2024).

AshburnLee commented 4 months ago

Continues to fail with Agama 821.

vlad-penkin commented 4 months ago

@AshburnLee could you please retest with Agama 914?

AshburnLee commented 4 months ago

Continues to fail with Agama 914.

AshburnLee commented 4 months ago

Continues to fail with Agama 914.

anmyachev commented 4 months ago

Reminder: don't forget to remove: https://github.com/intel/intel-xpu-backend-for-triton/blob/fea510c02ff1bb7b82d2cef31a8ba5fadddf8916/python/test/unit/language/print_helper.py#L119-L122

vlad-penkin commented 3 months ago

With PTDB 0.5.3 and Agama 950, this issue is still reproducible. Without the repr fix, 29 test variants are failing:

FAILED language/test_subprocess.py::test_print[device_print-int8]
FAILED language/test_subprocess.py::test_print[device_print-uint8]
FAILED language/test_subprocess.py::test_print[device_print-int16]
FAILED language/test_subprocess.py::test_print[device_print-int32]
FAILED language/test_subprocess.py::test_print[device_print-long]
FAILED language/test_subprocess.py::test_print[device_print-float16]
FAILED language/test_subprocess.py::test_print[device_print-float32]
FAILED language/test_subprocess.py::test_print[device_print-float64]
FAILED language/test_subprocess.py::test_print[device_print_scalar-int8]
FAILED language/test_subprocess.py::test_print[device_print_scalar-uint8]
FAILED language/test_subprocess.py::test_print[device_print_scalar-int16]
FAILED language/test_subprocess.py::test_print[device_print_scalar-int32]
FAILED language/test_subprocess.py::test_print[device_print_scalar-long]
FAILED language/test_subprocess.py::test_print[device_print_scalar-float16]
FAILED language/test_subprocess.py::test_print[device_print_scalar-float32]
FAILED language/test_subprocess.py::test_print[device_print_scalar-float64]
FAILED language/test_subprocess.py::test_print[print-int32]
FAILED language/test_subprocess.py::test_print[static_print-int32]
FAILED language/test_subprocess.py::test_print[no_arg_print-int32]
FAILED language/test_subprocess.py::test_print[print_no_arg-int32]
FAILED language/test_subprocess.py::test_print[device_print_large-int32]
FAILED language/test_subprocess.py::test_print[print_multiple_args-int32]
FAILED language/test_subprocess.py::test_print[device_print_multiple_args-int32]
FAILED language/test_subprocess.py::test_print[device_print_hex-int16]
FAILED language/test_subprocess.py::test_print[device_print_hex-int32]
FAILED language/test_subprocess.py::test_print[device_print_hex-int64]
FAILED language/test_subprocess.py::test_print[device_print_pointer-int32]
FAILED language/test_subprocess.py::test_print[device_print_negative-int32]
FAILED language/test_subprocess.py::test_print[device_print_uint-uint32]

With the repr fix, 6 test variants are failing. All 6 are included in the default skip list:

test/unit/language/test_subprocess.py::test_print[device_print-float16]
test/unit/language/test_subprocess.py::test_print[device_print-float32]
test/unit/language/test_subprocess.py::test_print[device_print-float64]
test/unit/language/test_subprocess.py::test_print[device_print_scalar-float16]
test/unit/language/test_subprocess.py::test_print[device_print_scalar-float64]
test/unit/language/test_subprocess.py::test_print[device_print_scalar-float32]
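
For a local run that excludes the six variants still failing with the repr fix, one option is pytest's `--deselect` option. The following is a sketch only: the thread does not show how the project's default skip list is actually applied, so `--deselect` is used here as a stand-in, the helper name `deselect_args` is hypothetical, and the node IDs are assumed to be relative to the directory pytest is launched from.

```python
import shlex

# Node IDs copied from the list above.
STILL_FAILING = [
    "test/unit/language/test_subprocess.py::test_print[device_print-float16]",
    "test/unit/language/test_subprocess.py::test_print[device_print-float32]",
    "test/unit/language/test_subprocess.py::test_print[device_print-float64]",
    "test/unit/language/test_subprocess.py::test_print[device_print_scalar-float16]",
    "test/unit/language/test_subprocess.py::test_print[device_print_scalar-float64]",
    "test/unit/language/test_subprocess.py::test_print[device_print_scalar-float32]",
]


def deselect_args(node_ids):
    """Build a flat ["--deselect", id, "--deselect", id, ...] argument list."""
    args = []
    for node_id in node_ids:
        args += ["--deselect", node_id]
    return args


# Assemble the command without running it; shlex.join quotes the
# square brackets so the line can be pasted into a POSIX shell.
command = ["pytest", "test/unit/language/test_subprocess.py"] + deselect_args(STILL_FAILING)
print(shlex.join(command))
```

All six remaining IDs carry a floating-point dtype (`float16`, `float32`, or `float64`), which is visible directly in the list.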
vlad-penkin commented 3 months ago

@etiotto and @whitneywhtsang what are the next steps to resolve the issue?

whitneywhtsang commented 3 months ago

> @etiotto and @whitneywhtsang what are the next steps to resolve the issue?

A CMPLRLLVM ticket is open for this issue, and we should continue to follow up there to get it fixed.