openxla / xla

A machine learning compiler for GPUs, CPUs, and ML accelerators
Apache License 2.0

PR #17814: [ROCM] buffer_comparator init bugfix #17969

Closed copybara-service[bot] closed 6 days ago

copybara-service[bot] commented 1 week ago

PR #17814: [ROCM] buffer_comparator init bugfix

Imported from GitHub PR https://github.com/openxla/xla/pull/17814

PR https://github.com/openxla/xla/pull/11880 introduced a latent bug on the ROCm side that was very hard to track down. Because of gemm_algorithm_picker, the problem occurs only for non-zero beta and only when the output matrix is large enough (so that the first two runs cannot fill it). This results in buffer comparator errors like:

[ RUN      ] CublasLtGemmRewriteTest.LargerBiasMultipleUsersNoRewrite
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1727688442.093248 2145761 buffer_comparator.cc:157] Difference at 10069: -522.617, expected -261.495
E0000 00:00:1727688442.093370 2145761 buffer_comparator.cc:157] Difference at 10070: -520.456, expected -260.414
E0000 00:00:1727688442.093376 2145761 buffer_comparator.cc:157] Difference at 10071: -523.774, expected -262.073
E0000 00:00:1727688442.093381 2145761 buffer_comparator.cc:157] Difference at 10072: -524.935, expected -262.654
E0000 00:00:1727688442.093385 2145761 buffer_comparator.cc:157] Difference at 10073: -520.083, expected -260.228
E0000 00:00:1727688442.093389 2145761 buffer_comparator.cc:157] Difference at 10074: -522.771, expected -261.572
E0000 00:00:1727688442.093393 2145761 buffer_comparator.cc:157] Difference at 10075: -519.994, expected -260.183
E0000 00:00:1727688442.093396 2145761 buffer_comparator.cc:157] Difference at 10076: -524.838, expected -262.605
E0000 00:00:1727688442.093400 2145761 buffer_comparator.cc:157] Difference at 10077: -520.376, expected -260.374
E0000 00:00:1727688442.093404 2145761 buffer_comparator.cc:157] Difference at 10078: -521.808, expected -261.09
2024-09-30 09:27:22.093423: E xla/service/gpu/autotuning/gemm_algorithm_picker.cc:348] Results mismatch between different GEMM algorithms. This is likely a bug/unexpected loss of precision.
E0000 00:00:1727688442.095749 2145761 buffer_comparator.cc:157] Difference at 10069: -783.74, expected -261.495
E0000 00:00:1727688442.095766 2145761 buffer_comparator.cc:157] Difference at 10070: -780.498, expected -260.414
E0000 00:00:1727688442.095770 2145761 buffer_comparator.cc:157] Difference at 10071: -785.475, expected -262.073
E0000 00:00:1727688442.095774 2145761 buffer_comparator.cc:157] Difference at 10072: -787.216, expected -262.654
E0000 00:00:1727688442.095778 2145761 buffer_comparator.cc:157] Difference at 10073: -779.939, expected -260.228
E0000 00:00:1727688442.095782 2145761 buffer_comparator.cc:157] Difference at 10074: -783.97, expected -261.572
E0000 00:00:1727688442.095785 2145761 buffer_comparator.cc:157] Difference at 10075: -779.805, expected -260.183
E0000 00:00:1727688442.095789 2145761 buffer_comparator.cc:157] Difference at 10076: -787.071, expected -262.605
E0000 00:00:1727688442.095793 2145761 buffer_comparator.cc:157] Difference at 10077: -780.378, expected -260.374
E0000 00:00:1727688442.095797 2145761 buffer_comparator.cc:157] Difference at 10078: -782.526, expected -261.09

In fact, these mismatches were caused only by uninitialized buffers. @xla-rotation, could you please take a look?
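To illustrate why buffer contents only matter for non-zero beta, here is a toy model of a GEMM epilogue that updates its output in place, out = alpha * prod + beta * out. It is a sketch under stated assumptions, not the XLA code or the actual fix in this PR: `GemmEpilogueInPlace` and `RunRepeated` are hypothetical names, and `prod` stands in for an already computed A*B product. In the log above, the reported values are close to two and three times the expected ones, which is the kind of pattern repeated runs over a buffer that is not reset could produce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy epilogue: out = alpha * prod + beta * out.
// The output buffer doubles as the beta operand, so whatever it holds
// before the call leaks into the result whenever beta != 0.
void GemmEpilogueInPlace(const std::vector<float>& prod, float alpha,
                         float beta, std::vector<float>& out) {
  assert(prod.size() == out.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = alpha * prod[i] + beta * out[i];
  }
}

// Runs the epilogue `runs` times on the same buffer (alpha fixed to 1).
// With `reinit` set, the buffer is restored to `initial` before every run,
// which models re-initializing the output between candidate algorithms.
std::vector<float> RunRepeated(const std::vector<float>& prod,
                               const std::vector<float>& initial, float beta,
                               int runs, bool reinit) {
  std::vector<float> out = initial;
  for (int r = 0; r < runs; ++r) {
    if (reinit) out = initial;
    GemmEpilogueInPlace(prod, 1.0f, beta, out);
  }
  return out;
}
```

With `prod = {2, 4}`, `initial = {0.5, 1}` and `beta = 1`, a single run yields `{2.5, 5}`, while two runs without re-initialization yield `{4.5, 9}`. With `beta = 0` the result is `{2, 4}` regardless of what the buffer held, which matches the observation that the problem appears only for non-zero beta.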

Copybara import of the project:

-- 58cd0e78dc19075e7c935d7cdb31676ce868e64c by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

buffer_comparator init fix

Merging this change closes #17814

FUTURE_COPYBARA_INTEGRATE_REVIEW=https://github.com/openxla/xla/pull/17814 from ROCm:ci_buffer_initialization_fix 58cd0e78dc19075e7c935d7cdb31676ce868e64c