intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

Classify difference between Intel port and OpenAI Triton #2030

Open etiotto opened 2 weeks ago

etiotto commented 2 weeks ago

I took a first pass at the differences between the latest OpenAI Triton code upstream and our fork. We have 66 common files showing a difference. To obtain the differences, use the following commands on the latest llvm-target (the commit ID is from our last merge):

```shell
# Search for the last merge commit id in the git log.
COMMIT_ID=`git log | grep "Merge commit" | head -1 | cut -d "'" -f2`

# Obtain the list of modified files and the difference.
echo "*********** MODIFIED FILES ***********"
git diff $COMMIT_ID --diff-filter=CDMRTUXB | grep "diff --" | cut -d"a" -f2- | cut -d" " -f1 | cut -d"/" -f2- 2>&1

echo "*********** DIFFERENCES ***********"
git diff $COMMIT_ID --diff-filter=CDMRTUXB 2>&1
```
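
For scripting this classification, the same listing can be produced from Python. The sketch below is illustrative only: it assumes `git` is on PATH and that the "Merge commit '<id>'" subject line from our merge workflow is stable; it is not part of the repository.

```python
# Illustrative Python equivalent of the shell pipeline above. Assumes `git`
# is on PATH and that merge commits use the "Merge commit '<id>'" subject line.
import subprocess

def last_merge_commit() -> str:
    """Return the commit id quoted in the most recent "Merge commit" subject."""
    subject = subprocess.run(
        ["git", "log", "--grep=Merge commit", "-n", "1", "--format=%s"],
        capture_output=True, text=True, check=True).stdout
    return subject.split("'")[1]

def modified_files(commit_id: str) -> list[str]:
    """List files that differ from the given commit, using git's --name-only."""
    out = subprocess.run(
        ["git", "diff", "--diff-filter=CDMRTUXB", "--name-only", commit_id],
        capture_output=True, text=True, check=True).stdout
    return out.splitlines()

if __name__ == "__main__":
    commit = last_merge_commit()
    files = modified_files(commit)
    print(f"{len(files)} files differ from {commit}")
    print("\n".join(files))
```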

The files containing differences are listed in the table below. The 2nd column, "Upstreamable", indicates whether the differences in that file are upstreamable, and whether we should attempt to upstream them now or in the future (i.e., we first need to upstream our backend in third_party in order to upstream the difference). The 3rd column indicates whether we should move the difference (or the file) into the third_party/intel directory. The 4th column indicates whether the difference could be reduced.

| File | Upstreamable | Movable to third_party/intel | Can be reduced? | Comments |
| --- | --- | --- | --- | --- |
| .pre-commit-config.yaml | Now | N | N | Contains extra pre-commits |
| CMakeLists.txt | Future | N | ? | Is setting LLVM_CONFIG needed? |
| LICENSE | Future | N | N | |
| README.md | Future | N | N | |
| bin/CMakeLists.txt | Future | N | N | |
| bin/RegisterTritonDialects.h | Future | N | N | |
| bin/triton-opt.cpp | Future | N | N | |
| docs/conf.py | N | N | Y | Make it common with upstream |
| docs/index.rst | Future | N | ? | Programming Guide section is not upstream |
| include/triton/Conversion/TritonGPUToLLVM/Utility.h | N | N | Y | Remove differences |
| include/triton/Conversion/TritonToTritonGPU/Passes.td | N | N | Y | Remove differences |
| include/triton/Dialect/Triton/IR/TritonTypes.td | N | N | ? | How to remove "F8E4M3B11FNUZ"? |
| include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td | Now | N | N | Try to upstream changes |
| include/triton/Tools/Sys/GetEnv.hpp | Future | N | Future | Clean up INTEL env. variables |
| lib/Analysis/Utility.cpp | Future | N | N | |
| lib/Conversion/TritonGPUToLLVM/CMakeLists.txt | Future | N | N | |
| lib/Dialect/Triton/IR/Ops.cpp | Now | N | N | |
| lib/Dialect/TritonGPU/IR/CMakeLists.txt | Future | N | N | |
| lib/Dialect/TritonGPU/IR/Dialect.cpp | Future | N | N | Note: some changes for warp layout need to be removed (not upstreamable) |
| lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp | Now | N | N | Try to upstream |
| lib/Target/CMakeLists.txt | Future | N | N | |
| pyproject.toml | ? | N | ? | Can we remove "importlib_metadata" or upstream this change? |
| python/pyproject.toml | ? | N | ? | Can we remove "importlib_metadata" or upstream the change? |
| python/setup.py | Upstream part of the changes now | N | Y | Can "get_install_requires" be removed? Address: # FIXME: pytorch<2.3.0 doesn't support numpy 2.0 |
| python/src/ir.cc | N | N | Y | TODO: align with upstream code to use i8; can we remove "get_threads_per_warp"? |
| python/src/llvm.cc | N | Y | N | Need SLPVectorization = true; how to make that pass work for us? Alternatively, make it vendor specific? |
| python/test/regression/test_cast_matmul.py | Partially | N | Y | Device passing could be upstreamed? |
| python/test/regression/test_functional_regressions.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/instrumentation/test_gpuhello.py | Partially | N | Y | Device passing could be upstreamed? |
| python/test/unit/language/assert_helper.py | Partially | N | Y | Pass device instead of hard-coding it |
| python/test/unit/language/print_helper.py | Partially | N | Y | Pass device instead of hard-coding it |
| python/test/unit/language/test_annotations.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/language/test_block_pointer.py | Partially | N | Y | Remove import intel_extension_for_pytorch, is_cuda() |
| python/test/unit/language/test_conversions.py | Partially | N | Y | Pass device instead of hard-coding it |
| python/test/unit/language/test_core.py | Partially | N | Y | Needs further investigation to reduce diff with upstream |
| python/test/unit/language/test_line_info.py | Partially | N | Y | Line numbers differ; try to reduce diff with upstream |
| python/test/unit/language/test_pipeliner.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/language/test_random.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/language/test_subprocess.py | N | N | Y | Needs further investigation |
| python/test/unit/runtime/test_autotuner.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/runtime/test_cache.py | Partially | N | Y | Pass device instead of hard-coding it |
| python/test/unit/runtime/test_cublas.py | N | N | Y | pytest.skip vs. pytest.xfail difference |
| python/test/unit/runtime/test_driver.py | Future | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/runtime/test_jit.py | Future | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/runtime/test_launch.py | Future | N | Y | Pass device instead of hard-coding it |
| python/test/unit/runtime/test_subproc.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/triton/backends/compiler.py | Future | N | N | |
| python/triton/compiler/compiler.py | Future | N | N | Note: we need to add build_flags into metadata; upstream might accept? |
| python/triton/language/extra/__init__.py | Future | N | N | Should be upstreamable later |
| python/triton/language/semantic.py | Future | N | N | |
| python/triton/runtime/build.py | Future | N | N | |
| python/triton/testing.py | N | N | Y | Clean up USE_WALL_TIME |
| python/triton/tools/compile.py | N | N | ? | What does AMD do for this file? |
| python/tutorials/01-vector-add.py | Partially | N | Y | Remove import intel_extension_for_pytorch; can we pass device? |
| python/tutorials/02-fused-softmax.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/03-matrix-multiplication.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/04-low-memory-dropout.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/05-layer-norm.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/06-fused-attention.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/07-extern-functions.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/08-grouped-gemm.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| test/CMakeLists.txt | Future | N | N | |
| third_party/nvidia/CMakeLists.txt | N | N | Y | Try to make common |
| unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt | N | N | ? | Investigate |
| unittest/Conversion/TritonGPUToLLVM/PTXAsmFormatTest.cpp | N | N | ? | Investigate |
| unittest/Dialect/TritonGPU/CMakeLists.txt | Future | N | N | |

The following table summarizes features (or missing features) that need to be redesigned in order to remove the differences they cause in common files. For example, "warp layout" is a feature of our advanced codegen path that requires some redesign to avoid changes in common files.

| Feature | Files Affected | Comments |
| --- | --- | --- |
| F8E4M3B11FNUZ | include/triton/Dialect/Triton/IR/TritonTypes.td, python/src/ir.cc | Determine how to align with upstream |
| warp layout | lib/Dialect/TritonGPU/IR/Dialect.cpp | How to handle warp layout without invasive changes? |
| passing device | unit tests/tutorials | Need to pass device to tests/tutorials rather than hardcoding "cuda" (see the sketch below) |
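
As an illustration of the "passing device" item, here is a minimal sketch of one common pytest pattern for supplying the device from a fixture instead of hardcoding "cuda". The `--device` option name and the test are hypothetical; the mechanism actually adopted may differ.

```python
# conftest.py -- hypothetical sketch; the --device option name is illustrative.
import pytest

def pytest_addoption(parser):
    parser.addoption("--device", action="store", default="cuda",
                     help="device to run tests on (e.g. cuda, xpu)")

@pytest.fixture
def device(request):
    return request.config.getoption("--device")

# test_example.py -- a test then receives the device instead of hardcoding it.
import torch

def test_vector_add(device):
    x = torch.randn(1024, device=device)
    y = torch.randn(1024, device=device)
    assert torch.allclose(x + y, y + x)
```

With this shape, the same test file runs unchanged on CUDA and XPU builds, which is what would let these files match upstream.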

anmyachev commented 2 weeks ago

@etiotto I moved the PTXAsmFormatTest.cpp test into the third_party/nvidia folder in Triton PR #4608. There should be no differences from upstream in the unittest/Conversion/ folder now.

etiotto commented 2 weeks ago

After PR #2064 lands we will have 48 files with differences (down from 66):

```shell
git diff a78c9c40aca4f6ad80deef39682a32056ea8976f --diff-filter=CDMRTUXB | grep "diff --" | cut -d"a" -f2- | cut -d" " -f1 | cut -d"/" -f2- | wc
     48      48    1682
```
YarShev commented 1 week ago

We have two almost identical documents describing the architecture of Triton: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/ARCHITECTURE.md and https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/getting-started/architecture.rst. I believe we should keep only one, and my preference would be the latter. Do we plan to upstream the document to Triton too?

anmyachev commented 1 week ago

@whitneywhtsang @etiotto The F8E4M3B11FNUZ type could be removed by moving the logic for processing this type to the frontend. For this, by analogy with NVIDIA, we could use the inline_asm_elementwise function, but I did not find a call to this function with Intel asm in the code. Do you know of its use for the Intel backend? Maybe there are other ways to implement this? I would be glad to receive any suggestions.
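
For context, this is roughly what the NVIDIA-side pattern looks like with tl.inline_asm_elementwise (the function and its signature are from the public Triton API; the PTX here is an arbitrary elementwise op, not the FNUZ conversion logic):

```python
import triton
import triton.language as tl

@triton.jit
def elementwise_asm_kernel(X, Y, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(X + offs)
    # Inline PTX applied elementwise; here just a left shift by 3 bits.
    y = tl.inline_asm_elementwise(
        asm="shl.b32 $0, $1, 3;",
        constraints="=r,r",
        args=[x],
        dtype=tl.int32,
        is_pure=True,
        pack=1,
    )
    tl.store(Y + offs, y)
```

A frontend FNUZ path would presumably use the same hook with device-specific conversion code, which is what PVC's lack of inline assembly (noted below) makes difficult.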

whitneywhtsang commented 1 week ago

> @whitneywhtsang @etiotto The F8E4M3B11FNUZ type could be removed by moving the logic for processing this type to the frontend. For this, by analogy with NVIDIA, we could use the inline_asm_elementwise function, but I did not find a call to this function with Intel asm in the code. Do you know of its use for the Intel backend? Maybe there are other ways to implement this? I would be glad to receive any suggestions.

Intel PVC doesn't support inline assembly AFAIK.

YarShev commented 1 week ago

> We have two almost identical documents describing the architecture of Triton: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/ARCHITECTURE.md and https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/getting-started/architecture.rst. I believe we should keep only one, and my preference would be the latter. Do we plan to upstream the document to Triton too?

@etiotto, @whitneywhtsang, any thoughts on this?

whitneywhtsang commented 1 week ago

> We have two almost identical documents describing the architecture of Triton: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/ARCHITECTURE.md and https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/getting-started/architecture.rst. I believe we should keep only one, and my preference would be the latter. Do we plan to upstream the document to Triton too?
>
> @etiotto, @whitneywhtsang, any thoughts on this?

It was copied to architecture.rst in https://github.com/intel/intel-xpu-backend-for-triton/commit/b127c84121e03eee058827012c50aab27c232077. @Devjiu, what's the reason it was copied and not moved?

whitneywhtsang commented 6 days ago

Status update: