intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

Classify difference between Intel port and OpenAI Triton #2030

Open etiotto opened 2 weeks ago

etiotto commented 2 weeks ago

I took a first pass at the differences between the latest OpenAI Triton code upstream and our fork. We have 66 common files showing a difference. To obtain the differences, use the following commands on the latest llvm-target (the commit ID is from our last merge):

```shell
# Search for the last merge commit id in the git log.
COMMIT_ID=`git log | grep "Merge commit" | head -1 | cut -d "'" -f2`

# Obtain the list of modified files and the difference.
echo "*********** MODIFIED FILES ***********"
git diff $COMMIT_ID --diff-filter=CDMRTUXB | grep "diff --" | cut -d"a" -f2- | cut -d" " -f1 | cut -d"/" -f2- 2>&1

echo "*********** DIFFERENCES ***********"
git diff $COMMIT_ID --diff-filter=CDMRTUXB 2>&1
```
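
For scripting this classification, the same listing can be produced from Python. The sketch below is illustrative only: it assumes `git` is on PATH and that the "Merge commit '<id>'" subject line from our merge workflow is stable; it is not part of the repository.

```python
# Illustrative Python equivalent of the shell pipeline above. Assumes `git`
# is on PATH and that merge commits use the "Merge commit '<id>'" subject line.
import subprocess

def last_merge_commit() -> str:
    """Return the commit id quoted in the most recent "Merge commit" subject."""
    subject = subprocess.run(
        ["git", "log", "--grep=Merge commit", "-n", "1", "--format=%s"],
        capture_output=True, text=True, check=True).stdout
    return subject.split("'")[1]

def modified_files(commit_id: str) -> list[str]:
    """List files that differ from the given commit, using git's --name-only."""
    out = subprocess.run(
        ["git", "diff", "--diff-filter=CDMRTUXB", "--name-only", commit_id],
        capture_output=True, text=True, check=True).stdout
    return out.splitlines()

if __name__ == "__main__":
    commit = last_merge_commit()
    files = modified_files(commit)
    print(f"{len(files)} files differ from {commit}")
    print("\n".join(files))
```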

The files containing differences are listed in the table below. The 2nd column, "Upstreamable", indicates whether the differences in that file are upstreamable, and whether we should attempt to upstream them now or in the future (i.e., we first need to upstream our backend in third_party in order to upstream the difference). The 3rd column indicates whether we should move the difference (or the file) into the third_party/intel directory. The 4th column indicates whether the difference could be reduced.

| File | Upstreamable | Movable to third_party/intel | Can be reduced? | Comments |
| --- | --- | --- | --- | --- |
| .pre-commit-config.yaml | Now | N | N | Contains extra pre-commits |
| CMakeLists.txt | Future | N | ? | Is setting LLVM_CONFIG needed? |
| LICENSE | Future | N | N | |
| README.md | Future | N | N | |
| bin/CMakeLists.txt | Future | N | N | |
| bin/RegisterTritonDialects.h | Future | N | N | |
| bin/triton-opt.cpp | Future | N | N | |
| docs/conf.py | N | N | Y | Make it common with upstream |
| docs/index.rst | Future | N | ? | Programming Guide section is not upstream |
| include/triton/Conversion/TritonGPUToLLVM/Utility.h | N | N | Y | Remove differences |
| include/triton/Conversion/TritonToTritonGPU/Passes.td | N | N | Y | Remove differences |
| include/triton/Dialect/Triton/IR/TritonTypes.td | N | N | ? | How to remove "F8E4M3B11FNUZ"? |
| include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td | Now | N | N | Try to upstream changes |
| include/triton/Tools/Sys/GetEnv.hpp | Future | N | Future | Clean up INTEL env. variables |
| lib/Analysis/Utility.cpp | Future | N | N | |
| lib/Conversion/TritonGPUToLLVM/CMakeLists.txt | Future | N | N | |
| lib/Dialect/Triton/IR/Ops.cpp | Now | N | N | |
| lib/Dialect/TritonGPU/IR/CMakeLists.txt | Future | N | N | |
| lib/Dialect/TritonGPU/IR/Dialect.cpp | Future | N | N | Note: some changes for warp layout need to be removed (not upstreamable) |
| lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp | Now | N | N | Try to upstream |
| lib/Target/CMakeLists.txt | Future | N | N | |
| pyproject.toml | ? | N | ? | Can we remove "importlib_metadata" or upstream this change? |
| python/pyproject.toml | ? | N | ? | Can we remove "importlib_metadata" or upstream the change? |
| python/setup.py | Upstream part of the changes now | N | Y | Can "get_install_requires" be removed? Address: # FIXME: pytorch<2.3.0 doesn't support numpy 2.0 |
| python/src/ir.cc | N | N | Y | TODO: align with upstream code to use i8; can we remove "get_threads_per_warp"? |
| python/src/llvm.cc | N | Y | N | Need SLPVectorization = true; how to make that pass work for us? Alternatively, make it vendor specific? |
| python/test/regression/test_cast_matmul.py | Partially | N | Y | Device passing could be upstreamed? |
| python/test/regression/test_functional_regressions.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/instrumentation/test_gpuhello.py | Partially | N | Y | Device passing could be upstreamed? |
| python/test/unit/language/assert_helper.py | Partially | N | Y | Pass device instead of hard-coding it |
| python/test/unit/language/print_helper.py | Partially | N | Y | Pass device instead of hard-coding it |
| python/test/unit/language/test_annotations.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/language/test_block_pointer.py | Partially | N | Y | Remove import intel_extension_for_pytorch, is_cuda() |
| python/test/unit/language/test_conversions.py | Partially | N | Y | Pass device instead of hard-coding it |
| python/test/unit/language/test_core.py | Partially | N | Y | Needs further investigation to reduce diff with upstream |
| python/test/unit/language/test_line_info.py | Partially | N | Y | Line numbers differ; try to reduce diff with upstream |
| python/test/unit/language/test_pipeliner.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/language/test_random.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/language/test_subprocess.py | N | N | Y | Needs further investigation |
| python/test/unit/runtime/test_autotuner.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/runtime/test_cache.py | Partially | N | Y | Pass device instead of hard-coding it |
| python/test/unit/runtime/test_cublas.py | N | N | Y | pytest.skip vs. pytest.xfail difference |
| python/test/unit/runtime/test_driver.py | Future | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/runtime/test_jit.py | Future | N | Y | Remove import intel_extension_for_pytorch |
| python/test/unit/runtime/test_launch.py | Future | N | Y | Pass device instead of hard-coding it |
| python/test/unit/runtime/test_subproc.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/triton/backends/compiler.py | Future | N | N | |
| python/triton/compiler/compiler.py | Future | N | N | Note: we need to add build_flags into metadata; upstream might accept? |
| python/triton/language/extra/__init__.py | Future | N | N | Should be upstreamable later |
| python/triton/language/semantic.py | Future | N | N | |
| python/triton/runtime/build.py | Future | N | N | |
| python/triton/testing.py | N | N | Y | Clean up USE_WALL_TIME |
| python/triton/tools/compile.py | N | N | ? | What does AMD do for this file? |
| python/tutorials/01-vector-add.py | Partially | N | Y | Remove import intel_extension_for_pytorch; can we pass device? |
| python/tutorials/02-fused-softmax.py | N | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/03-matrix-multiplication.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/04-low-memory-dropout.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/05-layer-norm.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/06-fused-attention.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/07-extern-functions.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| python/tutorials/08-grouped-gemm.py | Partially | N | Y | Remove import intel_extension_for_pytorch |
| test/CMakeLists.txt | Future | N | N | |
| third_party/nvidia/CMakeLists.txt | N | N | Y | Try to make common |
| unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt | N | N | ? | Investigate |
| unittest/Conversion/TritonGPUToLLVM/PTXAsmFormatTest.cpp | N | N | ? | Investigate |
| unittest/Dialect/TritonGPU/CMakeLists.txt | Future | N | N | |

The following table summarizes features (or missing features) that need to be redesigned in order to remove the differences they cause in common files. For example, "warp layout" is a feature of our advanced codegen path that requires some redesign to avoid changes in common files.

| Feature | Files Affected | Comments |
| --- | --- | --- |
| F8E4M3B11FNUZ | include/triton/Dialect/Triton/IR/TritonTypes.td, python/src/ir.cc | Determine how to align with upstream |
| warp layout | lib/Dialect/TritonGPU/IR/Dialect.cpp | How to handle warp layout without invasive changes? |
| passing device | unit tests/tutorials | Need to pass device to tests/tutorials rather than hardcoding "cuda" (see the sketch below) |
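
As an illustration of the "passing device" item, here is a minimal sketch of one common pytest pattern for supplying the device from a fixture instead of hardcoding "cuda". The `--device` option name and the test are hypothetical; the mechanism actually adopted may differ.

```python
# conftest.py -- hypothetical sketch; the --device option name is illustrative.
import pytest

def pytest_addoption(parser):
    parser.addoption("--device", action="store", default="cuda",
                     help="device to run tests on (e.g. cuda, xpu)")

@pytest.fixture
def device(request):
    return request.config.getoption("--device")

# test_example.py -- a test then receives the device instead of hardcoding it.
import torch

def test_vector_add(device):
    x = torch.randn(1024, device=device)
    y = torch.randn(1024, device=device)
    assert torch.allclose(x + y, y + x)
```

With this shape, the same test file runs unchanged on CUDA and XPU builds, which is what would let these files match upstream.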

anmyachev commented 2 weeks ago

@etiotto I moved the PTXAsmFormatTest.cpp test into the third_party/nvidia folder in Triton PR #4608. There should be no differences from upstream in the unittest/Conversion/ folder now.

etiotto commented 2 weeks ago

After PR #2064 lands we will have 48 files with differences (down from 66):

```shell
git diff a78c9c40aca4f6ad80deef39682a32056ea8976f --diff-filter=CDMRTUXB | grep "diff --" | cut -d"a" -f2- | cut -d" " -f1 | cut -d"/" -f2- | wc
     48      48    1682
```
YarShev commented 1 week ago

We have two almost identical documents describing the architecture of Triton: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/ARCHITECTURE.md and https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/getting-started/architecture.rst. I believe we should keep only one, and my preference would be the latter. Do we plan to upstream the document to Triton too?

anmyachev commented 1 week ago

@whitneywhtsang @etiotto The F8E4M3B11FNUZ type could be removed by moving the logic for processing this type to the frontend. For this, by analogy with NVIDIA, we could use the inline_asm_elementwise function, but I did not find a call to this function with Intel asm in the code. Do you know of its use for the Intel backend? Maybe there are other ways to implement this? I would be glad to receive any suggestions.
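
For context, this is roughly what the NVIDIA-side pattern looks like with tl.inline_asm_elementwise (the function and its signature are from the public Triton API; the PTX here is an arbitrary elementwise op, not the FNUZ conversion logic):

```python
import triton
import triton.language as tl

@triton.jit
def elementwise_asm_kernel(X, Y, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(X + offs)
    # Inline PTX applied elementwise; here just a left shift by 3 bits.
    y = tl.inline_asm_elementwise(
        asm="shl.b32 $0, $1, 3;",
        constraints="=r,r",
        args=[x],
        dtype=tl.int32,
        is_pure=True,
        pack=1,
    )
    tl.store(Y + offs, y)
```

A frontend FNUZ path would presumably use the same hook with device-specific conversion code, which is what PVC's lack of inline assembly (noted below) makes difficult.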

whitneywhtsang commented 1 week ago

> @whitneywhtsang @etiotto The F8E4M3B11FNUZ type could be removed by moving the logic for processing this type to the frontend. For this, by analogy with NVIDIA, we could use the inline_asm_elementwise function, but I did not find a call to this function with Intel asm in the code. Do you know of its use for the Intel backend? Maybe there are other ways to implement this? I would be glad to receive any suggestions.

Intel PVC doesn't support inline assembly AFAIK.

YarShev commented 1 week ago

> We have two almost identical documents describing the architecture of Triton: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/ARCHITECTURE.md and https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/getting-started/architecture.rst. I believe we should keep only one, and my preference would be the latter. Do we plan to upstream the document to Triton too?

@etiotto, @whitneywhtsang, any thoughts on this?

whitneywhtsang commented 1 week ago

> We have two almost identical documents describing the architecture of Triton: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/ARCHITECTURE.md and https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/docs/getting-started/architecture.rst. I believe we should keep only one, and my preference would be the latter. Do we plan to upstream the document to Triton too?
>
> @etiotto, @whitneywhtsang, any thoughts on this?

It was copied to architecture.rst in https://github.com/intel/intel-xpu-backend-for-triton/commit/b127c84121e03eee058827012c50aab27c232077. @Devjiu, what's the reason it was copied and not moved?

whitneywhtsang commented 6 days ago

Status update: