lamikr / rocm_sdk_builder

Update Python #66

Closed daniandtheweb closed 2 months ago

daniandtheweb commented 3 months ago

The current Python implementation works great, but I think it may be time for an update, since a few projects such as Stable Diffusion Next are now dropping support for any Python version older than 3.10 (3.11 would be preferable).

If it's not too hard, it would be great to be able to choose the Python version with a selector, just like the GPU selection menu. That way users could still use Python 3.9 while also being able to install 3.10 or 3.11.

I'll try experimenting with it on my PC.

lamikr commented 3 months ago

Thanks, I agree that Python could soon be updated to a slightly newer version. In addition to Python, I would at some point like it to be possible to select the following things from the menu instead of modifying them in envsetup.sh

daniandtheweb commented 3 months ago

Apparently the first big issue I'm finding is an incompatibility with barectf during the post install commands of python. I've found the "fix" (a temporary one at least) in the barectf issues section:

echo 'Cython<3' > constraint.txt
PIP_CONSTRAINT=constraint.txt pip3 install barectf

With this the post install commands of python3.11.9 work fine.

daniandtheweb commented 3 months ago

I'm able to reach pytorch-vision when building with python3.11.9, however when building pytorch-vision I keep getting this error:

ninja: error: build.ninja:15: expected '=', got ':'
hipcc-args: --print-multiarch
          ^ near here

Any idea about how to edit the ninja file generation in order to avoid this issue @lamikr ?

daniandtheweb commented 3 months ago

I've successfully built everything with Python 3.11.9, with two caveats.

1: barectf needs the mentioned workaround:

echo 'Cython<3' > constraint.txt
PIP_CONSTRAINT=constraint.txt pip3 install barectf

2: pytorch-vision can't be built with hipcc. I've been unable to find any mention of the issue, so I'm not sure why. However, it builds fine with the default gcc (I don't know how to test the GPU acceleration of pytorch-vision, so I can't confirm that it works).

I'll now prepare a PR that allows users to choose the default Python version to build the project for. I'll also include Python 3.10 in the choices, as it should, at least in theory, just work with the same workarounds as 3.11. If anyone could test it, that would be amazing (I'm quite tired of rebuilding).

lamikr commented 3 months ago

Sorry for not responding earlier, I have been quite busy with other things for the last 2 days. For building Python itself, did you only need to change the

BINFO_APP_UPSTREAM_REPO_VERSION_TAG in the python binfo file?

I was thinking of tagging the second release, based on the rocm-6.1.1 version, tomorrow. I just want it to build rock-solidly on all distros, and so far I have tested Mageia, Ubuntu 22.04 and Ubuntu 24.04.

I also have the patches for rocm-6.1.2 mostly done, and I think we could try to do the Python 3.11 update for the rocm-sdk builder 6.1.2 version. Have you tested whether this GPU test works with Python 3.11?

https://github.com/lamikr/pytorch-gpu-benchmark

daniandtheweb commented 3 months ago

For now I've just tested the base pytorch, and it works fine on 3.11. I still have to test torch-video and torch-audio, but I don't see why they shouldn't work.

In the PR I've prepared I've added a menu to choose a Python version in case the default one doesn't work but I'll revert the menu thing as it's just overcomplicating the config process (I'm quite good at finding out overcomplicated ways to do simple things).

The only two issues with the upgrade to Python 3.11 are that barectf needs to be installed using Cython<3 because of some strange bug, and that pytorch-vision can't be built with hipcc. All the other stuff builds fine.

daniandtheweb commented 3 months ago

Trying to run the benchmark I get this error:

MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm

However it may also have been present using the older Python, I haven't checked.

jeroen-mostert commented 3 months ago

FWIW I could get pytorch-vision to build with hipcc by patching pytorch_vision/setup.py to tweak the GCC-specific logic; specifically, when it assigns platform_tag it suffices to put a platform_tag = "" and guard the whole block with if not is_rocm_pytorch:.

However, I'm getting a different error building onnxruntime; this appears to be unrelated to Python so I'll open a new issue for it.

daniandtheweb commented 3 months ago

That's nice to know, however is pytorch-vision supposed to be built with hipcc in the first place? It's just curiosity as I haven't found any documentation about it.

jeroen-mostert commented 3 months ago

I have no idea. I doubt it as the check for GCC is rather specific in that the script seems to assume that CC (whatever it is) will support GCC's --print-multiarch, and if there is no CC it will fall back to looking for GCC explicitly, but I found the issue easier to fix than figuring out how to change the configuration to explicitly use GCC. Of course that's certainly also still an option. :P

lamikr commented 2 months ago

The reason for the pytorch-vision build error is that, for some reason, build.ninja contains extra lines between the cflags line and the post_cflags line when Python 3.11 is used, compared to when Python 3.9 is used.

build.ninja, Python 3.9 version

ninja_required_version = 1.3
cxx = /opt/rocm_sdk_612/bin/hipcc

cflags = -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -fPIC -DPNG_FOUND=1 -DJPEG_FOUND=1 -DNVJPEG_FOUND=0 -I/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_vision/torchvision/csrc -I/usr/include/libpng16 -I/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_vision/torchvision/csrc -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/TH -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/THC -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/THH -I/opt/rocm_sdk_612/include -I/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_vision/torchvision/csrc/io/image -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/TH -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/THC -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/THH -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/include/python3.9 -c
post_cflags = -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -g0 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1017"' -DTORCH_EXTENSION_NAME=image -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
cuda_dlink_post_cflags =
ldflags =

build.ninja, Python 3.11 version

ninja_required_version = 1.3
cxx = /opt/rocm_sdk_612/bin/hipcc

cflags = -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -fPIC -I/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_vision/torchvision/csrc/io/decoder -I/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_vision/torchvision/csrc/io/video_reader -I/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_vision/torchvision/csrc/io/video -I/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_vision/torchvision/csrc -I/usr/include '-I/usr/include/HIP_PATH=/opt/rocm_sdk_612
HIP_PLATFORM=amd
HIP_COMPILER=clang
HIP_RUNTIME=rocclr
ROCM_PATH=/opt/rocm_sdk_612
HIP_ROCCLR_HOME=/opt/rocm_sdk_612
HIP_CLANG_PATH=/opt/rocm_sdk_612/bin
HIP_INCLUDE_PATH=/opt/rocm_sdk_612/include
HIP_LIB_PATH=/opt/rocm_sdk_612/lib
DEVICE_LIB_PATH=/opt/rocm_sdk_612/amdgcn/bitcode
HIP_CLANG_RT_LIB=/opt/rocm_sdk_612/lib/clang/17/lib/linux
hipcc-args: -print-multiarch
hipcc-cmd: "/opt/rocm_sdk_612/bin/clang" --driver-mode=g++ -O3 --hip-path="/opt/rocm_sdk_612" --hip-link --rtlib=compiler-rt -unwindlib=libgcc  -print-multiarch' -I/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/pytorch_vision/torchvision/csrc -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include/TH -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include/THC -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include/THH -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include/TH -I/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/include/THC -I/opt/rocm_sdk_612/include/python3.11 -c
post_cflags = -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1017"' -DTORCH_EXTENSION_NAME=video_reader -D_GLIBCXX_USE_CXX11_ABI=1
cuda_dlink_post_cflags =
ldflags =

lamikr commented 2 months ago

Those lines are printed by hipcc to stdout (the code is under src_projects/llvm-project/amd/hipcc/src/hipBin_amd.h). For some reason they are not appended to build.ninja when Python 3.9 is used, but they are when Python 3.11 is used.

The code that calls hipcc in this case is here:

src_projects/pytorch/torch/utils/cpp_extension.py (starting from line 2308)

One way to fix this is to disable verbose printouts from hipcc but they are kind of handy, so not sure yet what's the best way to fix this.
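To illustrate why the extra hipcc lines break the file, here is a minimal Python sketch. It is not the actual generator code in cpp_extension.py. It assumes a hypothetical generator that joins all flags into a single `cflags = ...` assignment, plus a crude check that flags any non-blank line without an `=`, which is roughly what the `expected '=', got ':'` error complains about. The function names and the shortened captured output are made up for illustration.

```python
def render_ninja_variable(name, flags):
    """Join flags into one 'name = ...' assignment (hypothetical generator)."""
    return f"{name} = {' '.join(flags)}"


def lines_without_assignment(ninja_text):
    """Crude heuristic: non-blank lines that contain no '=' at all."""
    return [
        line
        for line in ninja_text.splitlines()
        if line.strip() and "=" not in line
    ]


# Shortened, hypothetical stand-in for the verbose hipcc stdout that was
# captured where a single multiarch tag was expected.
captured = (
    "HIP_PLATFORM=amd\n"
    "hipcc-args: -print-multiarch\n"
    "hipcc-cmd: clang -print-multiarch"
)

# The captured text ends up inside one include flag, newlines included.
flags = ["-O3", "-I/usr/include/" + captured, "-c"]
ninja_text = render_ninja_variable("cflags", flags)

for line in lines_without_assignment(ninja_text):
    print("not an assignment:", line)
```

The first flagged line is the `hipcc-args:` line, which matches the location reported in the ninja error earlier in this thread.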

daniandtheweb commented 2 months ago

Is it actually expected for torchvision to be built with hipcc? I'm unable to find any other build guide that sets hipcc as the compiler and even rocm-arch's pkgbuild file just builds using cmake so maybe the real issue is building with hipcc in the first place here.

lamikr commented 2 months ago

On Debian and Ubuntu, the ffmpeg libraries are not directly under /usr/lib but in /usr/lib/x86_64-linux-gnu, and I believe the code tries to get that directory by calling -print-multiarch.

The problem is that this call does not work with hipcc/clang, or with gcc on my distro, and returns an error code. The code in setup.py does not check the error code and instead starts parsing stdout.

I think the correct way to implement this is the following, in pytorch_vision/setup.py starting from line 386:

        gcc = os.environ.get("CC", shutil.which("gcc"))
        # hipcc/clang does not support print-multiarch, so check error code first
        platform_tag = subprocess.run([gcc, "-print-multiarch"], stdout=subprocess.PIPE)
        if platform_tag and platform_tag.returncode == 0:
            # Most probably a Debian-based distribution
            platform_tag = platform_tag.stdout.strip().decode("utf-8")
            ffmpeg_include_dir = [ffmpeg_include_dir, os.path.join(ffmpeg_include_dir, platform_tag)]
            ffmpeg_library_dir = [ffmpeg_library_dir, os.path.join(ffmpeg_library_dir, platform_tag)]
        else:
            ffmpeg_include_dir = [ffmpeg_include_dir]
            ffmpeg_library_dir = [ffmpeg_library_dir]

In addition, the verbose hipcc output should be disabled unless it is requested with a command-line parameter, -v or something.
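For anyone who wants to try the idea outside of setup.py, the snippet above can be sketched as a small standalone Python function. This is only an illustration: the name `multiarch_dirs` is hypothetical, and the extra guards (a missing compiler, an empty tag, multi-line output) are assumptions added here, not part of the proposed patch.

```python
import os
import shutil
import subprocess


def multiarch_dirs(base_dir, compiler=None):
    """Return [base_dir], plus base_dir/<multiarch tag> when the compiler reports one."""
    compiler = compiler or os.environ.get("CC") or shutil.which("gcc")
    if compiler is None:
        return [base_dir]
    try:
        result = subprocess.run(
            [compiler, "-print-multiarch"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # Compiler path does not exist or is not executable.
        return [base_dir]
    tag = result.stdout.decode("utf-8", errors="replace").strip()
    # Check the return code first, then reject empty or multi-line output
    # (for example verbose hipcc printouts) instead of using it as a path.
    if result.returncode != 0 or not tag or "\n" in tag:
        return [base_dir]
    return [base_dir, os.path.join(base_dir, tag)]


if __name__ == "__main__":
    print(multiarch_dirs("/usr/include"))
```

On a Debian-based system with gcc this would normally add a second directory such as the x86_64-linux-gnu one mentioned above. On other setups it falls back to the base directory only.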

daniandtheweb commented 2 months ago

I'll try to build with this new code then. It will take a while as I started a clean build this morning.

daniandtheweb commented 2 months ago

Another thing I've noticed with the Python update is that there are multiple deprecation warnings about some packages being eggs instead of wheels. I didn't notice whether the same warning was present in 3.9.

lamikr commented 2 months ago

Yes, I noticed the same. One error I noticed is that this test fails to find migraphx, at least on my 3.11 build.

docs/examples/pytorch/triton_migraph_quantization

I'll try to lower numpy back to 1.26.4 and redo the build to check whether that helps with this problem.

jeroen-mostert commented 2 months ago

I've had trouble running some torchaudio integration tests with numpy 2.0.0 as well. There are still quite a few packages that haven't made the jump yet since np 2 is very fresh; overall compatibility with 1.26.4 is better.

lamikr commented 2 months ago

AMDMIGraphX turned out to be a one-line fix: 3.11 needs to be added to cmake/PythonModules.cmake

set(PYTHON_SEARCH_VERSIONS 3.5 3.6 3.7 3.8 3.9 3.10 3.11)

Triton's install_rocm.sh also needs to be changed to install the 311 version of the wheel instead of the 309 version.

lamikr commented 2 months ago

From what I have read, they claim that if numpy 2.0.0 is installed when you build a package like onnxruntime, then it will be compatible with 2.0.0. But if you build it with numpy 1.26.4 installed and afterwards upgrade numpy to 2.0.0 with pip install, then it will not work.

jeroen-mostert commented 2 months ago

Right, so for building it's probably better to have np 2 installed from the start, and then leave it up to the user to downgrade if they need it for compatibility with other stuff (potentially in a venv).
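As a rough way to express that rule of thumb, here is a small Python sketch. It is a simplification and an assumption, not an official NumPy check: it only compares major versions and treats a package as likely fine when the runtime numpy major version is not newer than the one it was built against. The function names are hypothetical.

```python
def numpy_major(version):
    """Extract the major version number from a string like '1.26.4'."""
    return int(version.split(".")[0])


def likely_compatible(build_version, runtime_version):
    """Rule of thumb: runtime numpy major must not exceed the build-time major."""
    return numpy_major(runtime_version) <= numpy_major(build_version)


if __name__ == "__main__":
    # Built against numpy 2, later downgraded by the user (e.g. in a venv).
    print(likely_compatible("2.0.0", "1.26.4"))
    # Built against numpy 1.x, later upgraded with pip.
    print(likely_compatible("1.26.4", "2.0.0"))
```

Real compatibility also depends on how each package was built, so this is only a quick sanity check before digging into a failing import.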

daniandtheweb commented 2 months ago

I have modified the patch for triton in my PR to fix the install issue. I have also added the patch for AMDMIGraphX adding @lamikr as author and I modified the torchvision patch with the suggested patch in this conversation https://github.com/lamikr/rocm_sdk_builder/issues/66#issuecomment-2198498643.