justxi / rocm

Ebuilds to install ROCM on Gentoo Linux

Please bump rocm to 3.10.0 - and how to adjust tensorflow to use it? #174

Closed perestoronin closed 3 years ago

perestoronin commented 3 years ago

ROCm 3.10 Latest Dec 3, 2020

justxi commented 3 years ago

That's WIP ...

justxi commented 3 years ago

I am currently working on the ebuilds already in Gentoo portage...

justxi commented 3 years ago

Again ... WIP =)

perestoronin commented 3 years ago

Again ... WIP =)

Isn't it time to refactor all the ebuilds and relocate the artifacts to the default /opt/rocm, instead of dealing with the troubles of /usr?

Once all artifacts are relocated to /opt/rocm, compiling tensorflow and adding a rocm flag to its ebuild will be trivial.

justxi commented 3 years ago

We (@candrews and others) spent a lot of time making the installation not go to "/opt/rocm". Instead of changing all the ebuilds in Gentoo portage and here... I think it would be easier to point tensorflow to the new directories(?).

perestoronin commented 3 years ago

We (@candrews and others) spent a lot of time making the installation not go to "/opt/rocm". Instead of changing all the ebuilds in Gentoo portage and here... I think it would be easier to point tensorflow to the new directories(?).

No, it's terrible and impossible to track tensorflow and patch it again and again. It's time to return to /opt/rocm.

justxi commented 3 years ago

Create a patch to change the path and related things to adjust to the installation, and upstream it? I don't think that it is impossible, but the Gentoo maintainer(s) and upstream must be involved.

And if you want the installation to "/opt/rocm", why don't you use the official RPMs from AMD? There is an outdated ebuild "amd-rocm-meta-bin", which you could update. And it seems that ROCm is also prepared for parallel installation of multiple releases, so you would have to adjust to that as well.

I cannot decide this (alone)... All ebuilds in this repository depend on the ebuilds which are already in Gentoo portage and those are installing not to "/opt/rocm".

Anyway... this is off-topic related to this issue.

perestoronin commented 3 years ago

And if you want the installation to "/opt/rocm", why don't you use the official RPMs from AMD?

Binary? No, that's not the Gentoo way.

I see how it's done in Arch Linux:

https://github.com/rocm-arch/rocm-arch

and

https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=tensorflow-rocm

and I will try to adjust the Gentoo ebuilds to the path /opt/rocm in the same way. If I have any success with this approach, I will let you know.

justxi commented 3 years ago

And if you want the installation to "/opt/rocm", why don't you use the official RPMs from AMD?

Binary? No, that's not the Gentoo way.

And installing to "/opt/rocm" from a source based (e)build is not my understanding of Gentoo and FHS ;).

justxi commented 3 years ago

@perestoronin I don't know why you want to revert all this work instead of adjusting tensorflow... In my opinion, this could be something "new".

To keep you informed: I will only accept PRs with path changes to "/opt/rocm" when this is consistent with Gentoo portage. @candrews What is your opinion?

perestoronin commented 3 years ago

If the rocm overlay and the same ebuilds in portage exist only in the abstract, that is not what is needed to run tensorflow with rocm successfully on Gentoo.

If anybody can patch tensorflow and maintain such a custom tensorflow ebuild, that may be a solution.

But I can't patch tensorflow to compile with the rocm overlay. That is my reason for returning from /usr to /opt/rocm, because /opt/rocm is native for tensorflow.

justxi commented 3 years ago

But maybe we can solve this together? Or what do you think how I made the adjustments for the other ebuilds? It seems @perfinion is the maintainer of "tensorflow" for Gentoo. Maybe we can adjust the tensorflow installation to work with the current Gentoo way installation of ROCm.

perestoronin commented 3 years ago

But maybe we can solve this together? Or what do you think how I made the adjustments for the other ebuilds? It seems @perfinion is the maintainer of "tensorflow" for Gentoo. Maybe we can adjust the tensorflow installation to work with the current Gentoo way installation of ROCm.

It's simple: try to reproduce my troubles by setting -DUSE_ROCM=1 in the current tensorflow ebuild. If the maintainer gets a successful result, all is fine; if not, there is no other way than returning to /opt/rocm.

candrews commented 3 years ago

It's Gentoo policy to use directories in a certain way similar to FHS (FHS isn't itself Gentoo policy, but Gentoo policy is very similar to it) so I think we should continue doing that. See https://devmanual.gentoo.org/general-concepts/filesystem/index.html that describes the directories and their usages, including /opt:

The /opt top-level should only be used for applications that do not conform to the standard filesystem layout. This particularly includes prebuilt software packages that expect being installed in a single directory.

perestoronin commented 3 years ago

It's Gentoo policy to use directories in a certain way similar to FHS (FHS isn't itself Gentoo policy, but Gentoo policy is very similar to it) so I think we should continue doing that. See https://devmanual.gentoo.org/general-concepts/filesystem/index.html that describes the directories and their usages, including /opt:

The /opt top-level should only be used for applications that do not conform to the standard filesystem layout. This particularly includes prebuilt software packages that expect being installed in a single directory.

All rules need a flexible policy if we want to make the application work.

justxi commented 3 years ago

@perestoronin A quick search gives me -> https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/1aa008d15cd02321ba56e67dd6f0788ecaf72347/configure.py#L1334 which uses environment variables.

Please let me know which sources and ebuilds you are using... I will give it a try... It must be solvable... It's open source ;-)

@candrews I agree with that, and that is the reason why I spent so much time getting to the current result.

perestoronin commented 3 years ago

@perestoronin A quick search gives me -> https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/1aa008d15cd02321ba56e67dd6f0788ecaf72347/configure.py#L1334 which uses environment variables.

Please let me know which sources and ebuilds you are using... I will give it a try...

@candrews I agree with that, and that is the reason why I spent so much time getting to the current result.

If ROCM_PATH is set to /usr, the tensorflow build system goes into an infinite loop searching through the include directories under /usr. I couldn't fix this strange behavior in tensorflow, which is why the simplest way to resolve the trouble is to return to /opt/rocm.

https://bugs.gentoo.org/705712

perfinion commented 3 years ago

But maybe we can solve this together? Or what do you think how I made the adjustments for the other ebuilds? It seems @perfinion is the maintainer of "tensorflow" for Gentoo. Maybe we can adjust the tensorflow installation to work with the current Gentoo way installation of ROCm.

Yeah, I can easily make tensorflow work with the new paths, that would just be a change during the configure stage. I tried many months ago and there were some deps missing still but maybe now it will work better. The problem is I don't have any hardware to test it with. If I can get help testing it works, I'd love to add support to the TensorFlow gentoo package

perestoronin commented 3 years ago

I'd love to add support to the TensorFlow gentoo package

Before testing, it needs to be compiled with TF_NEED_ROCM=1 enabled, see https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/1aa008d15cd02321ba56e67dd6f0788ecaf72347/configure.py#L1334

But on the current rocm Gentoo infrastructure, the tensorflow build fails in infinite loops while scanning /usr if ROCM_PATH is set to /usr instead of the native /opt/rocm.

PS. I have hardware (an AMD Vega Frontier GPU on a platform with a 2nd-gen Ryzen CPU) to run the tensorflow Gentoo ebuild for testing.

justxi commented 3 years ago

@perestoronin I think it should be possible to find and solve the infinite loops. Can you provide more information? E.g. a build log and the ebuild itself?

@perfinion If you need ebuilds from this repository as a dependency, then let me know, I will create PRs and help to maintain them. I have a Radeon RX 560 (gfx803), it is supported by ROCm, so I think it should be usable to test tensorflow.

perestoronin commented 3 years ago

@perestoronin I think it should be possible to find and solve the infinite loops. Can you provide more information? E.g. a build log and the ebuild itself?

First, try switching the flag on: https://gist.github.com/raw/9ac410e6ec6c4129dc2bc27dcf1825a9

perestoronin commented 3 years ago

Can you provide more information? E.g. a build log

Yes, but preparing the log may take a long compile.

Is it possible to change the tensorflow ebuild to use the system llvm and llvm-roc, instead of compiling the same internal llvm again while building the tensorflow ebuild itself?

perfinion commented 3 years ago

@perestoronin I think it should be possible to find and solve the infinite loops. Can you provide more information? E.g. a build log and the ebuild itself?

First, try switching the flag on: https://gist.github.com/raw/9ac410e6ec6c4129dc2bc27dcf1825a9

export TF_NEED_ROCM=1 is not anywhere near enough; there are other vars to set for the search paths too. Set export ROCM_PATH=/usr, or wherever the libs are installed. You'll also want export TF_ROCM_AMDGPU_TARGETS=gfx803, or the target of your hardware.
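
As a rough sketch of the settings named above (the values are only the examples from this thread; ROCM_PATH and the GPU target must match your own installation and hardware), the environment for the configure stage might look like this:

```shell
# Enable the ROCm backend in TensorFlow's configure step.
export TF_NEED_ROCM=1
# Prefix where the ROCm libraries are installed (/usr is the example given above).
export ROCM_PATH=/usr
# GPU target to build for; gfx803 is the example given above.
export TF_ROCM_AMDGPU_TARGETS=gfx803

# Show the values the configure stage will see.
env | grep -E '^(TF_NEED_ROCM|ROCM_PATH|TF_ROCM_AMDGPU_TARGETS)=' | sort
```

This only covers the three variables mentioned here; the configure script may need further settings.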

@perfinion If you need ebuilds from this repository as a dependency, then let me know, I will create PRs and help to maintain them. I have a Radeon RX 560 (gfx803), it is supported by ROCm, so I think it should be usable to test tensorflow.

TensorFlow needs hip (hipcc and hipruntime), miopen, rocblas, rocrand, rocfft, roctracer, hipsparse. How far are we from getting those reviewed and all in the tree from the overlay?

justxi commented 3 years ago

@perfinion If you need ebuilds from this repository as a dependency, then let me know, I will create PRs and help to maintain them. I have a Radeon RX 560 (gfx803), it is supported by ROCm, so I think it should be usable to test tensorflow.

TensorFlow needs hip (hipcc and hipruntime), miopen, rocblas, rocrand, rocfft, roctracer, hipsparse. How far are we from getting those reviewed and all in the tree from the overlay?

For ROCm 3.8 I had all those ebuilds working. Currently I am working on 3.9 and 3.10, but I think I will skip 3.9 because there was a problem with HIP (a library is not found); hopefully this is or can be solved in 3.10. I will update the missing ebuilds for 3.10 in the evening. If HIP works, we can start reviewing the ebuilds. Or do we start a test with ROCm 3.8?

perestoronin commented 3 years ago

start a test with ROCm 3.8

3.8 is obsolete.

3.10 contains bug fixes, so I prefer 3.10.

justxi commented 3 years ago

@perestoronin That is my preferred way also.

justxi commented 3 years ago

I started to update to 3.10... Currently I am stuck at:

'sh' '-c' '/var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/rocclr/../bin/hip_embed_pch.sh /var/tmp/portage/sys-devel/hip-3.10.0/work/hip-3.10.0_build/include /var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include /usr/lib/llvm/roc/lib/cmake/llvm /usr'
+ /usr/lib/llvm/roc/lib/cmake/llvm/../../..//bin/clang -O3 --rocm-path=/var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include/.. -std=c++17 -nogpulib -isystem /var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include -isystem /var/tmp/portage/sys-devel/hip-3.10.0/work/hip-3.10.0_build/include -isystem /usr/include --cuda-device-only -x hip /tmp/hip_pch.139/hip_pch.h -E
In file included from /tmp/hip_pch.139/hip_pch.h:1:
In file included from /var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include/hip/hip_runtime.h:60:
In file included from /var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include/hip/hcc_detail/hip_runtime.h:39:
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/g++-v10/cmath:45:15: fatal error: 'math.h' file not found
#include_next <math.h>
              ^~~~~~~~
1 error generated when compiling for gfx803.
CMake Error at rocclr/CMakeLists.txt:171 (message):
  Failed to embed PCH

perestoronin commented 3 years ago

To fix this, change the compiler to llvm-roc via export CC.

justxi commented 3 years ago

Ok, I will try that.

justxi commented 3 years ago

@perestoronin Can you provide your changes to the ebuild?

justxi commented 3 years ago

I fixed that by disabling the embedding of PCH.

Now I have the same problem as with the previous version:

-- Check for working CXX compiler: /usr/lib/hip/3.10/bin/hipcc
-- Check for working CXX compiler: /usr/lib/hip/3.10/bin/hipcc - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCXXCompiler.cmake:53 (message):
  The C++ compiler

    "/usr/lib/hip/3.10/bin/hipcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /var/tmp/portage/dev-libs/rccl-3.10.0/work/rccl-3.10.0_build/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/ninja cmTC_246e9 && [1/2] Building CXX object CMakeFiles/cmTC_246e9.dir/testCXXCompiler.cxx.o
FAILED: CMakeFiles/cmTC_246e9.dir/testCXXCompiler.cxx.o 
/usr/lib/hip/3.10/bin/hipcc    -DNDEBUG --amdgpu-target=gfx803 -o CMakeFiles/cmTC_246e9.dir/testCXXCompiler.cxx.o -c testCXXCompiler.cxx
clang-12: error: cannot find HIP runtime. Provide its path via --rocm-path, or pass -nogpuinc to build without HIP runtime.
clang-12: error: cannot find HIP runtime. Provide its path via --rocm-path, or pass -nogpuinc to build without HIP runtime.
ninja: build stopped: subcommand failed.

perestoronin commented 3 years ago

Try -DBUILD_TESTS=OFF. It helped me get past this bug. Or do the same as in https://github.com/rocm-arch/rocm-arch/blob/master/rccl/PKGBUILD

justxi commented 3 years ago

That didn't fix the problem for me.

perestoronin commented 3 years ago

That didn't fix the problem for me.

I also hit this new bug today on my test machines:

clang-12: error: cannot find HIP runtime. Provide its path via --rocm-path...

Try something like -DCMAKE_CXX_FLAGS="--rocm-path=/opt/rocm-${PV}", but with your path. This trick resolved the same problem for me.

I found this fresh fix at https://github.com/rocm-arch/rocm-arch/issues/468

I have almost completed the migration back to /opt/rocm-${PV} :) and I expect to compile tensorflow for rocm on Gentoo successfully with my customized versions of your ebuilds, following the same logic as the AUR PKGBUILDs.

justxi commented 3 years ago

Thanks for the hint. I will try that later.

You are free to use your customized ebuilds, but to get into portage the ebuilds should follow the Gentoo rules.

perestoronin commented 3 years ago

Thanks for the hint. I will try that later.

You are free to use your customized ebuilds, but to get into portage the ebuilds should follow the Gentoo rules.

Once my problems are resolved in Gentoo portage, I will remove my local hardcoded ebuilds, but I needed a working solution yesterday :)

PS. -DCMAKE_CXX_FLAGS="--rocm-path=/opt/rocm-${PV}" is also needed in all sci-libs/* packages from the rocm stack.
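
A minimal sketch of how that flag could be passed to a plain CMake call. ROCM_PREFIX and CXX_FLAGS_ARG are hypothetical variable names used only for illustration; set ROCM_PREFIX to wherever the HIP runtime lives on your system. The snippet only assembles and prints the command, and it does not show how the flag would be wired into an ebuild:

```shell
# Hypothetical variable: prefix of the ROCm/HIP installation (assumed default /usr).
ROCM_PREFIX="${ROCM_PREFIX:-/usr}"

# The CMake cache entry that tells clang where to find the HIP runtime.
CXX_FLAGS_ARG="-DCMAKE_CXX_FLAGS=--rocm-path=${ROCM_PREFIX}"

# Print the resulting command so it can be checked before it is run.
echo "cmake ${CXX_FLAGS_ARG} .."
```

Note that setting CMAKE_CXX_FLAGS this way replaces any other C++ flags given through that variable, so in a real build the flag may need to be appended instead.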

perestoronin commented 3 years ago

WIP:

>>> /opt/rocm-3.10.0/lib/library/
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx900.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx900.hsaco
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx906.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx906.hsaco
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary.dat -> ../../rocblas/lib/library/TensileLibrary.dat
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary_gfx900.co -> ../../rocblas/lib/library/TensileLibrary_gfx900.co
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary_gfx906.co -> ../../rocblas/lib/library/TensileLibrary_gfx906.co
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx1011.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx1011.hsaco
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary_gfx908.co -> ../../rocblas/lib/library/TensileLibrary_gfx908.co
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary_gfx803.co -> ../../rocblas/lib/library/TensileLibrary_gfx803.co
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx908.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx908.hsaco
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx1010.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx1010.hsaco
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx803.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx803.hsaco
>>> /opt/rocm-3.10.0/lib/librocblas.so.0 -> ../rocblas/lib/librocblas.so.0
--- /usr/
--- /usr/share/
--- /usr/share/doc/
>>> /usr/share/doc/rocBLAS-3.10.0/
>>> /usr/share/doc/rocBLAS-3.10.0/README.md.bz2
>>> /opt/rocm-3.10.0/rocblas/lib/librocblas.so -> librocblas.so.0
>>> /opt/rocm-3.10.0/lib/librocblas.so -> ../rocblas/lib/librocblas.so
>>> sci-libs/rocBLAS-3.10.0 merged.

justxi commented 3 years ago

Nice to see that you are making progress, but this is off-topic here.

The hint with "rocm-path" solved the problem. Thanks for that.

But there are some other problems... Hopefully I get them solved soon...

perestoronin commented 3 years ago

But there are some other problems... Hopefully I get them solved soon...

Describe the problems. Maybe I have already got past them, and my recipes can help you as above?

sci-libs is finished; it is WIP on some ebuilds in dev-libs and dev-utils, and then I will try to compile tensorflow for rocm soon.

justxi commented 3 years ago

I think I have solved the problems, the next step is to create patches.

justxi commented 3 years ago

@perestoronin If you have installed "llvm-roc" to "/opt/..." can you provide the output of: "/opt/[whatever needed here]/bin/clang++ -xhip --rocm-path=[path to rocm] --rocm-device-lib-path=[path to bitcode] main.cpp -v" Please set the path to "clang++", the path to "rocm" and the path to the "bitcode libraries" according to your installation. The main.cpp can be "int main() {}". Would be great, thanks.
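
The request above could be scripted roughly as follows. This is only a sketch: hip_probe is a hypothetical helper name, and its three arguments (the clang++ binary, the rocm path, the bitcode library path) are deliberately left for you to fill in according to your installation:

```shell
# Minimal test program, as described above.
printf 'int main() {}\n' > main.cpp

# Hypothetical helper: $1 = path to clang++, $2 = path to rocm, $3 = path to bitcode libraries.
hip_probe() {
    "$1" -xhip --rocm-path="$2" --rocm-device-lib-path="$3" main.cpp -v
}

# Usage: call hip_probe with your three paths and capture both output streams,
# since the -v output of clang goes to stderr:
#   hip_probe <clang++> <rocm path> <bitcode path> > probe.log 2>&1
```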

justxi commented 3 years ago

@perfinion If you need ebuilds from this repository as a dependency, then let me know, I will create PRs and help to maintain them. I have a Radeon RX 560 (gfx803), it is supported by ROCm, so I think it should be usable to test tensorflow.

TensorFlow needs hip (hipcc and hipruntime), miopen, rocblas, rocrand, rocfft, roctracer, hipsparse. How far are we from getting those reviewed and all in the tree from the overlay?

I have skipped ROCm 3.9 and 3.10, and I think we should start or proceed with ROCm 4.0. Currently I have updated a few ebuilds to ROCm 4.0, but "rocBLAS" (which is needed by miopen) has a problem... I'm working on that. But we could start reviewing the other ebuilds. I think there is some work to do.

justxi commented 3 years ago

For any discussion about ebuilds for ROCm 4.0 ... -> https://github.com/justxi/rocm/issues/177

justxi commented 3 years ago

Due to the fact that I have updated all ebuilds for ROCm 4.0.0, I will skip 3.9/3.10.