Closed perestoronin closed 3 years ago
That's WIP ...
I am currently working on the ebuilds already in Gentoo portage...
Again ... WIP =)
Isn't it time to refactor all the ebuilds and relocate the artefacts to the default /opt/rocm instead of struggling with /usr?
Once all artefacts are relocated to /opt/rocm, compiling tensorflow with a rocm flag in its ebuild will be trivial.
We (@candrews and others) spent a lot of time making the installation go somewhere other than "/opt/rocm". Instead of changing all the ebuilds in Gentoo portage and here... I think it would be easier to point tensorflow to the new directories(?).
No, it's terrible and practically impossible to track tensorflow and patch it again and again. It's time to return to /opt/rocm.
Create a patch to change the path and related things to adjust to the installation, and upstream it? I don't think that it is impossible, but the Gentoo maintainer(s) and upstream must be involved.
And if you want the installation to "/opt/rocm", why don't you use the official RPMs from AMD? There is an outdated ebuild "amd-rocm-meta-bin", which you could update. And it seems that ROCm is also prepared for parallel installation of multiple releases, so you would have to adjust to that as well.
I cannot decide this (alone)... All ebuilds in this repository depend on the ebuilds which are already in Gentoo portage and those are installing not to "/opt/rocm".
Anyway... this is off-topic related to this issue.
And if you want the installation to "/opt/rocm" why don't you use official RPMs from AMD?
Binary? No, that's not the Gentoo way.
I looked at how it's done in Arch Linux:
https://github.com/rocm-arch/rocm-arch
and
https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=tensorflow-rocm
and I will try to adjust the Gentoo ebuilds the same way to the path /opt/rocm. If I achieve any success this way, I will let you know.
And installing to "/opt/rocm" from a source based (e)build is not my understanding of Gentoo and FHS ;).
@perestoronin I don't know why you want to revert all this work instead of adjusting tensorflow... In my opinion, this could be something "new".
To keep you informed: I will only accept PRs that change paths to "/opt/rocm" when this is consistent with Gentoo portage. @candrews What is your opinion?
If the rocm overlay and the same ebuilds in portage exist only in the abstract, that is not what is needed to run tensorflow with rocm successfully in Gentoo.
If anybody can patch tensorflow and maintain such a custom tensorflow ebuild, that may be a solution.
But I can't patch tensorflow to compile against the rocm overlay. That is my reason for returning from /usr to /opt/rocm, because /opt/rocm is native for tensorflow.
But maybe we can solve this together? Or what do you think how I made the adjustments for the other ebuilds? It seems @perfinion is the maintainer of "tensorflow" for Gentoo. Maybe we can adjust the tensorflow installation to work with the current Gentoo way installation of ROCm.
It's simple to reproduce my troubles: set -DUSE_ROCM=1 in the current tensorflow ebuild and try. If the maintainer achieves a successful result, all is well; if not, there is no other way than returning to /opt/rocm.
It's Gentoo policy to use directories in a certain way similar to FHS (FHS isn't itself Gentoo policy, but Gentoo policy is very similar to it) so I think we should continue doing that. See https://devmanual.gentoo.org/general-concepts/filesystem/index.html that describes the directories and their usages, including /opt:
The /opt top-level should only be used for applications that do not conform to the standard filesystem layout. This particularly includes prebuilt software packages that expect being installed in a single directory.
All rules need a flexible policy if we want the application to work.
@perestoronin A quick search gives me -> https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/1aa008d15cd02321ba56e67dd6f0788ecaf72347/configure.py#L1334 which uses environment variables.
Please let me know which sources and ebuilds you are using... I will give it a try... It must be solvable... It's open source ;-)
@candrews I agree with that, and that is the reason why I spent so much time getting to the current result.
If ROCM_PATH is set to /usr, tensorflow's build system goes into an infinite loop searching through the include directories under /usr. I couldn't fix this strange behaviour in tensorflow, which is why the simplest way to resolve the trouble is to return to /opt/rocm.
But maybe we can solve this together? Or what do you think how I made the adjustments for the other ebuilds? It seems @perfinion is the maintainer of "tensorflow" for Gentoo. Maybe we can adjust the tensorflow installation to work with the current Gentoo way installation of ROCm.
Yeah, I can easily make tensorflow work with the new paths, that would just be a change during the configure stage. I tried many months ago and there were some deps missing still but maybe now it will work better. The problem is I don't have any hardware to test it with. If I can get help testing it works, I'd love to add support to the TensorFlow gentoo package
I'd love to add support to the TensorFlow gentoo package
Before testing, it needs to compile with TF_NEED_ROCM=1 enabled (see https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/1aa008d15cd02321ba56e67dd6f0788ecaf72347/configure.py#L1334).
But on the current Gentoo ROCm infrastructure, the tensorflow build fails in infinite loops while scanning /usr if ROCM_PATH is set to /usr instead of the native /opt/rocm.
PS. I have hardware (an AMD Vega Frontier GPU on a platform with a 2nd-gen Ryzen CPU) to run the tensorflow Gentoo ebuild for testing.
@perestoronin I think it should be possible to find and solve the infinite loops. Can you provide more information? E.g. a build log and the ebuild itself?
@perfinion If you need ebuilds from this repository as a dependency, then let me know. I will create PRs and help to maintain them. I have a Radeon RX 560 (gfx803), which is supported by ROCm, so I think it should be usable for testing tensorflow.
First, try enabling the flag as in https://gist.github.com/raw/9ac410e6ec6c4129dc2bc27dcf1825a9
Can you provide more information? E.g. a build log
Yes, but preparing the log may take a long compile.
Is it possible to change the tensorflow ebuild to use the system llvm and llvm-roc instead of compiling the same internal llvm again while building the tensorflow ebuild itself?
export TF_NEED_ROCM=1 is not anywhere near enough; there are other variables to set with the search paths too. Set export ROCM_PATH=/usr, or wherever the libs are installed. You'll also want export TF_ROCM_AMDGPU_TARGETS=gfx803, or the target of your hardware.
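The three variables named above could be collected into a small setup script that is sourced before TensorFlow's configure step. This is only a sketch: /usr assumes the Gentoo layout discussed in this thread, gfx803 is just the example target mentioned here, and other search-path variables may still be needed.

```shell
#!/bin/sh
# Sketch: environment for TensorFlow's configure step with ROCm enabled.
# Only the three variables mentioned in this thread are set; more may be required.

export TF_NEED_ROCM=1

# Assumption: ROCm libraries are installed under /usr (Gentoo layout).
# Change this to wherever the libs are actually installed.
export ROCM_PATH=/usr

# Assumption: gfx803 is an example target; use the one matching your GPU.
export TF_ROCM_AMDGPU_TARGETS=gfx803

# Show what the configure step would see.
echo "TF_NEED_ROCM=${TF_NEED_ROCM}"
echo "ROCM_PATH=${ROCM_PATH}"
echo "TF_ROCM_AMDGPU_TARGETS=${TF_ROCM_AMDGPU_TARGETS}"
```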
TensorFlow needs hip (hipcc and hipruntime), miopen, rocblas, rocrand, rocfft, roctracer, hipsparse. How far are we from getting those reviewed and all in the tree from the overlay?
For ROCm 3.8 I had all those ebuilds working. Currently I am working on 3.9 and 3.10, but I think I will skip 3.9 because there was a problem with HIP (a library is not found); hopefully this is or can be solved in 3.10. I will update the missing ebuilds for 3.10 in the evening. If HIP works, we can start reviewing the ebuilds. Or should we start a test with ROCm 3.8?
start a test with ROCm 3.8
3.8 is obsolete.
3.10 contains bug fixes, so I prefer 3.10.
@perestoronin That is my preferred way also.
I started to update to 3.10... Currently I am stuck at:
'sh' '-c' '/var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/rocclr/../bin/hip_embed_pch.sh /var/tmp/portage/sys-devel/hip-3.10.0/work/hip-3.10.0_build/include /var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include /usr/lib/llvm/roc/lib/cmake/llvm /usr'
+ /usr/lib/llvm/roc/lib/cmake/llvm/../../..//bin/clang -O3 --rocm-path=/var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include/.. -std=c++17 -nogpulib -isystem /var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include -isystem /var/tmp/portage/sys-devel/hip-3.10.0/work/hip-3.10.0_build/include -isystem /usr/include --cuda-device-only -x hip /tmp/hip_pch.139/hip_pch.h -E
In file included from /tmp/hip_pch.139/hip_pch.h:1:
In file included from /var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include/hip/hip_runtime.h:60:
In file included from /var/tmp/portage/sys-devel/hip-3.10.0/work/HIP-rocm-3.10.0/include/hip/hcc_detail/hip_runtime.h:39:
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/g++-v10/cmath:45:15: fatal error: 'math.h' file not found
^~~~~~~~
1 error generated when compiling for gfx803.
CMake Error at rocclr/CMakeLists.txt:171 (message):
Failed to embed PCH
To fix this, change the compiler to llvm-roc by exporting CC.
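A minimal sketch of that suggestion, assuming llvm-roc lives under /usr/lib/llvm/roc (the prefix that the clang invocation in the log above resolves to). LLVM_ROC_BIN is a hypothetical helper variable, and setting CXX alongside CC is an assumption on top of the original hint.

```shell
#!/bin/sh
# Sketch: point the build at the llvm-roc compilers instead of the system GCC.
# Assumption: llvm-roc is installed under /usr/lib/llvm/roc, as the log suggests.
LLVM_ROC_BIN=/usr/lib/llvm/roc/bin   # hypothetical helper variable

export CC="${LLVM_ROC_BIN}/clang"
export CXX="${LLVM_ROC_BIN}/clang++"  # assumption: C++ compiler switched as well

echo "CC=${CC}"
echo "CXX=${CXX}"
```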
Ok, I will try that.
@perestoronin Can you provide your changes to the ebuild?
I fixed that by disabling the embedding of PCH.
Now I have the same problem as with the previous version:
-- Check for working CXX compiler: /usr/lib/hip/3.10/bin/hipcc
-- Check for working CXX compiler: /usr/lib/hip/3.10/bin/hipcc - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCXXCompiler.cmake:53 (message):
The C++ compiler
"/usr/lib/hip/3.10/bin/hipcc"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: /var/tmp/portage/dev-libs/rccl-3.10.0/work/rccl-3.10.0_build/CMakeFiles/CMakeTmp
Run Build Command(s):/usr/bin/ninja cmTC_246e9 &&
[1/2] Building CXX object CMakeFiles/cmTC_246e9.dir/testCXXCompiler.cxx.o
FAILED: CMakeFiles/cmTC_246e9.dir/testCXXCompiler.cxx.o
/usr/lib/hip/3.10/bin/hipcc -DNDEBUG --amdgpu-target=gfx803 -o CMakeFiles/cmTC_246e9.dir/testCXXCompiler.cxx.o -c testCXXCompiler.cxx
clang-12: error: cannot find HIP runtime. Provide its path via --rocm-path, or pass -nogpuinc to build without HIP runtime.
clang-12: error: cannot find HIP runtime. Provide its path via --rocm-path, or pass -nogpuinc to build without HIP runtime.
ninja: build stopped: subcommand failed.
Try -DBUILD_TESTS=OFF; it helped me get past this bug. The same is done in https://github.com/rocm-arch/rocm-arch/blob/master/rccl/PKGBUILD
That didn't fix the problem for me.
I also hit this new bug today on my test machines:
clang-12: error: cannot find HIP runtime. Provide its path via --rocm-path...
Try the same with -DCMAKE_CXX_FLAGS="--rocm-path=/opt/rocm-${PV}", but with your path; this trick resolved the same problem for me.
I found this fresh fix at https://github.com/rocm-arch/rocm-arch/issues/468
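As a rough illustration of how the --rocm-path hint could be passed to CMake, the sketch below assembles the arguments as a plain string. ROCM_PREFIX and CMAKE_ARGS are hypothetical names, /usr is only an assumed prefix for the Gentoo layout, and in a real ebuild these options would normally go into the mycmakeargs array instead.

```shell
#!/bin/sh
# Sketch: assemble CMake arguments that tell clang where the HIP runtime lives.
# Assumption: ROCm is installed under /usr; use /opt/rocm-<version> or your own
# prefix if the installation differs.
ROCM_PREFIX=/usr   # hypothetical variable name

# -DBUILD_TESTS=OFF is the other workaround mentioned in this thread for rccl.
CMAKE_ARGS="-DCMAKE_CXX_FLAGS=--rocm-path=${ROCM_PREFIX} -DBUILD_TESTS=OFF"

# Print the resulting command line instead of running it.
echo "cmake ${CMAKE_ARGS} ."
```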
I have almost completed migrating back to /opt/rocm-${PV} :) and I expect to successfully compile tensorflow for rocm on Gentoo with my customized versions of your ebuilds, following the same logic as the AUR PKGBUILDs.
Thanks for the hint. I will try that later.
You are free to use your customized ebuilds, but to get into portage the ebuilds should follow the Gentoo rules.
Once my problems are resolved in Gentoo portage, I will remove my local hardcoded ebuilds, but I needed a working solution yesterday :)
PS. -DCMAKE_CXX_FLAGS="--rocm-path=/opt/rocm-${PV}" is also needed in all sci-libs/* packages from the rocm stack.
WIP:
>>> /opt/rocm-3.10.0/lib/library/
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx900.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx900.hsaco
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx906.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx906.hsaco
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary.dat -> ../../rocblas/lib/library/TensileLibrary.dat
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary_gfx900.co -> ../../rocblas/lib/library/TensileLibrary_gfx900.co
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary_gfx906.co -> ../../rocblas/lib/library/TensileLibrary_gfx906.co
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx1011.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx1011.hsaco
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary_gfx908.co -> ../../rocblas/lib/library/TensileLibrary_gfx908.co
>>> /opt/rocm-3.10.0/lib/library/TensileLibrary_gfx803.co -> ../../rocblas/lib/library/TensileLibrary_gfx803.co
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx908.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx908.hsaco
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx1010.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx1010.hsaco
>>> /opt/rocm-3.10.0/lib/library/Kernels.so-000-gfx803.hsaco -> ../../rocblas/lib/library/Kernels.so-000-gfx803.hsaco
>>> /opt/rocm-3.10.0/lib/librocblas.so.0 -> ../rocblas/lib/librocblas.so.0
--- /usr/
--- /usr/share/
--- /usr/share/doc/
>>> /usr/share/doc/rocBLAS-3.10.0/
>>> /usr/share/doc/rocBLAS-3.10.0/README.md.bz2
>>> /opt/rocm-3.10.0/rocblas/lib/librocblas.so -> librocblas.so.0
>>> /opt/rocm-3.10.0/lib/librocblas.so -> ../rocblas/lib/librocblas.so
>>> sci-libs/rocBLAS-3.10.0 merged.
Nice to see that you are making progress, but this is off-topic here.
The hint with "rocm-path" solved the problem. Thanks for that.
But there are some other problems... Hopefully I get them solved soon...
Describe the problems; maybe I have already got past them, and my recipes may help you as they did above.
sci-libs is finished; work is in progress on some dev-libs and dev-utils ebuilds, and then I will try to compile tensorflow for rocm soon.
I think I have solved the problems, the next step is to create patches.
@perestoronin If you have installed "llvm-roc" to "/opt/..." can you provide the output of: "/opt/[whatever needed here]/bin/clang++ -xhip --rocm-path=[path to rocm] --rocm-device-lib-path=[path to bitcode] main.cpp -v" Please set the path to "clang++", the path to "rocm" and the path to the "bitcode libraries" according to your installation. The main.cpp can be "int main() {}". Would be great, thanks.
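The requested check could be scripted roughly as follows. The three paths are deliberately left as placeholders (CLANGXX, ROCM_DIR and BITCODE_DIR are hypothetical variable names), and the sketch only prints the command so that it can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: prepare the minimal HIP compile test described above.
# All three paths are placeholders; replace them to match your installation.
CLANGXX=/path/to/llvm-roc/bin/clang++   # placeholder
ROCM_DIR=/path/to/rocm                  # placeholder
BITCODE_DIR=/path/to/bitcode            # placeholder

# Create the trivial source file in a scratch directory.
WORKDIR=$(mktemp -d)
printf 'int main() {}\n' > "${WORKDIR}/main.cpp"

# Build the command line; -v makes clang print its search paths.
CMD="${CLANGXX} -xhip --rocm-path=${ROCM_DIR} --rocm-device-lib-path=${BITCODE_DIR} ${WORKDIR}/main.cpp -v"

# Print the command; run it by hand once the paths are filled in.
echo "${CMD}"
```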
TensorFlow needs hip (hipcc and hipruntime), miopen, rocblas, rocrand, rocfft, roctracer, hipsparse. How far are we from getting those reviewed and all in the tree from the overlay?
I have skipped ROCm 3.9 and 3.10, and I think we should start or proceed with ROCm 4.0. Currently I have updated a few ebuilds to ROCm 4.0, but "rocBLAS" (which is needed by miopen) has a problem... I'm working on that. But we could start reviewing the other ebuilds. I think there is some work to do.
For any discussion about ebuilds for ROCm 4.0 ... -> https://github.com/justxi/rocm/issues/177
Due to the fact that I have updated all ebuilds for ROCm 4.0.0, I will skip 3.9/3.10.