Build issue with DeepSpeed

mritunjaymusale commented 3 weeks ago

When building DeepSpeed I get the following error, even though I have it installed on my system.

mritunjaymusale commented 3 weeks ago

As per this comment I ran the command to see what permissions I have for kfd

lamikr commented 3 weeks ago

Hmm, that's a new error for me, and I believe the /dev/kfd permissions are also right.

Are you able to check that things other than DeepSpeed work, for example by running

source /opt/rocm_sdk_611/bin/env_rocm.sh
rocminfo

And if that works, then run some of the example test apps to check that the other parts work:


cd /opt/rocm_sdk_611/docs/examples/opencl/hello_world
make
./hello_world

cd /opt/rocm_sdk_611/docs/examples/hipcc/hello_world
./build.sh

cd /opt/rocm_sdk_611/docs/examples/pytorch
python pytorch_gpu_simple_test.py

python test_torch_migraphx_resnet50.py

jupyter-notebook pytorch_amd_gpu_intro.ipynb
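
The checks above can also be run through a small wrapper that reports which step fails. This is only a sketch: `run_check` is a hypothetical helper and not part of rocm_sdk_builder, and the example calls in the comments assume that env_rocm.sh has already been sourced in the same terminal.

```shell
# Hypothetical helper (not part of rocm_sdk_builder):
# run one command quietly and report PASS/FAIL under a label.
run_check() {
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# Example use, after `source /opt/rocm_sdk_611/bin/env_rocm.sh`:
#   run_check rocminfo rocminfo
#   cd /opt/rocm_sdk_611/docs/examples/opencl/hello_world
#   run_check opencl_build make
#   run_check opencl_run ./hello_world
```

The wrapper hides the command output, so rerun a failing step by hand to see the actual error.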

mritunjaymusale commented 3 weeks ago

I skipped the libaio-devel check using export DS_BUILD_AIO=0. It builds, but I am not sure if that is the recommended way of building that package.
I have also noticed another bug-like situation where the GPU clocks stay stuck at high even after the PyTorch process is killed.

mritunjaymusale commented 3 weeks ago

With regards to testing, I was not able to run the hipcc example since it gave me an OpenCL error. But I was able to run PyTorch on my old repos from my postgrad days, and it worked fine on those.

lamikr commented 3 weeks ago

Thanks for reporting the libaio issue, I will try to reproduce it on Fedora 40. I will need to set up an NVMe on a USB hard drive so that I can switch quickly between different distros on my desktop without relying on VMs, to test this on Fedora 40 properly.

The onnxruntime and DeepSpeed packages will definitely need more testing and probably other kinds of tweaking as well; I integrated them only recently, as the last two packages to be built into the system. I also have at least one issue where I need to check whether it is fixed in their upstream development version, and if not I will file a bug for them.

About those OpenCL tests, did you run the "source /opt/rocm_sdk_611/bin/env_rocm.sh" command in that terminal before running those tests? This is my output, for example, in a new terminal:


[lamikr@localhost ~]$ cd /opt/rocm_sdk_611/docs/examples/opencl/hello_world
[lamikr@localhost hello_world]$ source /opt/rocm_sdk_611/bin/env_rocm.sh
[lamikr@localhost hello_world]$ make
hipcc -g -lOpenCL -o hello_world hello_world.cpp
[lamikr@localhost hello_world]$ ./hello_world 
number of opencl platform devices: 1
Platform id: 0
Number of devices found for platform: 1
context created
program loaded
kernel created
Parameters set
Command queue created for kernel invocation requests
Kernels set to kernel command queue
OpenCL test program success, all 200 values read correctly to krnl_result_arr parameter
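
One quick way to tell whether the environment script took effect in a terminal is to check where a tool such as hipcc resolves from. The sketch below assumes that sourcing env_rocm.sh puts /opt/rocm_sdk_611/bin on PATH (the build log in this thread calls /opt/rocm_sdk_611/bin/hipcc, but the exact variables the script sets are not shown here); `tool_from_sdk` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if <tool> resolves to a path under <sdk>.
# Assumption: env_rocm.sh adds the SDK's bin directory to PATH.
tool_from_sdk() {
    tool="$1"
    sdk="${2:-/opt/rocm_sdk_611}"
    path=$(command -v "$tool") || return 1
    case "$path" in
        "$sdk"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use:
#   tool_from_sdk hipcc || echo "hipcc is not coming from the SDK; source env_rocm.sh first"
```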

mritunjaymusale commented 3 weeks ago

> Thanks for reporting the libaio issue, I will try to reproduce it on Fedora 40. I will need to set up an NVMe on a USB hard drive so that I can switch quickly between different distros on my desktop without relying on VMs, to test this on Fedora 40 properly.

I would recommend using the "toolbox" app from Fedora. It is basically podman minus the sudo nonsense that Docker requires, so hopefully you can quickly set up and tear down a Fedora environment for testing builds.

> About those OpenCL tests, did you run the "source /opt/rocm_sdk_611/bin/env_rocm.sh" command in that terminal before running those tests? This is my output, for example, in a new terminal:

I think I did, but right now I have borked my Python packages. I will try your suggestion again once I have rebuilt and reinstalled ROCm.

One last thing I would like you to suggest to others: when they install new Python packages for their own projects, ask them to stick with a venv in order to avoid messing with ROCm's Python packages, at least until this repo becomes stable and close to a one-click install.
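
A minimal sketch of that venv setup follows. It assumes that, after sourcing env_rocm.sh, `python3` is the SDK's interpreter, so that `--system-site-packages` exposes the SDK's torch while newly installed packages land inside the venv. `make_project_venv` and the `PYTHON` override are hypothetical names used only for illustration.

```shell
# Hypothetical helper: create a per-project venv that can still see the
# interpreter's own site-packages (assumed here to be the ROCm SDK's).
make_project_venv() {
    venv_dir="$1"
    "${PYTHON:-python3}" -m venv --system-site-packages "$venv_dir"
}

# Example use:
#   source /opt/rocm_sdk_611/bin/env_rocm.sh
#   make_project_venv "$HOME/venvs/my_project"
#   . "$HOME/venvs/my_project/bin/activate"
#   pip install <project-specific packages>
```

If an experiment goes wrong, deleting the venv directory removes the extra packages without touching anything under /opt/rocm_sdk_611.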

mritunjaymusale commented 3 weeks ago

This is what it looks like after doing what you said for the tests: it is unable to find /usr/bin/ld. This is starting to seem like a common pattern to me, here as well as with libaio, where the package is installed but the compiler can't pick it up.

lamikr commented 3 weeks ago

I found out that f6e807687e is breaking clean builds for me: the build is trying to link against a wrong version of some .so files. I wonder whether it could be related. I am investigating this now.

lamikr commented 3 weeks ago

Found a fix for the OpenCL error. The Makefiles of both OpenCL examples need the following linker flag so that libOpenCL.so is found. On my own machine I had another OpenCL library installed as well, and that is why it worked.

-L${ROCM_HOME}/lib64
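
As an illustration of how that flag fits into the compile line, the sketch below prints the linker flags only when libOpenCL.so is actually present in the SDK's lib64 directory. `opencl_link_flags` is a hypothetical helper, and falling back to /opt/rocm_sdk_611 when ROCM_HOME is unset is an assumption.

```shell
# Hypothetical helper: print the OpenCL linker flags if the SDK's
# libOpenCL.so exists, otherwise complain on stderr and fail.
opencl_link_flags() {
    rocm_home="${1:-${ROCM_HOME:-/opt/rocm_sdk_611}}"
    if [ -e "$rocm_home/lib64/libOpenCL.so" ]; then
        printf '%s\n' "-L$rocm_home/lib64 -lOpenCL"
    else
        echo "libOpenCL.so not found under $rocm_home/lib64" >&2
        return 1
    fi
}

# Example use:
#   hipcc -g -o hello_world hello_world.cpp $(opencl_link_flags)
```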

Now I am doing a clean Fedora 40 build and so far things look good. Fixes are in pull request:

mritunjaymusale commented 3 weeks ago

I was able to build the branch with the newest commit, your above-mentioned steps worked, and I got the same output:

➜  hello_world ./hello_world 
number of opencl platform devices: 1
Platform id: 0
Number of devices found for platform: 1
context created
program loaded
kernel created
Parameters set
Command queue created for kernel invocation requests
Kernels set to kernel command queue
OpenCL test program success, all 200 values read correctly to krnl_result_arr parameter
➜  hello_world

 ➜  hello_world ./build.sh

rm -f ./hello_world
rm -f hello_world.o
rm -f /opt/rocm_sdk_611/src/*.o
/opt/rocm_sdk_611/bin/hipcc -g -fPIE   -c -o hello_world.o hello_world.cpp
/opt/rocm_sdk_611/bin/hipcc hello_world.o -fPIE -o hello_world
./hello_world
 System minor: 1
 System major: 10
 Agent name: AMD Radeon RX 5700 XT
Kernel input: GdkknVnqkc
Expecting that kernel increases each character from input string by one
Kernel output string: HelloWorld
Output string matched with HelloWorld
Test ok!
➜  hello_world

➜  pytorch python pytorch_gpu_simple_test.py
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
tensor([0.3061], device='cuda:0')
➜  pytorch

➜  pytorch python test_torch_migraphx_resnet50.py

hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
Got 1 acc subgraphs and 0 non-acc subgraphs
/opt/rocm_sdk_611/lib/python3.9/site-packages/torch_migraphx/fx/mgx_module.py:101: UserWarning: Input x not on gpu device. Copying to device before execution, however, this will add extra overhead if running a performance benckmark.
  warnings.warn(
tensor([[ 7.7740, -0.9590,  8.6869,  ...,  3.5344,  0.0704, 12.6225],
        [ 8.9159, -0.9358,  7.9358,  ...,  4.2870,  0.8405, 12.9530]],
       device='cuda:0')
➜  pytorch

mritunjaymusale commented 3 weeks ago

Is there a way to re-copy the files to the /opt/rocm... folder after everything is built, in case I manually nuke it after messing up the Python dependencies? That way I could test across various PyTorch projects easily without having to spend 6 hours compiling everything from scratch each time.

lamikr commented 3 weeks ago

I was able to reproduce the libaio issue on Fedora 40, and I do not know yet why it fails to find the libaio.so file in /usr/lib64 on Fedora. On Mageia 9 the same build works just fine, even though Mageia does not ship any extra CMake config files for it.
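
When debugging this kind of "installed but not found" problem, it can help to list which libaio files exist and where. The sketch below uses a hypothetical helper, `find_shared_lib`, and the directories in the example call are assumptions about common library locations. Linking with -laio normally needs the unversioned libaio.so name, not just a versioned file such as libaio.so.1.

```shell
# Hypothetical helper: print every file whose name starts with <libname>
# in the given directories; fail if nothing matches.
find_shared_lib() {
    libname="$1"
    shift
    found=1
    for dir in "$@"; do
        for f in "$dir"/"$libname"*; do
            [ -e "$f" ] || continue
            echo "$f"
            found=0
        done
    done
    return "$found"
}

# Example use:
#   find_shared_lib libaio.so /usr/lib64 /usr/lib /usr/lib/x86_64-linux-gnu
```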

I will use your suggested fix and disable libaio until a better fix is found. I pushed a lot of changes to the bg57 branch. I opened a couple of new issues (53-57), and they are now in the same pull request. I will need to do a full clean Ubuntu build first before merging the changes in.

mritunjaymusale commented 3 weeks ago

I think that to install the compiled files from builddir you might have to refactor the build/build.sh file, since func_build_all() not only builds but also installs. That function could be split into two, where one does all the builds and the other only does the installs. This would make it possible to nuke and reinstall ROCm as and when needed from a successful build.
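
The shape of that split might look roughly like the sketch below. Everything in it is a stand-in: `build_one`, `install_one`, `func_build_only`, `func_install_only` and `func_build_and_install` are hypothetical names, and the real build/build.sh is not reproduced here. The combined function still installs each project right after building it, on the assumption that later projects need the earlier ones to be installed already.

```shell
# Illustrative stand-ins only; the real per-project steps live in build/build.sh.
build_one()   { echo "build $1"; }
install_one() { echo "install $1"; }

# Build every project without installing anything.
func_build_only() {
    for project in "$@"; do
        build_one "$project" || return 1
    done
}

# Install every project from an already finished build.
func_install_only() {
    for project in "$@"; do
        install_one "$project" || return 1
    done
}

# Combined behaviour: build and install project by project.
func_build_and_install() {
    for project in "$@"; do
        build_one "$project" && install_one "$project" || return 1
    done
}
```

With this shape, a wiped install directory could be repopulated by calling only the install function over the project list.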

lamikr commented 3 weeks ago

Yes, it should be split. In theory you could re-install everything by deleting these files under each project in builddir:

.result_install
.result_postinstall

cd builddir
find -name .result_install | xargs -- rm -f
find -name .result_postinstall | xargs -- rm -f
cd ..
./babs.sh -b

Then it should run only the install and post-install steps for each project. I have not really tried it, but it should mostly just work.

One thing to fix is to move the "pip install" commands from the pre-config phase of 039_01_pytorch.binfo to the install phase of another binfo file that is run before it. Maybe I should just create a separate dependencies.binfo file for that, and then have the pip install executed in the install command instead of in preconfig.
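
Since this is untested, a more cautious variant of the commands above is to list the marker files first and delete them only on request. The sketch below wraps that in a hypothetical function, `clear_install_markers`; only the two marker file names come from this thread.

```shell
# Hypothetical helper: list the install markers under <builddir>;
# with "delete" as the second argument, also remove them.
clear_install_markers() {
    builddir="$1"
    mode="${2:-list}"
    if [ "$mode" = "delete" ]; then
        find "$builddir" -type f \
            \( -name .result_install -o -name .result_postinstall \) \
            -print -exec rm -f {} +
    else
        find "$builddir" -type f \
            \( -name .result_install -o -name .result_postinstall \) -print
    fi
}

# Example use, from the top of the source tree:
#   clear_install_markers builddir          # dry run, only lists
#   clear_install_markers builddir delete   # remove the markers
#   ./babs.sh -b
```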

If you have time to check, that would be nice.

lamikr commented 3 weeks ago

And btw, nice to see that the example apps worked for you.

Sometimes I just stop the build at a certain phase, like 020, and then make a tar copy of the /opt/rocm_sdk directory as it is at that point:

cd /opt
tar -cvf rocm_sdk_611_phase_020.tar rocm_sdk_611

And if I need to restore it, I just delete the builddir files for projects newer than 020, restore the backup version of the rocm_sdk dir first, and then run babs.sh -b.
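
The backup and restore steps can be sketched as a pair of functions. `snapshot_sdk` and `restore_sdk` are hypothetical names, and the archive is written next to the SDK directory using the same naming as the tar command above. Note that the restore removes the current SDK directory before unpacking, and that cleaning the newer projects out of builddir is still a manual step.

```shell
# Hypothetical helper: archive <sdk_dir> as <name>_phase_<phase>.tar
# in the parent directory of <sdk_dir>.
snapshot_sdk() {
    sdk_dir="$1"
    phase="$2"
    [ -n "$sdk_dir" ] && [ -n "$phase" ] || return 1
    parent=$(dirname "$sdk_dir")
    name=$(basename "$sdk_dir")
    tar -C "$parent" -cf "$parent/${name}_phase_${phase}.tar" "$name"
}

# Hypothetical helper: replace <sdk_dir> with the contents of that archive.
restore_sdk() {
    sdk_dir="$1"
    phase="$2"
    [ -n "$sdk_dir" ] && [ -n "$phase" ] || return 1
    parent=$(dirname "$sdk_dir")
    name=$(basename "$sdk_dir")
    archive="$parent/${name}_phase_${phase}.tar"
    if [ ! -f "$archive" ]; then
        echo "missing archive: $archive" >&2
        return 1
    fi
    rm -rf "$sdk_dir"
    tar -C "$parent" -xf "$archive"
}

# Example use:
#   snapshot_sdk /opt/rocm_sdk_611 020
#   restore_sdk /opt/rocm_sdk_611 020
```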

lamikr / rocm_sdk_builder

Build issue with DeepSpeed #52