hipBLASLt builds for gfx90a?

jeroen-mostert commented 3 months ago

So there's

#CFG_TEMP1=-DAMDGPU_TARGETS="${SEMICOLON_SEPARATED_GPU_TARGET_LIST_DEFAULT}"
#CFG_TEMP1=-DAMDGPU_TARGETS="all"
CFG_TEMP1=-DAMDGPU_TARGETS="gfx90a"

What's the reason we end up building hipBLASLt for gfx90a only, and specifically? It seems like this renders the thing fairly useless (but it still takes many minutes to build). ~Does anything else depend on it?~ Yes, apparently Pytorch does, but then it's not clear why that would benefit from a gfx90a-specific build.

lamikr commented 3 months ago

There are couple of apps which does not build for all gpus because their kernels are using the matrix insrictions that are available either only on MI200-300 level of GPUs or in Navi11 cards,

It should be possible to write alternaive kernels for those projects that uses other instructions instead but that's not yet possible. Composable kernel has also something similar in newest codebase that does not support older cards. (look include/grids from developer branch)

I am not sure what is the best approach, maybe instead of direct assembly some of those kernels can be written with triton language which in theory has suppor also for older navi cards. (Have not really tested)

jeroen-mostert commented 3 months ago

Yes, I've stumbled across something like this when I wanted to use the profiler to get hardware counters for occupancy. This doesn't work on anything that's not Instinct either, because AMD thinks the winning strategy is product segmentation. :P

OK, so hipBLASLt won't work on anything but the specific arch it supports, but then why are we building it (and/or having Pytorch take a dependency on it) if that arch is not a configured target? Getting it to build for our targets is a much more challenging task (if it can be done at all, nobody benefits from an implementation that technically works but has pathological perf), but as long as that's not done yet it seems like building this (and everything else that only works on a handful of arches) is just needlessly adding to the already prodigious build time. The people who do have Instinct cards to play with are probably not using this build at all to begin with, as they have AMD's proprietary builds to work with.

jeroen-mostert commented 3 months ago

For a local build I've had easy results by just renaming every yaml file but one to yaml_. A substantial reduction in build time at no loss of functionality... well, for me of course.

lamikr commented 3 months ago

@jeroen-mostert Maybe something which use similar type gpu filtering idea I just implemented for aotriton could help?

https://github.com/lamikr/rocm_sdk_builder/issues/128

jeroen-mostert commented 3 months ago

For aotriton, if you don't have a gfx90a or gfx11+ you're probably better off... not having aotriton and explicitly building PyTorch without flash attention support, since you won't have any working support anyway -- but support for removing the dependency this way hasn't landed in mainstream yet (https://github.com/AngryLoki/pytorch/commit/38d6f115c65b67946de6765165ece0288c8416d4). Alternate implementations that attempt to provide some measure of support could be used, but we're not there yet either.

For hipBLASLt I guess the current situation is as good as it gets: it builds for gfx90a, since that's what it supports. Further filtering isn't really possible, unless it can build for gfx000 (the generic fallback), haven't tested that yet. Obviously my crude method of neutering even the gfx90a implementation just to get the build time down is not ideal for anyone except people who are annoyed by long build times, I don't suggest we implement that as some sort of switch :P

If we do manage to include support for more arches in hipBLASLt, though, gating the arches is definitely worth it, because Tensile (lite) takes such an extremely long time to brute force its way through all combinations. When this can be avoided, it easily pays off to do so.

Of course the real solution (IMO) is to have a proper JIT compiler/profiler so Tensile can just go away or at least kick in only for relevant operations on relevant devices. Some AMD engineers may be working on this in their copious spare time...

lamikr commented 3 months ago

I addded new benchmark script that will run same test with default, math, flashattention and memory efficient algorithm enabled and measure the time.

docs/examples/pytorch/flash_attention/flash_attention_dot_product_benchmark.py

And then on new benchmarks directory there is a script that I use to collect the results.

I have now run and added couple of benchmarks results with gfx1030, 1035, 1102 and 1103 both with the 2.3.1 and 2.4.0 python version. But I still need to do still couple of builds to have all data that will be useful for comparison. Mainly I am interested to check is there any differences in results on these scenarios:

1) gfx1102 and gfx1103

benchmark with pytorch 2.3.1
benchmark with pytorch 2.4.1 with without tuning data for gfx1102 and 1103
benchmark with pytorch 2.4.1 with tuning data for gfx1102 and 1103

2) gfx1030 and 1035

benchmark with 2.3.1
benchmark with 2.4.0 with old aotriton
- benchmark with 2.4.0 with new aotriton which has the gfx110x support

3) I also updated and made fixes to cpu_vs_gpu benchmark. Now it's fully doing the operations either in cpu memory or gpu memory. The results for gfx1035 compared to cpu are suspiciously good on that small benchmark that does matrix multiply and additions.

Do you have time to check what you get on gfx1036 and also to double check that there are no any mistakes in my benchmark.

Benchmarking CPU and GPUs
Pytorch version: 2.4.1-rc1
ROCM HIP version: 6.1.40093-e0d934acc
       Device: cpu-16
    'CPU time: 26.592 sec
       Device: AMD Radeon Graphics
    'GPU time: 0.704 sec
Benchmark ready

jeroen-mostert commented 3 months ago

I tested it on my PyTorch 2.3.1 instance and unadjusted build just for fun. It throws up a bunch of errors that FA is not built for the GPU scenario (as expected) but then the CPU scenario is clearly suspect, because it only uses a single core. Obviously you're going to get really bad results with that and this cannot be considered representative of comparing it to a GPU implementation, so maybe an additional layer of parallelization is needed for a fair comparison (ideally managed wholly by Torch, of course). Unless this is one of those things where you can't meaningfully parallelize it anyway because it needs to go through all the data sequentially by the nature of the operation, in which case disregard all that. :P

It would also need to print the actual CPU name/model, not just how many cores it has, because things like AVX-512 perf make a huge difference in PyTorch speed (as shown by the recently released Zen 5 procs).

For a real test, I would be interested if ExLlamav2 can use it. Last time I checked the answer is "no", because the flash attention built by AMD is an ancient version (2.0.something) that doesn't work, not even if the version check is overridden, but there may be another real world application we can use for demonstration.

Just to clarify before I kick off that long, lonely build: which branch are we building?

lamikr commented 3 months ago

I could not figure out how to get the cpu model name via the pytorch via.

jeroen-mostert commented 3 months ago

There's a py-cpuinfo module that gets this info from /proc/cpuinfo. It doesn't need any compiled code so it's highly compatible (though I agree it would make a lot of sense if PyTorch just exposed this info itself, at least whether or not it was compiled with/is using AVX-512, since it's pretty relevant).

lamikr commented 2 months ago

I ended up implementing the cpu-check directly to tests. Strangely the pytorch has also active cpuinfo project at https://github.com/pytorch/cpuinfo project but I think it's not integrated to python or pytorch itself...

Are you able to run the benchmarks with your 1036? I have now re-run the benchmarks for the 1030/35/1102 and 1103 and the new aotriton really makes difference for 1102 and 1103. I have done the tests on kernel 6.11-rc2.

With bigger sequence size on flashattention test, I was seeing sometimes gpu-hangups (that kernel recovered) on gfx1103 but not on gfx1035.

jeroen-mostert commented 2 months ago

Since the other branches have been cleaned up I'm assuming master is good to go now. I'll kick off an overnight build for that and see what we get. In terms of parameters it would be good to see if we can replicate llama-like scenarios, since it has its own FA implementation (which, however, is lacking in performance for AMD devices). That does involve cranking sequence sizes (and heads, since real models typically use 64 or 128).

lamikr commented 2 months ago

Yes, master should be now good to got. I just kicked clean gfx1010/rx 5700 build with good old amd 3700 to verify everything one more time on older devices.

jeroen-mostert commented 2 months ago

Alas, my overnight build failed; I'm getting a head-scratching error in configure for ucx. Of course it would be tempting to think this is due to the latest patch I myself authored for it, but that doesn't appear to be the case, it's in a completely different area of the code.

The issue was fixed when I nuked the source directory and reran babs.sh -i. Prior to this, I had done babs.sh -f, -co, -ap and -fs and rm -rf builddir. It seems that under circumstances it's possible for cruft to remain this way that can interfere with a build. Of course now that I've reset the source dir I don't have a good way to repro the issue anymore, it's just a note. If I had to speculate, it's that the automake/autoconf family leaves files that get ignored by git (as they should be) but then not get cleaned up by any of the babs steps. It's harder to make in-tree builds robust.

jeroen-mostert commented 2 months ago

I can now replicate the DeepSpeed build error. I can confirm this is not a memory issue, my 96 GiB machine isn't even close to being out of memory when it happens. I suspect there's some problem with dependencies going on where a command is run before its prerequisites have finished building, or else intermediate files are deleted before a command that's needed can get at them, but I'll have to dig some more to confirm it.

jeroen-mostert commented 2 months ago

Ugh. Well, the problem doesn't occur consistently, and moreover neither DeepSpeed nor Torch do anything special that would lead me to suspect the problem is there, so it might be something in distutils, or somewhere even more subtle like the threadpool code itself. At this point I'm not even willing to guarantee the problem definitely will never occur with a single-core build, it would presumably just be much harder to replicate if it exists.

jeroen-mostert commented 2 months ago

Anyhoo, after the build, the benchmark fails to work on gfx1035, choking immediately on the .rand() call:

Pytorch version: 2.4.1-rc1
dot product calculation test
tensor([[[-0.1232,  1.2125,  0.0293,  0.0863, -0.2193,  0.6728,  0.0273,
           0.1507],
         [-0.0771,  2.5381,  0.0028, -0.2773,  0.0395,  0.2234, -0.6021,
           0.4878],
         [ 0.6627,  2.3891, -0.2701,  0.1629, -0.4375,  0.3109, -0.4178,
           0.0445]],

        [[-0.0027, -0.4438,  0.0323,  1.0016,  0.4956,  0.6640, -0.1734,
           1.5455],
         [-0.6252, -0.2658,  0.2918,  0.1089,  1.6547, -0.6152, -0.6624,
           0.6817],
         [ 0.1177, -0.4313, -0.0034,  1.2671,  0.5264,  1.0305, -0.1866,
           2.2272]]], device='cuda:0')

Benchmarking cuda and cpu with Default, Math, Flash Attention amd Memory pytorch backends
Device: AMD Radeon RX 6800 XT / cuda:0
    Default benchmark:
        3927.763 microseconds, 0.0039277630997821686 sec
    SDPBackend.MATH benchmark:
        2927.495 microseconds, 0.002927494869294131 sec
    SDPBackend.FLASH_ATTENTION benchmark:
        3930.010 microseconds, 0.003930010483600199 sec
    SDPBackend.EFFICIENT_ATTENTION benchmark:
        3919.468 microseconds, 0.003919467550197926 sec
Traceback (most recent call last):
  File "/home/jeroen/rocm_sdk_builder/docs/examples/pytorch/flash_attention/flash_attention_dot_product_benchmark.py", line 95, in <module>
    query = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=dev_type_item, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Another oddity is that neither CUDA_VISIBLE_DEVICES nor ROCR_VISIBLE_DEVICES has any effect on what Torch sees. Device 1 (the iGPU) remains visible despite these; only manually overriding dev_type_arr works.

I note that the benchmark contains no verification code whatsoever, which makes me disinclined to believe the results as-is. We should probably include a single verification step against the CPU code using no FA in every cycle, rather than invoking it on completely a random tensor every time and assuming the result is sane. No prizes for doing the wrong thing very many times per second.

I don't have the time to troubleshoot this further at the moment since running this on the iGPU is not a priority for me to begin with, though I can run additional commands and relay the output if it helps.

jeroen-mostert commented 2 months ago

Welp, disregard ~all~ most of the above (the benchmark validation stuff still applies). Of course I should have built it for the gfx1036, not the gfx1035. It's confusing that I'm no longer seeing any errors that support for gfx1036 was not compiled, so apparently some support was, but without kernels this will of course not work.

Time for another long and lonely build, since AFAIK the only way to reliably change things when you've changed the device list is to start over from scratch.

This makes me think putting some effort into detecting the system GPUs and using them as defaults in the selection may be worthwhile, because especially for iGPUs it's 1) hard to know your actual model number and 2) super easy to make a mistake since they're so close. Unfortunately this may be a chicken-and-egg problem since there seems to be no easy, distro-independent way of getting the gfx* identifiers without having built the basic AMD tools.

lamikr commented 2 months ago

Alas, my overnight build failed; I'm getting a head-scratching error in configure for ucx. Of course it would be tempting to think this is due to the latest patch I myself authored for it, but that doesn't appear to be the case, it's in a completely different area of the code.

The issue was fixed when I nuked the source directory and reran babs.sh -i. Prior to this, I had done babs.sh -f, -co, -ap and -fs and rm -rf builddir. It seems that under circumstances it's possible for cruft to remain this way that can interfere with a build. Of course now that I've reset the source dir I don't have a good way to repro the issue anymore, it's just a note. If I had to speculate, it's that the automake/autoconf family leaves files that get ignored by git (as they should be) but then not get cleaned up by any of the babs steps. It's harder to make in-tree builds robust.

My clean build on gfx1010/rx 3700 worked until stopped on final part on deepspeed. Restarted the build and deepspeed also finished.

lamikr commented 2 months ago

Welp, disregard ~all~ most of the above (the benchmark validation stuff still applies). Of course I should have built it for the gfx1036, not the gfx1035. It's confusing that I'm no longer seeing any errors that support for gfx1036 was not compiled, so apparently some support was, but without kernels this will of course not work.

Time for another long and lonely build, since AFAIK the only way to reliably change things when you've changed the device list is to start over from scratch.

This makes me think putting some effort into detecting the system GPUs and using them as defaults in the selection may be worthwhile, because especially for iGPUs it's 1) hard to know your actual model number and 2) super easy to make a mistake since they're so close. Unfortunately this may be a chicken-and-egg problem since there seems to be no easy, distro-independent way of getting the gfx* identifiers without having built the basic AMD tools.

Yes, it would require building at least the rocminfo or amd-smi first. And it would help if the menu-creation utility used would allow adding the "additional descriptive text" in addition of the gfx1* values. So there could be "AMD RX 6800/gfx1030"

Btw, I have put the printout warning about missing CO-files under the environment variable.

patches/rocm-6.1.2/clr/0002-ROCM_SDK_DEBUG-environment-variable-to-print-warning.patch

jeroen-mostert commented 2 months ago

Even reconfiguring and after a rebuild with cleaning out the builddir, it still fails with "invalid device function". The ROCM_SDK_DEBUG indeed gives lots of warnings about not having kernels for gfx1036 (in libtorch_hip and libmagma). There may be some caching going on somewhere, or maybe an in-source build wasn't properly cleaned up. At this point I'm probably looking at yet another build, this time with /opt completely clean and all the sources reinitialized as well. This will be a while as I have other things to take care of.

lamikr commented 2 months ago

Hmm, are any of the simpler examples working for pytorch in docs/examples/pytorch directory?

Or just to narrow down, how about the more basic examples which doent require whole stack in docs/examples/hipcc or docs/examples/opencl? Maybe some part of the code is missing the gfx1036 definition...

pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp is the location where the pytorch does lot of checks whether it can use the flash-attention or memory efficient algorithms but it sounds like the problem for you happens somewhere earlier.

Mika

jeroen-mostert commented 2 months ago

Yes, to be clear, the function it's failing on is just the initial call to torch.rand, it doesn't get to any of the FA stuff. If that doesn't work there's little hope of anything else. Basic stuff like llama.cpp (which needs only rocBLAS/hipBLAS) is working. The "hello world" example doesn't work either, failing with an "invalid device function" on torch.isfinite. It looks like something went wrong later in the stack.

lamikr commented 2 months ago

Hmm, not sure where the missing part of code is. Maybe I should do the 1036 build and then use hsa_override for 1035 to fake it to be 1036. I found that amdmigraph was missing 1036 support on it's patch... Not sure could that help on this issue, but I will push that change in. I also noticed that many python projects were still missing the --force-reinstall flag for pip. Lack of that causes that packages does not get reinstalled automatically if doing small changes and rebuilding. I will push that change also soon once test build finishes.

At the moment the patches are here: https://github.com/lamikr/rocm_sdk_builder/pull/138

And for amdsmi-project I have now much better patch to solve the pyroch double-free problem. Better to rebuild binfo/core/013_02_amdsmi.binfo also once I have pushed the change. (I have tested that it's rebuild does not require rebuilding pytorch as pytorch uses the dynamic library from amdsmi)

lamikr commented 2 months ago

Hmm, not sure where the missing part of code is. Maybe I should do the 1036 build and then use hsa_override for 1035 to fake it to be 1036. I found that amdmigraph was missing 1036 support on it's patch... Not sure could that help on this issue, but I will push that change in. I also noticed that many python projects were still missing the --force-reinstall flag for pip. Lack of that causes that packages does not get reinstalled automatically if doing small changes and rebuilding. I will push that change also soon once test build finishes.

And for amdsmi-project I have now much better patch to solve the pyroch double-free problem. Better to rebuild binfo/core/013_02_amdsmi.binfo also once I have pushed the change. (I have tested that it's rebuild does not require rebuilding pytorch as pytorch uses the dynamic library from amdsmi)

Ugh. Well, the problem doesn't occur consistently, and moreover neither DeepSpeed nor Torch do anything special that would lead me to suspect the problem is there, so it might be something in distutils, or somewhere even more subtle like the threadpool code itself. At this point I'm not even willing to guarantee the problem definitely will never occur with a single-core build, it would presumably just be much harder to replicate if it exists.

At least I have not even found any clear error message. It will just fail.

lamikr commented 2 months ago

hmm, "pip --force-reinstall" causes problems. I think the command for pytorch-video stupidly uninstalled the own build pytorch and then installed the nvidia version from internet...

Instead of pip install --force-reinstall "$latest_wheel_file" this seems to work better to just install latest version of locally build wheels.

pip uninstall --yes "$latest_wheel_file"
pip install "$latest_wheel_file"

lamikr commented 2 months ago

pip issues are now fixed and pushed to upstream

lamikr commented 2 months ago

Have you found any reason for your failures? I pushed some improvements to ./babs.sh -co to handle the submodules reset/updates on same command, That could affect to pytorch if old source version was 2.3.1 and new one is 2.4.x.

jeroen-mostert commented 2 months ago

I'm starting to think all the subcommands of babs are basically something you should never use unless you really are a "power user". :P For example, I got a bunch of errors on babs.sh -co. I was a bit mystified until I realized that I hadn't done a babs.sh -f before, meaning my sources were not up to date, meaning -co tried to checkout tags that weren't there. As there is realistically no way to tell, with 80 projects, if the source or the checkout tags have changed, nobody should be doing that unless they're actively developing and they actually know the changes.

Then I did a fetch and got a bunch of new, exciting messages for pytorch:

warning: 01ac329aacf5ad74689d216418f000f698d37105:.gitmodules, multiple configurations found for 'submodule.third_party/pybind11.ignore'. Skipping second one!
warning: 01ac329aacf5ad74689d216418f000f698d37105:.gitmodules, multiple configurations found for 'submodule.third_party/pybind11.path'. Skipping second one!
warning: 01ac329aacf5ad74689d216418f000f698d37105:.gitmodules, multiple configurations found for 'submodule.third_party/pybind11.url'. Skipping second one!
warning: 01ac329aacf5ad74689d216418f000f698d37105:.gitmodules, multiple configurations found for 'submodule.third_party/eigen.ignore'. Skipping second one!
warning: 01ac329aacf5ad74689d216418f000f698d37105:.gitmodules, multiple configurations found for 'submodule.third_party/eigen.path'. Skipping second one!
warning: 01ac329aacf5ad74689d216418f000f698d37105:.gitmodules, multiple configurations found for 'submodule.third_party/eigen.url'. Skipping second one!

After completely nuking the directory and doing -i the source seemed to be OK again, in that no more unexpected messages occurred, but now of course I don't trust the rest anymore either, so I'll just start over. I'll let you know what happens with the gfx1036 support.

jeroen-mostert commented 2 months ago

Runtime error building hipBLASlt:

Compiling source kernels: Done. (17.0 secs elapsed)
Traceback (most recent call last):
  File "/home/jeroen/rocm_sdk_builder/builddir/025_02_hipBLASLt/library/../virtualenv/lib/python3.11/site-packages/Tensile/bin/TensileCreateLibrary", line 43, in <module>
    TensileCreateLibrary()
  File "/home/jeroen/rocm_sdk_builder/builddir/025_02_hipBLASLt/virtualenv/lib/python3.11/site-packages/Tensile/TensileCreateLibrary.py", line 60, in wrapper
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeroen/rocm_sdk_builder/builddir/025_02_hipBLASLt/virtualenv/lib/python3.11/site-packages/Tensile/TensileCreateLibrary.py", line 1465, in TensileCreateLibrary
    LibraryIO.write(masterFile, Utils.state(fullMasterLibrary), args.LibraryFormat)
  File "/home/jeroen/rocm_sdk_builder/builddir/025_02_hipBLASLt/virtualenv/lib/python3.11/site-packages/Tensile/LibraryIO.py", line 60, in write
    writeMsgPack(filename_noExt + ".dat", data)
  File "/home/jeroen/rocm_sdk_builder/builddir/025_02_hipBLASLt/virtualenv/lib/python3.11/site-packages/Tensile/LibraryIO.py", line 82, in writeMsgPack
    msgpack.pack(data, f)
    ^^^^^^^
NameError: name 'msgpack' is not defined
make[2]: *** [library/CMakeFiles/TENSILE_LIBRARY_TARGET.dir/build.make:74: Tensile/library/TensileManifest.txt] Error 1
make[2]: *** Deleting file 'Tensile/library/TensileManifest.txt'
make[2]: Leaving directory '/home/jeroen/rocm_sdk_builder/builddir/025_02_hipBLASLt'
make[1]: *** [CMakeFiles/Makefile2:249: library/CMakeFiles/TENSILE_LIBRARY_TARGET.dir/all] Error 2
make[1]: Leaving directory '/home/jeroen/rocm_sdk_builder/builddir/025_02_hipBLASLt'
make: *** [Makefile:166: all] Error 2
build failed: hipBLASLt

After sourcing the ROCm dir and issuing a pip install msgpack the build continues. Oddly there seems to be no earlier import error.

jeroen-mostert commented 2 months ago

After all that we have a liftoff -- of sorts (I shifted the CPU around to be first in the test):

Pytorch version: 2.4.1-rc1
dot product calculation test
tensor([[[-0.4069,  0.4298,  0.1421,  0.2981, -0.6793, -0.6672,  0.8160,
           0.9350],
         [-0.0467,  0.3639,  0.3358, -0.3055, -2.6535, -0.5998,  1.0980,
          -0.0314],
         [-0.6545,  0.5696, -0.2677,  0.5366, -2.0820, -0.0511,  0.2166,
          -0.0807]],

        [[ 0.2281, -1.0593, -0.0645, -0.5195, -0.3865, -0.3296, -0.3891,
          -0.5951],
         [ 0.2904, -1.0216, -0.0712, -0.7817, -0.1596, -0.4118, -0.3010,
          -0.5868],
         [ 0.7638, -0.6250,  0.5318, -1.1072,  0.7062, -1.0600, -1.1464,
           0.1632]]])

Benchmarking cuda and cpu with Default, Math, Flash Attention amd Memory pytorch backends
Device:  AMD Ryzen 5 7600 6-Core Processor / cpu
    Default benchmark:
        1743619.913 microseconds, 1.7436199129988381 sec
    SDPBackend.MATH benchmark:
        1578972.417 microseconds, 1.5789724169990222 sec
    SDPBackend.FLASH_ATTENTION benchmark:
        1730642.304 microseconds, 1.7306423039990477 sec
    SDPBackend.EFFICIENT_ATTENTION benchmark:
    SDPBackend.EFFICIENT_ATTENTION cpu is not supported. See warnings for reasons.
Device: AMD Radeon RX 6800 XT / cuda:0
    Default benchmark:
        3918.979 microseconds, 0.0039189792333975985 sec
    SDPBackend.MATH benchmark:
        2927.380 microseconds, 0.002927380217337539 sec
    SDPBackend.FLASH_ATTENTION benchmark:
        3925.807 microseconds, 0.003925806716506486 sec
    SDPBackend.EFFICIENT_ATTENTION benchmark:
        3892.296 microseconds, 0.0038922958499824738 sec
Device: AMD Radeon Graphics / cuda:1
    Default benchmark:

The benchmark never completes; after 6 minutes I terminated the process. I could see the iGPU was under 100% load the entire time, but either things were stuck in a loop or else performance is pathological. While the gfx1036 has only 2 CUs (to the gfx1030's 72) it clearly should not be this slow. No errors were reported in the kernel log.

Modifying the test to only use the iGPU produces something interesting:

Pytorch version: 2.4.1-rc1
dot product calculation test
tensor([[[0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.]]], device='cuda:1')

So even the initial dot product test fails, albeit with a zero result rather than a hang. Some more testing reveals that torch.randn fails as well; rather than generating random numbers, the tensors remain empty. Tracing with AMD_LOG_LEVEL=3 revealed no obvious smoking gun.

I don't really intend to investigate this further (at least not any time soon, it would be on the "if I'm bored" list, but that list is in no real danger of getting processed :P)

lamikr commented 2 months ago

I'm starting to think all the subcommands of babs are basically something you should never use unless you really are a "power user". :P For example, I got a bunch of errors on babs.sh -co. I was a bit mystified until I realized that I hadn't done a babs.sh -f before, meaning my sources were not up to date, meaning -co tried to

For me the ./babs.sh -f , ./babs.sh -co and ./babs.sh -ap works usually ok, only problem I noticed when jumping between the pytorch 2.3.1 and 2.4.1. But I agree that "-co" should do the "-f" automatically for the project if it fails to checkout.

I would love to have somekind of automatic command in git to restore itself and submodules to original state deleting all extra files. Badically doing for the git repo: "rm -rf " "git --reset --hard" and "git submodule update --init --recursive". I do not want to add the "rm -rf " to babs.sh because if something would go wrong...

warning: 01ac329aacf5ad74689d216418f000f698d37105:.gitmodules, multiple configurations found for 'submodule.third_party/eigen.ignore'. Skipping second one!

hmm, that's the stuff I do not know in git...

lamikr commented 2 months ago

Device: AMD Radeon RX 6800 XT / cuda:0 Default benchmark: 3918.979 microseconds, 0.0039189792333975985 sec SDPBackend.MATH benchmark: 2927.380 microseconds, 0.002927380217337539 sec SDPBackend.FLASH_ATTENTION benchmark: 3925.807 microseconds, 0.003925806716506486 sec SDPBackend.EFFICIENT_ATTENTION benchmark: 3892.296 microseconds, 0.0038922958499824738 sec

Your results seems to be sligtly better what I get from my rx 6800. benchmarks/results/rocm_sdk_612/pytorch_241/gfx1030/20240809_231740_pytorch_dot_products.txt

The benchmark never completes; after 6 minutes I terminated the process. I could see the iGPU was under 100% load the entire time, but either things were stuck in a loop or else performance is pathological. While the gfx1036 has only 2 CUs (to the gfx1030's 72) it clearly should not be this slow. No errors were reported in the kernel log.

So only 2 CU's on gfx1036? gfx1035 at least have 12.

Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 2048(0x800) KB
Chip ID: 5761(0x1681)
ASIC Revision: 2(0x2)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 2400
BDFID: 1024
Internal Node ID: 1
Compute Unit: 12
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4

Modifying the test to only use the iGPU produces something interesting:

Pytorch version: 2.4.1-rc1
dot product calculation test
tensor([[[0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0.]]], device='cuda:1')

Hmm, I went through all patches to add gfx1011/12/31/32 support and noticed that AMDMIGraphX was missing the gfx1036 on one supported gpu list. I pushed change to that now in updated patches list that is now in pull-request until I verify with my gpu's that I did not break anything.

So even the initial dot product test fails, albeit with a zero result rather than a hang. Some more testing reveals that torch.randn fails as well; rather than generating random numbers, the tensors remain empty. Tracing with AMD_LOG_LEVEL=3 revealed no obvious smoking gun.

That's fine. Btw. would you be able to test the audio examples I pushed yesterday to git-repo. Especially it would be interesting to know whether playback and microphone recording works on ffmpeg7. I used pulse as a target device, as that was listed by "ffmpeg -devices" on my distros even though they in reality uses now pipewire.

I initially made the playback and microphone recording/playback tests on Fedora 40 which has pipewire and ffmpeg6 and they kind of worked ok. Originally pytorch audio supported only the mac for real playback, but I added small patch that added pulse support as a playback device.

Only problem I noticed was that the playback of this small "./speech.wav" was always cutted little to early. (So I could hear about 2 seconds from 3 second speech). I noticed that if I do the playback with ffmpeg from console, it procuced the same results. I think the command I used from terminal to testing was (fmpeg -i speech.wav -f pulse "pulse")

On Mageia where I have ffmpeg5/pipewire, the playback does not work similar way than on the fedora40. I only hear the original speech.wav that I play first and then the last effected version. All other playback's are silent.

And the other example which records couple of time from microphone and then plays that back with some effects applied does also not work on Mageia. It complaints that it could not open file "-" that I give as a second parameter in addition of "-f pulse". There seems to be differences between ffmpeg versions of accepting parameters which are also otherway mystical and not well documented. Only semihelpful thing that I found by googling was

https://discuss.pytorch.org/t/how-to-use-use-streamreader-in-linux/175582 In the long run, I think SDL2 could be better choice for audio record and playback than ffmpeg for pytorch audio in Linux. I checked that it should not be too hard to implement that,

jeroen-mostert commented 2 months ago

I would love to have somekind of automatic command in git to restore itself and submodules to original state deleting all extra files. Badically doing for the git repo: "rm -rf " "git --reset --hard" and "git submodule update --init --recursive". I do not want to add the "rm -rf " to babs.sh because if something would go wrong...

Something like git reset --hard ; git submodule deinit -f --all; git clean -fdx. Disclaimer: I'm not a git expert either, but especially git clean -fdx is pretty damn thorough.

So only 2 CU's on gfx1036? gfx1035 at least have 12.

Yes, this is not a typo. The iGPUs on Ryzen 4+ desktop processors have been nerfed to the point where they are good enough for basic display purposes, and compute only if you're desperate. On mobile it's a different story.

As for the audio stuff, we should probably open a new issue/discussion at this point, this one is getting bloated enough as it is. :P My original concern for building hipBLASLt has been adequately answered and the remainder mostly concerns pytorch.

lamikr commented 2 months ago

I planned to close this but just noticed that pytorch 2.4.1rc branch has now fix for the hibBLASLt problem :-) So I pushed the update to pytorch githash to get that fix included. (pytorch commit 7e0ef343b0f0214a3ba6b620cfe7ad59bfd9afab)

Selected backend can be controlled by api call torch.backends.cuda.preferred_blas_library(backend="cublaslt"). On rocm-environment the cublas/cublaslt is intrerpreted as a hipblas/hipblaslt.

If the users gpu support has not been build into hipblaslt, it should now automatically then fall back to hipblas. So to my understanding with that fix applied, the hipblaslt is not anymore a hard requirement and we should add the check to hipblaslt binfo to filter away from the build list all gpu's which are not supported. And if the list would be empty after that, then we would build the hipblaslt with options:

-DBUILD_WITH_TENSILE=OFF -DAMDGPU_TARGETS=""

--> pytorch on gfx1030 would fall back to using hipBLAS.

I am trying to create some tests now to verify this both on old pytorch without fix and with the new pytorch with the fix included. I quess, I need to build also the hipblaslt with this tensile=off option to be able to verify this.

lamikr commented 2 months ago

Ok, I was not able to do a "dummy" hipBLASLt build without support for any gpu included. (-DBUILD_WITH_TENSILE=OFF -DAMDGPU_TARGETS="" option did not work as the code had dependencies for Tensilelite headers and other stuff and when investigating the problem I could not figure out simple fix not requiring to add patches in many different places)

So, current version will now filter away the user selected gpu's which are not supported by the hipBLASLt and if the list is after that empty, it will then build for gfx90a only and disable the building of examples and benchmarks.

I tested that the latest pytorch version works now ok by switching between hipBLAS and hipBLASLt backend depending whether the hipBLASLt has support for your gpu or not. (Default being cuBLAS/hipBLAS)

I also dropped now some patches which made hipBLASLt to accept some other gpus which were anyway not clearly supported. It will help the maintenance of hipBLASLt patches much easier when updating to rocm 6.2.0 in future.

I think we can now close this one.

jeroen-mostert commented 2 months ago

I concur. With this and the other patches for (ao)triton, compile times have gone way down since the build is much more targeted.

I will probably not get back to troubleshooting the gfx1036 issues; I think I'd prefer to leave these until and if someone who also wants to use gfx1036 complains. For me it was more of a testing thing, and the conclusion of my tests was that leveraging the iGPU of desktop Zen 4 is simply not worth it, in almost every scenario. Even when it works properly it will usually not outpace a proper multi-threaded CPU implementation (and this is on the lowest entry CPU that has an iGPU, the 7600), and using it in a multi-GPU setup with a discrete GPU is almost a non-starter since it becomes a massive bottleneck.

lamikr / rocm_sdk_builder

hipBLASLt builds for gfx90a? #123