justxi / rocm

Ebuilds to install ROCM on Gentoo Linux

Degraded ETH hashrate on Vega 56 GPU between ROCm stack 2.10 and 3.+ #172

Closed by perestoronin 4 years ago

perestoronin commented 4 years ago

Please help me recover the 20%+ of hashrate in tests on a Vega 56 that was lost when upgrading the ROCm stack from 2.10 to 3.*

This tool does not work with the new stack and new drivers: https://github.com/Eliovp/amdmemorytweak/issues/46

justxi commented 4 years ago

I think this is a question for -> https://github.com/RadeonOpenCompute/ROCm/issues ?

perestoronin commented 4 years ago

Added: https://github.com/RadeonOpenCompute/ROCm/issues/1246

The same issue occurs not only with ROCm 3.*, but also with the new amdgpu-pro :( https://github.com/Eliovp/amdmemorytweak/issues/46

a-repko commented 4 years ago

@perestoronin I'm using older versions of the ROCm stack as well (each version has different problems, so I had to pick the one that works best for my purposes). I noticed that these versions were removed from the Gentoo repository, so I decided to collect all the relevant ebuilds and make them available to others as well (versions 2.10, 3.0, 3.1, 3.3, 3.5, 3.7 and 3.8).

Here you are: ebuilds and, for your convenience, distfiles (30 MiB; llvm-roc is excluded as too large, but emerge can download it automatically anyway).

I'm also attaching the ebuilds here: rocm_ebuilds.tar.gz. You will need to replace the corresponding subdirectories in /var/db/repos/gentoo/ (if you keep the various ...-9999.ebuild files, emerge will complain). This approach is quite crude, but it should hopefully work.

BTW: problems with memory overclocking can also be related to the kernel driver, because you don't actually need the ROCm stack to do it. rocm-smi seems to be just a Python script that communicates directly with the kernel (through its /sys interface).
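
To illustrate the point about the /sys interface, here is a minimal sketch that reads the memory clock levels without any ROCm tooling. It assumes the amdgpu driver exposes a text file `pp_dpm_mclk` under `/sys/class/drm/<card>/device/` with lines like `2: 800Mhz *`, where the asterisk marks the active level. The card name `card0`, the sample text and the helper names are assumptions made for illustration; this is not how rocm-smi itself is implemented.

```python
import re
from pathlib import Path

# One level per line, e.g. "2: 800Mhz *" (the "*" marks the active level).
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*MHz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return a list of (index, mhz, is_active) tuples parsed from the file text."""
    levels = []
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if match:
            levels.append(
                (int(match.group(1)), int(match.group(2)), match.group(3) is not None)
            )
    return levels


def read_mclk_levels(card="card0"):
    """Read memory clock levels for one card; 'card0' is an assumed default."""
    path = Path("/sys/class/drm") / card / "device" / "pp_dpm_mclk"
    if not path.exists():
        return []  # no amdgpu card under this name, or a different driver
    return parse_dpm_levels(path.read_text())


# Illustrative sample text, not output captured from the card in this thread.
SAMPLE = "0: 167Mhz\n1: 500Mhz\n2: 800Mhz *\n"
sample_levels = parse_dpm_levels(SAMPLE)
print(sample_levels)
```

On a machine with an amdgpu card, `read_mclk_levels("card0")` would return the real levels; if the active level differs between kernel versions under load, that would point at the kernel driver rather than at the ROCm packages.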

perestoronin commented 4 years ago

Thank you for the old 2.10 ebuilds. In the case I described, ethminer gave 44+ Mh/s with the ROCm 2.10 driver, but with the ROCm 3.8 drivers or the Linux kernel 5.8.11 drivers it now gives 36 Mh/s; the expected rate in both cases is 50 Mh/s.
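
For reference, the size of the regression can be worked out from the figures quoted above (44 and 36 Mh/s measured, 50 Mh/s expected). The helper name is made up for this sketch, and "44+" is treated as exactly 44, so the real drop is slightly larger.

```python
def relative_drop(before, after):
    """Percentage lost when going from `before` to `after`."""
    return (before - after) / before * 100.0


OLD_RATE = 44.0   # Mh/s with ROCm 2.10 (reported as "44+")
NEW_RATE = 36.0   # Mh/s with ROCm 3.8 or kernel 5.8.11 drivers
EXPECTED = 50.0   # Mh/s expected in both cases

drop = relative_drop(OLD_RATE, NEW_RATE)
old_share = OLD_RATE / EXPECTED * 100.0
new_share = NEW_RATE / EXPECTED * 100.0

print(f"drop between stacks: {drop:.1f}%")        # 18.2%
print(f"old rate vs expected: {old_share:.0f}%")  # 88%
print(f"new rate vs expected: {new_share:.0f}%")  # 72%
```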

Overclock scripts used: vega56-50-all, vega56-50, show.sh, tweak50.sh.work: https://gist.github.com/raw/2eb3345074fe5141219c714301f98543

amdmeminfo:
Found Card: 1002:687f rev c3 (AMD Radeon RX Vega 56)
Chip Type: Vega10
BIOS Version: 113-D0500300-102
PCI: 16:00.0
OpenCL Platform: 0
OpenCL ID: 4
Subvendor: 0x1002
Subdevice: 0x0b36
Sysfs Path: /sys/bus/pci/devices/0000:16:00.0
Memory Type: HBM2
Memory Model: Samsung KHA843801B

rocm-smi --showdriverversion : Driver version: 5.8.11-gentoo

PS: I used the patch /etc/portage/patches/dev-util/opencl-headers/rocm-opencl-headers.patch (https://gist.github.com/raw/429ba545d2d42135dcc2121cce079777) to compile amdmeminfo from https://github.com/perestoronin/rocmnew/tree/master/dev-util/amdmeminfo

justxi commented 4 years ago

As mentioned above, this does not seem to be related to the ebuilds. If there is a solution, please let us know.

perestoronin commented 3 years ago

This is not fixed in newer kernels (5.7+), but the hashrate can be restored by downgrading the Linux kernel to 5.4.
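
A small sketch of how one might check whether a given kernel release string falls into the range reported above. The 5.7 threshold is taken from this comment only and is not independently verified; the function names and the 5.4 example string are hypothetical.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a release string such as '5.8.11-gentoo'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_reported_regression_range(release, first_bad=(5, 7)):
    """True if the kernel is at or above the first version reported as slow."""
    return kernel_version(release) >= first_bad


# '5.8.11-gentoo' is the version shown by rocm-smi earlier in the thread;
# '5.4.0-gentoo' is a made-up example of a 5.4 kernel.
results = {
    release: in_reported_regression_range(release)
    for release in ("5.8.11-gentoo", "5.4.0-gentoo")
}
print(results)
```

On a live system the release string can be obtained with `platform.release()` from the Python standard library.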

a-repko commented 3 years ago

Hi, just in case, I'm posting a new collection of ROCm ebuilds here, from version 2.10 up to 4.1: rocm_ebuilds.tar.gz. These are mainly aimed at OpenCL; in addition, version 4.0 also contains rocm-smi, HIP and some additional ROC machinery. The off-site links posted above have been updated as well.

A few comments about hardware support and (off-topic) undervolting are in order here:

Raven Ridge (APU series 2000G) worked well up to 3.3 and started working again from 3.10 onward; see RadeonOpenCompute/ROCm#1219

Renoir (APU series 4000G) is still not working properly. Kernel 5.8.18 (ebuild) reports the correct number of CUs, but newer kernels up to 5.11 still report 20 CUs too many, although support seems to be slowly improving in ROCm 3.10, 4.0 and 4.1 with newer kernels. Version 4.1 in Gentoo produces a lot of error messages that clutter the output of the test program. I recommend installing OpenCL from AMDGPU-PRO 20.40 using this script (versions newer than 20.40 are messy because of an added ROCm-derived OpenCL for Big Navi).

Radeon (Pro) VII is reportedly not supported in ROCm 4.1 with the upstream kernel (which is the case on Gentoo).

Vega FE apparently cannot be undervolted through the human-readable /sys interface. You need to edit the binary /sys/class/drm/card1/device/pp_table, as discussed in RadeonOpenCompute/ROCm#463. The main point is that the SoC voltage values are stored first, and the sclk and mclk levels then reference them; see the source code in vega10_pptable.h. Radeon Pro VII cannot be undervolted even through this binary interface (at least I didn't manage to do it, so the power consumption is ca. 25% higher than optimum).
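
Since editing pp_table means writing raw bytes, it seems prudent to save a copy of the original table first. The sketch below only copies a binary file and records its size and checksum; it does not interpret the layout from vega10_pptable.h. The function name and the backup file name are assumptions, and the demo runs on a temporary stand-in file rather than on the real sysfs path.

```python
import hashlib
import tempfile
from pathlib import Path


def backup_binary_table(src, dst):
    """Copy a binary table to `dst`; return (size_in_bytes, sha256_hex)."""
    data = Path(src).read_bytes()
    Path(dst).write_bytes(data)
    return len(data), hashlib.sha256(data).hexdigest()


# Demo on a fake 16-byte "table" so the sketch runs without a GPU.
with tempfile.TemporaryDirectory() as tmp:
    fake_table = Path(tmp) / "pp_table"
    fake_table.write_bytes(bytes(range(16)))
    size, digest = backup_binary_table(fake_table, Path(tmp) / "pp_table.bak")
    restored = (Path(tmp) / "pp_table.bak").read_bytes()

print(size, digest)
```

On real hardware the source would be the path mentioned above, for example `backup_binary_table("/sys/class/drm/card1/device/pp_table", "pp_table.orig")`, run before any modified table is written back.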