Trying to use the default neoscrypt kernel with latest amdgpu-pro (17.50) drivers causes a crash

mark9064 commented 6 years ago

System: H81-PRO-BTC, Celeron G1830, 4 GB DDR3, 1600 W EVGA SuperNOVA G2, 6x RX 570, Ubuntu Server 16.04.3 (kernel 4.10)

Repro steps:
Install the latest drivers and the AMD APP SDK
Compile nsgminer
Run neoscrypt with the default kernel (neoscrypt)

Crash info:
Error message printed:
In hsa_operand section, at offset 3552: Address offset exceeds variable size
LLVM ERROR: Brig container validation has failed in BRIGAsmPrinter.cpp

Using the neoscrypt_vliw kernel works OK but only yields about 550 kH/s on each card (with BIOS mods/OC on each card).

This same crash also occurs when trying to use sgminer. I don't know what kernel it uses by default, but judging by the almost identical crash message (only the offset number is different), it uses the neoscrypt kernel too.

I have seen issue reports for this on sgminer too, but the Genesis Mining fork is no longer maintained and NiceHash decided to completely remove neoscrypt from their fork when the bug was reported.

ghostlander commented 6 years ago

Downgrade to v17.40 or older. I don't know what AMD have broken in their drivers again.

mark9064 commented 6 years ago

Sure will downgrade soon. Closing for now.

mark9064 commented 6 years ago

Didn't wanna reopen this, but no luck with the 17.40 drivers either. I reinstalled AMD-APP-SDK and recompiled nsgminer after installing the new drivers.

ghostlander commented 6 years ago

It works for me with v17.40. Maybe use an older SDK like v2.9.

mark9064 commented 6 years ago

Sure, will do. But I have noticed that I can't find the AMD APP SDK anywhere. AMD's website gives me expired certificate errors and then 404s on the APP SDK page. I wonder what's up. If you have the installer for the 2.9 version you know works, that would be great.

ghostlander commented 6 years ago

https://github.com/ghostlander/AMD-APP-SDK

mark9064 commented 6 years ago

thanks dude, trying it now

mark9064 commented 6 years ago

uninstalled sdk 3 and installed 2.9.1, now getting a backend error with clDevicesNum, and when running clinfo the cards don't show up at all. any ideas? all miners fail to launch now

ghostlander commented 6 years ago

This happens if the SDK has installed libraries it wasn't supposed to. The installer can detect fglrx only, not amdgpu-pro. Most likely the CPU-only OpenCL libraries have overwritten the amdgpu-pro ones. Remove these libOpenCL and libamdocl libraries, then reinstall amdgpu-pro.

mark9064 commented 6 years ago

sure ill reinstall drivers, where are these libs i need to remove gonna be found

ghostlander commented 6 years ago

Maybe under /opt/AMDAPPSDK-2.9-1/lib
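
For anyone retracing this step, a small script can list every libOpenCL and libamdocl file before anything is deleted, which makes it easier to tell SDK copies from driver copies. This is a minimal sketch, not part of nsgminer or the SDK: the function name `find_opencl_libs` is made up for illustration, and the directories to search are assumptions (the path above is itself only a guess).

```python
import fnmatch
import os

# Name patterns taken from the advice above (libOpenCL and libamdocl).
PATTERNS = ("libOpenCL*", "libamdocl*")


def find_opencl_libs(roots):
    """Return sorted (path, symlink_target_or_None) pairs for matching libraries.

    `roots` is a list of directories to search; which directories matter
    on a given system is an assumption the caller has to make.
    """
    found = []
    for root in roots:
        # os.walk yields nothing for a root that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if any(fnmatch.fnmatch(name, pattern) for pattern in PATTERNS):
                    path = os.path.join(dirpath, name)
                    # Record where a symlink points so stray links are easy to spot.
                    target = os.readlink(path) if os.path.islink(path) else None
                    found.append((path, target))
    return sorted(found, key=lambda item: item[0])
```

Calling `find_opencl_libs(["/opt/AMDAPPSDK-2.9-1/lib", "/usr/lib"])` and printing the pairs shows which files are real libraries and which are symlinks. The script only reports; nothing is removed.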

mark9064 commented 6 years ago

ill pull out the symlinks to /usr/lib first then. thanks for the quick support man, i appreciate this so much

mark9064 commented 6 years ago

ok reinstalled drivers all ok, clinfo detecting all cards and im back at the original issue :(

ghostlander commented 6 years ago

Since this is an LLVM error, I could also suggest using good old GCC v4.x instead.

mark9064 commented 6 years ago

ok ill look into that tomorrow. late here ;D

anddmx commented 6 years ago

@mark9064

I recompiled with GCC 4.9 from GCC 5, and I'm still getting the same error.

mark9064 commented 6 years ago

hmmm strange

anddmx commented 6 years ago

I am using the new beta linux mining driver 17.40. It's got to be a driver issue. Sgminer is giving the same error for neoscrypt.

On Wed, Jan 3, 2018, 1:55 AM mark9064 notifications@github.com wrote:

hmmm strange

mark9064 commented 6 years ago

yup, its cause sgminer uses the same kernel as nsgminer (i think)

mark9064 commented 6 years ago

any ideas ghost?

ghostlander commented 6 years ago

No, SGminer employs Wolf0's NeoScrypt kernel.

I'm using amdgpu-pro v17.40 with GCC v5.4.0 on Ubuntu 16.04 with the default 4.10 kernel.

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2482.3)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

The AMD APP SDK shouldn't really be an issue because the miner comes with the v2.9 headers which seem to be alright.

mark9064 commented 6 years ago

my clinfo literally returns the exact same, letter for letter. i wondered whether sgminer's kernel was the same because they crash in the exact same way, just with a different offset. i am using ubuntu 16.04 (server, so no xorg) with the 4.10 kernel too, along with amdgpu-pro 17.40 and gcc 5.4.0.

exactly the same system setup???

do you think it could be different hardware causing the issue (using rx570s 4GB here)
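
Comparing two clinfo dumps by eye is error-prone, so a field-by-field diff can confirm whether two systems really report the same values. The following is a rough sketch under assumptions: it treats the output as plain "Key: value" lines like the ones quoted above, the helper names `parse_clinfo` and `diff_fields` are invented for illustration, and the second version string in the usage note is a hypothetical placeholder rather than a real driver build.

```python
def parse_clinfo(text):
    """Collect 'Key: value' lines into {key: [values...]}.

    Values are kept in a list because the same key can appear once per
    device when several GPUs are installed.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def diff_fields(mine, theirs):
    """Return {key: (mine_values, theirs_values)} for every key that differs."""
    keys = sorted(set(mine) | set(theirs))
    return {
        key: (mine.get(key), theirs.get(key))
        for key in keys
        if mine.get(key) != theirs.get(key)
    }
```

For example, parsing one dump containing `Platform Version: OpenCL 2.0 AMD-APP (2482.3)` and another with a placeholder such as `(0000.0)` and passing both to `diff_fields` returns only the `Platform Version` entry. An empty result means the parsed fields match.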

ghostlander commented 6 years ago

It works with or without Xorg. I don't think there is much difference between RX480 and RX570, but who knows what their compiler does.

mark9064 commented 6 years ago

there shouldn't be any difference. both are polaris cards. any way i can provide more info to help with the problem here?

ghostlander commented 6 years ago

The only difference I can think of is the number of GPUs. Could you disable all of them except one either in software or hardware?

mark9064 commented 6 years ago

i can unplug them all :wink:

mark9064 commented 6 years ago

running 1 gpu only, no difference. i wonder why the neoscrypt kernel has problems but the vliw kernel doesn't

mark9064 commented 6 years ago

what libllvm are you on? my system just prompted me to update but ill hold for now... currently on 4.0, prompting 5.0

ghostlander commented 6 years ago

The VLIW kernel doesn't use local (shared) memory, and branching is reduced to a minimum. It's more straightforward, which results in higher register usage and a larger kernel size.

libllvm 4.0

mark9064 commented 6 years ago

so you reckon its something to do with shared memory? what could be causing issues with that?

ghostlander commented 6 years ago

It has something to do with poor AMD compiler quality. Used to be much better in the past.

mark9064 commented 6 years ago

so, do you think that this is easily fixable? would it be possible to see what the neoscrypt kernels from nsgminer and sgminer have in common to find what's causing the error? also, what is the difference between the vliw and the vliwp kernel?

ghostlander commented 6 years ago

VLIWp is another implementation of VLIW with Salsa and ChaCha running in parallel rather than in sequence. It may or may not deliver better performance.

I have other priorities at the moment rather than working around AMD bugs once again. Just pick a kernel that works.

anddmx commented 6 years ago

@ghostlander

Using neoscrypt_vliw kernel fixed my issue!

Thank you

ghostlander / nsgminer

Trying to use the default neoscrypt kernel with latest amdgpu-pro (17.50) drivers causes a crash #28