sgminer-dev / sgminer

Scrypt GPU miner
GNU General Public License v3.0

Weird errors - BAMT 1.3, Sapphire R9 280X cards #162

Closed davidhq closed 10 years ago

davidhq commented 10 years ago

Hi,

I was testing sgminer with nscrypt support and I had a lot of problems and I'm not sure if others are experiencing similar issues or not...

I have tested it on 3 rigs:

Rig 1 I could get only 250 KH hashrates on most restarts... sometimes it worked ok.

Rig 2 I get 700 KH (?) - very strange because around 340 KH is normal... also a lot of HW errors

Rig 3 When starting sgminer with nscrypt, I got this a lot of times:

GPU 1: | OFF / 0.000h/s | R: 0.0% HW:0 WU:0.0/m I:13
GPU 2: | OFF / 0.000h/s | R: 0.0% HW:0 WU:0.0/m I:13
GPU 3: | OFF / 0.000h/s | R: 0.0% HW:0 WU:0.0/m I:13
GPU 4: | 0.000/ 0.000h/s | R: 0.0% HW:0 WU:0.0/m I:13
GPU 5: | OFF / 0.000h/s | R: 0.0% HW:0 WU:0.0/m I:13

Also on quitting sgminer (on scrypt or nscrypt), it would just hang and I had to reboot the machine.

Please let me know if someone is also having similar troubles...

PS: the current master segfaults

veox commented 10 years ago

What hash rates are you expecting? Which sgminer version is bundled with BAMT 1.3?

Current master has just been fixed.

davidhq commented 10 years ago

I'm expecting around 340 KH (I mentioned it above)

Not sure which cgminer is originally bundled with BAMT 1.3. I searched online and it seems to be CGMiner 3.7.2, but I'm not sure.

veox commented 10 years ago

So are you using sgminer or cgminer? The latter is not relevant here.

davidhq commented 10 years ago

I'm now using your sgminer but you asked what is bundled with BAMT 1.3. I think there is only cgminer and that it's 3.7.2.

veox commented 10 years ago

Ah, okay. I remember sgminer having been included in some version of BAMT (the Litecoin variant).

As to the issue, see doc/BUGS.md as to what's relevant. Mostly your config and -TDv output.

davidhq commented 10 years ago

Thank you, I'll work on that in a few days. So should I send you the information I get with those options?

veox commented 10 years ago

It is best you post them online and provide links. I recommend gist.github.com, since you already have an account here. Usual precautions apply (no username/password).

ikerrg commented 10 years ago

I have a Sapphire R9 280X Dual-X rig (two graphics cards) and I'm using BAMT 1.3 with an updated sgminer (4.1.153). I've been using sgminer for two months, updating regularly to the latest git versions, and have had no problems so far. In fact, I now mine both scrypt and nscrypt and it works perfectly in my BAMT distro. Maybe you have a hardware problem (memory, graphics cards), or the problem is the number of graphics cards. Good luck.

davidhq commented 10 years ago

Thank you for the note... For me it happens on 3 different rigs (each has 6 cards). Also, one rig has regular cards (not Dual-X) and the other 2 are Dual-X... so it's probably not that all the cards are faulty in some similar way because they're from the same series.

When I come home this weekend I'll test again with new master code and will report info as the main developer suggested.

davidhq commented 10 years ago

Since installing the latest master, I've had far fewer issues... but strange things still happen sometimes. For example, I now cannot start sgminer after quitting it:

-TDv gives this (since it doesn't start, I don't know if this is of much help):

user@miner1:~$ /opt/miners/cgminer/sgminer -TDv -c /etc/bamt/sgminer.conf
[20:21:35] Global quota greatest common denominator set to 1
[20:21:35] Global quota greatest common denominator set to 1
[20:21:35] Global quota greatest common denominator set to 1
[20:21:35] Global quota greatest common denominator set to 1
[20:21:35] Global quota greatest common denominator set to 1
[20:21:35] Started sgminer 4.1.153-82-g9a2b-dirty
[20:21:35] Loaded configuration file /etc/bamt/sgminer.conf

Config: http://pastebin.com/zVr1eerF

UPDATE: I restarted the rig and sgminer still won't start :(((

veox commented 10 years ago

Did you do a full rebuild (including autoreconf -fi, etc.)? Also, the -dirty in version string means there have been changes to the code since last make clean. Did you make any changes?

scrypt in your config is unnecessary.

Run with -TDv 2> log.txt. If the instance exits, and the next one doesn't, you will have the log of the previous instance to see if anything went wrong at shutdown.

Also check that the previous instance exited cleanly (there are no sgminer processes left hanging).
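
A minimal sketch of the two suggestions above, assuming a POSIX shell and `pgrep` from procps. The binary and config paths are the ones from the command earlier in this thread; the function names and the `.prev` log rotation are illustrative additions, not part of sgminer.

```shell
#!/bin/sh
# Paths taken from the command shown earlier in the thread; LOG is a hypothetical name.
MINER=${MINER:-/opt/miners/cgminer/sgminer}
CONF=${CONF:-/etc/bamt/sgminer.conf}
LOG=${LOG:-log.txt}

# Succeeds if a process named exactly "sgminer" is still running.
miner_running() {
    pgrep -x sgminer >/dev/null 2>&1
}

# Keep the previous run's log, since "2> file" truncates it on the next start.
rotate_log() {
    if [ -f "$1" ]; then
        mv -f "$1" "$1.prev"
    fi
}

# Refuse to start a second instance; otherwise log stderr of this run to $LOG.
start_miner() {
    if miner_running; then
        echo "sgminer is still running; not starting a second instance" >&2
        return 1
    fi
    rotate_log "$LOG"
    "$MINER" -TDv -c "$CONF" 2> "$LOG"
}

# Usage on the rig: start_miner
```

If the new instance fails to start, `log.txt.prev` still holds the output of the instance that exited before it.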

davidhq commented 10 years ago

I got it to start after 2nd reboot - but I almost immediately got 0 hashrate on one of the cards... Then I ran it for a few seconds with -TDv, results: http://pastebin.com/t4X8VycU

davidhq commented 10 years ago

I didn't make changes, not sure how -dirty got in there... but I did use poolalgo branch on mrbrdo fork... which was rebased master. I just ran make clean and then everything else... and it still has -dirty flag. I didn't run autoreconf with "f" param before... now I did - what does it mean?

Now it started, but I already got HW errors:

GPU 0: | 647.9K/532.6Kh/s | R: 0.0% HW: 4 WU: 507.5/m I:13
GPU 1: | 740.6K/679.0Kh/s | R: 0.0% HW:34 WU: 300.2/m I:13

PS: I'm now here with the rigs... I live 1h away from the place so it would be great to test everything I can now.. I see you're online (you responded to my first msg). Anything else I need to do?

veox commented 10 years ago

I didn't run autoreconf with "f" param before... now I did - what does it mean?

Force. See man autoreconf.

veox commented 10 years ago

Reading through the log:

[20:45:13] ADL initialisation error: -21 (No Linux XDisplay in Linux Console environment)                    
[20:45:13] WARNING: GPU_MAX_ALLOC_PERCENT is not specified!                    
[20:45:13] WARNING: GPU_USE_SYNC_OBJECTS is not specified! 

Do you

export DISPLAY=:0
export GPU_MAX_ALLOC_PERCENT=100
export GPU_USE_SYNC_OBJECTS=1

before running sgminer, either manually or in a run script? (This is in doc/MINING.md).
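
As a rough sketch of such a run script, assuming a POSIX shell: the three variables and their values are the ones quoted above, while the function names and the check for missing variables are hypothetical additions meant to catch a manual launch that skipped them.

```shell
#!/bin/sh
# Export the variables quoted in this thread.
set_gpu_env() {
    export DISPLAY=:0
    export GPU_MAX_ALLOC_PERCENT=100
    export GPU_USE_SYNC_OBJECTS=1
}

# Print the name of every required variable that is unset or empty.
missing_gpu_env() {
    for name in DISPLAY GPU_MAX_ALLOC_PERCENT GPU_USE_SYNC_OBJECTS; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

set_gpu_env
# On the rig, launch the miner from this same shell so it inherits the variables:
#   exec /opt/miners/cgminer/sgminer -TDv -c /etc/bamt/sgminer.conf
```

Calling `missing_gpu_env` in an interactive shell before a manual start prints nothing when all three variables are present.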

veox commented 10 years ago

Then, you have "_url" in your config (still reading the log...).

davidhq commented 10 years ago

_url is just a way to ignore this entry (what would be the best way to do this?). Regarding the other 3 options: I have this in a script I normally use:

export DISPLAY=:0
export GPU_USE_SYNC_OBJECTS=1
export GPU_MAX_ALLOC_PERCENT=100

but when I created this log, I ran sgminer manually so they were not present :/ I can rerun it if needed.

veox commented 10 years ago

_url is just a way to ignore this entry...

For future reference: you can use "state":"disabled". (EDIT: this is poorly documented, true.)

Try re-running.

Is there anything else in your launch script?

Are there any other mining processes that BAMT may be automatically starting?

veox commented 10 years ago

Also, what Catalyst and AMD APP SDK versions are being used? The usual doc/BUGS.md stuff.

EDIT: sorry, I have to go AFK now. I was already on my way when you wrote an hour ago.

davidhq commented 10 years ago

Start scripts: http://pastebin.com/Xhqe3NEe http://pastebin.com/E5T0J2aX

those are originally in BAMT... but I have tried running sgminer manually and it also misbehaved in the same way.

Looking for Catalyst and AMD APP SDK versions... this is also from original BAMT 1.3 package. Searching where to find this info on the machine...

davidhq commented 10 years ago

I found on the net that BAMT 1.3 uses Catalyst 13.11... 1.4 has AMD-APP-SDK-v2.9 .. I cannot find the info for 1.3, but it's some earlier version than 2.9

Are these too old?

veox commented 10 years ago

No, they're not old.

You should try sgminer -n to list your devices, see if they are identical. You're specifying only one engine/memclock/powertune each, it is possible not every card supports them. The hang in your original report could be related.
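
One way to compare the listed devices is sketched below. It assumes the `sgminer -n` output contains a `Platform N devices: M` line followed by M lines of the form `[hh:mm:ss]  index   name`, as in the log posted later in this thread; the function names are hypothetical.

```shell
#!/bin/sh
# Read `sgminer -n` output on stdin and print one device name per line.
device_names() {
    awk '
        /Platform [0-9]+ devices: [0-9]+/ { remaining = $NF + 0; next }
        remaining > 0 {
            # Fields are: [timestamp] index name...
            name = $3
            for (i = 4; i <= NF; i++) name = name " " $i
            print name
            remaining--
        }
    '
}

# Succeed only if every listed device reports the same name.
devices_identical() {
    [ "$(device_names | sort -u | wc -l | tr -d " ")" -eq 1 ]
}

# Usage on the rig (stderr is merged in case the list is logged there):
#   sgminer -n 2>&1 | device_names | sort | uniq -c
```

Matching names only show that the cards use the same GPU chip. They do not confirm that every card accepts the same engine, memclock and powertune values.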

You're also using no-pool-disable, it's been renamed to disable-rejecting.

Try not setting shaders.

BTW, git master needs algorithm set for adaptive N factor CCs, see doc/configuration.md.

Other than that, I've run out of ideas.

Also, all of this is hardly reproducible.

davidhq commented 10 years ago

[23:53:39] CL Platform 0 vendor: Advanced Micro Devices, Inc.
[23:53:39] CL Platform 0 name: AMD Accelerated Parallel Processing
[23:53:39] CL Platform 0 version: OpenCL 1.2 AMD-APP (1348.4)
[23:53:39] Platform 0 devices: 6
[23:53:39]  0   Tahiti
[23:53:39]  1   Tahiti
[23:53:39]  2   Tahiti
[23:53:39]  3   Tahiti
[23:53:39]  4   Tahiti
[23:53:39]  5   Tahiti
[23:53:39] Number of ADL devices: 6
[23:53:39] ADL initialisation error: -21 (No Linux XDisplay in Linux Console environment)
[23:53:39] 6 GPU devices max detected

I have corrected the options you suggested. It's late now and I'm going home, so I won't test any more... maybe some of these things will resolve the issues, but somehow I doubt it.. and it really is strange that I seem to be one of a small (or zero?) number of people experiencing this... and I have these problems on most of my rigs (I have 6). Very strange. If you have some other idea in the future, I'd like to hear it. I will also report if I find the cause or if your parameter suggestions worked... but this is very stressful :( esp. since I live more than an hour away from where my rigs are and I often have to come because of the issues.

Anyway thank you for helping!! I hope that in the end this will be good for something and we'll be smarter :) There has to be a reason for this behaviour somewhere and it's probably not some bad GPUs.

davidhq commented 10 years ago

hmm do you think that when I flashed the cards to undervolt them - could this introduce some problems? But I have a friend with identical cards and he flashed them in the same way and he doesn't have such problems.

veox commented 10 years ago

You need the same envvars exported for sgminer -n to work.

Not sure about (re)flashing, never tried it.

I can only suggest testing with other software (cgminer-kalroth, bfgminer, vanilla cgminer 3.7.2).

davidhq commented 10 years ago

Hi!

Have some new info.... I now have an SSD instead of a USB drive.

Installed Ubuntu 12.04.4-desktop-amd64, amd-catalyst-13.12-linux-x86.x86_64, AMD-APP-SDK-v2.8.1.0-lnx64, latest sgminer master

findings:

-n says:

ADL index 0, id 32699 - FAILED to get BIOS info
[20:23:35] Failed to ADL_Adapter_ID_Get. Error -1
.... [for all the cards the same]
ADL found less devices than opencl!
[20:23:35] There is possibly more than one display attached to a GPU
[20:23:35] Use the gpu map feature to reliably map OpenCL to ADL

Cannot find solution to this by googling. Normal mining doesn't seem to report such things.

I did experience very bad issues again though: once upon starting sgminer, nothing happened and I also lost the ability to ssh to the machine... had to hard-reboot.

Later when it mined, there was always a problem with the first card: http://cl.ly/image/2f0l2s3K2n2D

Also when I quit sgminer by pressing q, most of the time it hangs like this: http://cl.ly/image/2i002N140a1o

I tried commit f2934d8afd4f0fdce7597e2b1f38a7a29337e5d3 (friend suggested, known to be stable on his rigs for sure) as well.

Should I slowly give up on this and really try other alternatives? I'll try now just to see what happens... I would really like to use sgminer though.

If you are close and have time, I'd appreciate some comment ASAP if possible.. like last time.

thank you! david

veox commented 10 years ago

I recommend trying other software. If it exhibits similar behaviour, then it is most probably an issue with your hardware or software setup.

davidhq commented 10 years ago

The problem was faulty cards. In BAMT 1.6.2 I was able to run cgminer which showed 0 hashrate for faulty cards (or showed them as SICK). After removing them, sgminer works...

The cards were distributed so that 4 out of 6 rigs had 1 of them (2 in one case).
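
For anyone hunting a faulty card without a second miner at hand, one possible approach is to run each GPU on its own and compare the logs. This is not what was done in this thread; it is a sketch that assumes sgminer keeps cgminer's `-d` device selection option and that `timeout` from coreutils is installed. The 300-second duration, the log file names and the function name are arbitrary.

```shell
#!/bin/sh
# Config path taken from the command shown earlier in the thread.
CONF=${CONF:-/etc/bamt/sgminer.conf}

# Print one test command per GPU index so each card can be run alone.
per_gpu_commands() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "timeout 300 sgminer -TDv -d $i -c $CONF 2> gpu$i.log"
        i=$((i + 1))
    done
}

# Dry run for a six-card rig; pipe to `sh` only after reviewing the commands.
per_gpu_commands 6
```

A card whose log shows a zero hash rate or a burst of HW errors while running alone is a candidate for removal. If sgminer hangs on exit as described above, `timeout` may not be enough to stop it.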