sgminer-dev / sgminer

Scrypt GPU miner
GNU General Public License v3.0

freezes when using 14.6 beta bin files #323

Open mrbrdo opened 10 years ago

mrbrdo commented 10 years ago

Froze after all pools had been down for a bit: GUI completely unresponsive, API timed out. ps shows the process as defunct: [sgminer] <defunct>.

[00:13:38] Accepted 03d6172b Diff 0.261/0.100 GPU 2 at NiceHash_X15_multi
[00:14:10] NiceHash_X11_multi alive, testing stability
[00:14:13] Accepted 085d50a2 Diff 0.120/0.100 GPU 3 at NiceHash_X15_multi
[00:14:25] Stratum connection to NiceHash_X11_multi interrupted
[00:15:51] Stratum connection to NiceHash_X15_multi interrupted
[00:16:01] Waiting for work to be available from pools.

Might be related to the bin files from the 14.6 beta drivers.

troky commented 10 years ago

Can you replicate this?

I am occasionally experiencing a similar problem (frozen GUI...) on Windows, but it isn't related to pool (dis)connections. It happens randomly, and I still can't trace the problem.

mrbrdo commented 10 years ago

Not sure what happened, but today 2 of my 4 rigs froze. They weren't defunct this time, but they were frozen and became defunct after I tried to kill -9 them. Unfortunately I had ncurses enabled, so my log is not that helpful, but here it is anyway. A 3rd miner also stopped working, but it shut down gracefully with "thread 2 create failed". Interestingly, the only rig that did not freeze or shut down was the one I am running in debug mode without ncurses; that rig also has an ASUS 280x instead of a Sapphire (but this probably does not matter).

Miner1:

sgminer 4.2.2-259-g855a - Started: [2014-07-05 04:53:32] - [0 days 05:38:51]
--------------------------------------------------------------------------------
(5s):5.493M (avg):11.29Mh/s | A:1  R:0  HW:0  WU:0.149/m
ST: 2  SS: 0  NB: 893  LW: 13  GF: 0  RF: 0
Connected to NiceHash_X15_multi (stratum) diff 0.100 as user 15bULC8snaKAMeFb3xBmmhbWj1xyTmBUfm
Block: 64331e33...  Diff:280  Started: [10:32:24]  Best share: 0.000
--------------------------------------------------------------------------------
[P]ool management [G]PU management [S]ettings [D]isplay options [Q]uit
GPU 0:  64.0C 1365RPM | 3.821M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
GPU 1:  59.0C 2233RPM | 3.852M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
GPU 2:  56.0C 2163RPM | 3.844M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
--------------------------------------------------------------------------------
[10:30:44] Accepted 2d5e65f9 Diff 0.022/0.019 GPU 1 at Trademybit_X11_multi
[10:30:48] Accepted 1b0fa7e1 Diff 0.037/0.019 GPU 0 at Trademybit_X11_multi
[10:30:48] Accepted 2d3ce846 Diff 0.022/0.019 GPU 0 at Trademybit_X11_multi
[10:30:56] Accepted 02d73453 Diff 0.352/0.019 GPU 2 at Trademybit_X11_multi
[10:31:02] Accepted 1c07cc0a Diff 0.036/0.019 GPU 1 at Trademybit_X11_multi
[10:31:06] Accepted 196ba149 Diff 0.039/0.019 GPU 1 at Trademybit_X11_multi
[10:31:13] Accepted 0915d3cb Diff 0.110/0.019 GPU 1 at Trademybit_X11_multi
[10:31:21] Accepted 2a28726e Diff 0.024/0.019 GPU 0 at Trademybit_X11_multi
[10:31:22] Accepted 347db203 Diff 0.019/0.019 GPU 0 at Trademybit_X11_multi
[10:31:23] Accepted 24b6cd30 Diff 0.027/0.019 GPU 1 at Trademybit_X11_multi
[10:31:32] Accepted 1c277cf8 Diff 0.036/0.019 GPU 1 at Trademybit_X11_multi
[10:31:35] Accepted 2744004f Diff 0.025/0.019 GPU 2 at Trademybit_X11_multi
[10:31:35] Accepted 1104fde8 Diff 0.059/0.019 GPU 1 at Trademybit_X11_multi
[10:31:40] Accepted 249e9fc4 Diff 0.027/0.019 GPU 0 at Trademybit_X11_multi
[10:31:57] Accepted 0d601e54 Diff 0.075/0.019 GPU 2 at Trademybit_X11_multi
[10:32:01] Accepted 3449f76e Diff 0.019/0.019 GPU 0 at Trademybit_X11_multi
[10:32:12] Stratum connection to Trademybit_X11_multi interrupted
[10:32:12] Trademybit_X11_multi not responding!
[10:32:12] Switching to Trademybit_X15_multi
[10:32:13] Trademybit_X15_multi not responding!
[10:32:13] Switching to NiceHash_X15_multi
[10:32:13] NiceHash_X15_multi difficulty changed to 0.100
[10:32:22] Applying pool settings for NiceHash_X15_multi...
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:24] NiceHash_X15_multi extranonce change requested
sgminer 4.2.2-259-g855a - Started: [2014-07-05 04:53:32] - [0 days 05:38:51]
--------------------------------------------------------------------------------
(5s):5.493M (avg):11.29Mh/s | A:1  R:0  HW:0  WU:0.149/m
ST: 2  SS: 0  NB: 893  LW: 13  GF: 0  RF: 0
Connected to NiceHash_X15_multi (stratum) diff 0.100 as user 15bULC8snaKAMeFb3xBmmhbWj1xyTmBUfm
Block: 64331e33...  Diff:280  Started: [10:32:24]  Best share: 0.000
--------------------------------------------------------------------------------
[P]ool management [G]PU management [S]ettings [D]isplay options [Q]uit
GPU 0:  64.0C 1365RPM | 3.821M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
GPU 1:  59.0C 2233RPM | 3.852M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
GPU 2:  56.0C 2163RPM | 3.844M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
--------------------------------------------------------------------------------
[10:32:13] Trademybit_X15_multi not responding!
[10:32:13] Switching to NiceHash_X15_multi
[10:32:13] NiceHash_X15_multi difficulty changed to 0.100
[10:32:22] Applying pool settings for NiceHash_X15_multi...
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:35:22] Trademybit_X15_multi alive, testing stability
[10:42:46] Trademybit_X11_multi alive, testing stability
[11:47:05] Trademybit_X13_multi alive, testing stability
[13:25:00] NiceHash_X11_multi alive, testing stability
[10:34:02] NiceHash_X15_multi extranonce change requested
[19:22:51] NiceHash_X15_multi difficulty changed to 0.008
[19:22:51] NiceHash_X15_multi difficulty changed to 0.044

Miner2:

sgminer 4.2.2-259-g855a - Started: [2014-07-05 04:54:06] - [0 days 04:20:49]
--------------------------------------------------------------------------------
(5s):0.000 (avg):0.000h/s | A:0  R:0  HW:0  WU:0.239/m
ST: 1  SS: 0  NB: 658  LW: 8  GF: 0  RF: 0
Connected to Trademybit_X11_multi (stratum) diff 0.005 as user mrbrdo.1
Block: aacaff46...  Diff:329  Started: [09:14:41]  Best share: 0.006
--------------------------------------------------------------------------------
[P]ool management [G]PU management [S]ettings [D]isplay options [Q]uit
GPU 0:  58.0C 2234RPM | 2.580M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
GPU 1:  55.0C 2176RPM | 2.560M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
GPU 2:  54.0C 2173RPM | 2.576M/ 0.000h/s | R:  0.0% HW:0 WU:0.239/m I:18
GPU 3:  56.0C 2086RPM | 2.573M/ 0.000h/s | R:  0.0% HW:0 WU:0.000/m I:18
--------------------------------------------------------------------------------
[09:14:16] Accepted 1b699e79 Diff 0.036/0.031 GPU 2 at Trademybit_X15_multi
[09:14:23] Trademybit_X11_multi alive, testing stability
[09:14:30] Accepted 0f17c00e Diff 0.066/0.031 GPU 1 at Trademybit_X15_multi
[09:14:37] Accepted 0e6eaa18 Diff 0.069/0.031 GPU 1 at Trademybit_X15_multi
[09:14:43] Stratum connection to Trademybit_X15_multi interrupted
[09:14:43] Trademybit_X15_multi not responding!
[09:14:43] Switching to Trademybit_X11_multi
[09:14:53] Applying pool settings for Trademybit_X11_multi...
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Accepted a381ecee Diff 0.006/0.005 GPU 2 at Trademybit_X11_multi
[09:14:56] Accepted 6293bedd Diff 0.010/0.005 GPU 3 at Trademybit_X11_multi
[09:17:48] Stratum connection to Trademybit_X11_multi interrupted
[09:18:57] NiceHash_X11_multi extranonce change requested
[09:22:02] Stratum connection to NiceHash_X15_backup interrupted
[19:24:57] Shutdown signal received.
[09:30:28] Stratum connection to Trademybit_X15_multi interrupted
[09:42:14] Stratum connection to Trademybit_X13_multi interrupted
[09:28:53] NiceHash_X15_multi extranonce change requested
[19:24:57] Stratum connection to NiceHash_X11_multi interrupted
[19:24:57] Stratum connection to NiceHash_X15_multi interrupted
[19:24:57] Trademybit_X11_multi not responding!
[19:24:57] Switching to Trademybit_X15_multi

Miner3 (not frozen, it just shut itself down):

[11:47:40] Trademybit_X13_multi alive, testing stability
[11:47:47] Stratum connection to Trademybit_X15_multi interrupted
[11:47:47] Trademybit_X15_multi not responding!
[11:47:47] Switching to Trademybit_X13_multi
[11:47:57] Applying pool settings for Trademybit_X13_multi...
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] thread 2 create failed
[11:47:58]
Summary of runtime statistics:

mrbrdo commented 10 years ago

I am not 100% sure, but the one rig that kept working after that was possibly not using the same .bin kernels as the other 3. I am using kernels compiled with the 14.6 beta drivers on Windows, while I have older drivers and am on Linux. So it is possible that the different bin files caused the freezes, which would make a lot of sense, since freezes are usually caused by the GPU driver.

troky commented 10 years ago

I had a similar problem today: 2 of 4 rigs froze without any unusual log message. I reverted one rig from 14.6 RC to 13.12, deleted the .bins, and am monitoring...

mrbrdo commented 10 years ago

I get them relatively often now... Sometimes only 1 card stops working. I had no such problems before the 14.6 bins, so it's quite possible they are causing it (although I also keep upgrading sgminer, I don't think the jansson update could cause a freeze). Btw, did you actually upgrade your drivers to 14.6, or do you just use bins you got from somewhere? I do the latter: I tried building them myself on a 32-bit Ubuntu with 14.6 (May 23), but the performance was exactly the same as on older drivers for some reason.

troky commented 10 years ago

I did a clean install of 14.6 RC (June 23) on Win7 x64. Using 14.6 bins on 13.12, and so far no crashes. The other machines (14.6 RC) work fine so far. Since I have 4 almost identical rigs, I'll swap bins between a working rig and the problematic one to see if that is the problem.

mrbrdo commented 10 years ago

There's a June build of 14.6 too? I thought May 23 was the latest. Could you put your bin files on Mega or somewhere, if you're using a 280x?

I just had another shutdown, same as before: "thread 2 create failed". I wonder why that happens. It shouldn't be OpenCL-related in this case; it seems pthread_create fails for some reason. I'm going to add a log for the return value and keep running this setup on this rig.

[08:36:56] Trademybit_X11_multi not responding!
[08:36:56] Switching to Trademybit_X15_multi
[08:36:57] Trademybit_X15_multi difficulty changed to 0.002
[08:37:06] Applying pool settings for Trademybit_X15_multi...
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] thread 2 create failed
[08:37:07]
Summary of runtime statistics:

troky commented 10 years ago

I use 290s only. It looks like the new 14.6 RC doesn't like overclocking. It runs smoothly on stock clocks (947/1250), but when I change the clocks a bit it crashes within a few minutes/hours. The other rigs are overclocked (1020/1500) and work without problems.

There are some 14.6 rc drivers available from guru3d. I can't recall if I am using original or modified guru3d drivers.

ystarnaud commented 10 years ago

@troky did you try using older 13.12 or 14.4 drivers and simply have the 14.6 dlls from 14.6 in your sgminer folder? That's what I do to get 14.6 speed on my windows rigs. Although I don't have the picky R9 290s...

I know the above doesn't help @mrbrdo but if we change stuff to soft reset except for when gpu thread count changes then this might get around this issue. Just a thought.

mrbrdo commented 10 years ago

@ystarnaud yeah we should always do soft reset if gpu-threads are not changed... I can test if my rigs still crash with soft reset.

I got a crash again with the 14.6 bins:

[16:47:58] thread 4 create failed
./start.sh: line 5: 23803 Segmentation fault

In this case pthread_create returned 11 (EAGAIN), which means there were not enough resources or the thread limit (20000) was reached. The 3 other rigs are running on the normal bins (not 14.6) and are still stable, so I am really sure the bins are causing the freezes.
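For anyone who wants to capture the same information, here is a minimal sketch of logging the value returned by pthread_create. The names `worker` and `create_thread_logged` are hypothetical and not taken from sgminer; the one firm fact is that pthread_create reports failure through its return value rather than through errno.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical worker; stands in for a real mining thread. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

/*
 * Hypothetical helper (not sgminer's API): create a thread and, on failure,
 * log the error number that pthread_create returned, e.g. EAGAIN.
 */
static int create_thread_logged(pthread_t *pth, int thr_id)
{
    int ret;

    assert(pth != NULL);
    ret = pthread_create(pth, NULL, worker, NULL);
    if (ret != 0)
        fprintf(stderr, "thread %d create failed: %d (%s)\n",
                thr_id, ret, strerror(ret));
    return ret;
}
```

On success the function returns 0 and the caller still owns the handle, so it should later join or detach the thread.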

troky commented 10 years ago

@ystarnaud not using dlls... full driver install only. Currently only one rig is problematic... today it didn't freeze but dropped to 0 hashrate.

mrbrdo commented 10 years ago

By the way, possibly related to my "thread create failed" messages: even if pthread_exit or pthread_cancel is called, the parent still needs to call pthread_join to release the pthread ID, which then becomes recyclable (unless you call pthread_detach()). We should check that we always do this. In the case where it failed I had 33 switches and 10 threads total, so 330 threads were created/killed.
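A rough sketch of that create/cancel/join cycle, assuming a worker that blocks in sleep() (a cancellation point). `idle_worker` and `churn_threads` are made-up names for illustration and are not sgminer code.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical worker: blocks in sleep(), which is a cancellation point. */
static void *idle_worker(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);
}

/*
 * Create, cancel and join `count` threads one after another.
 * Returns the number of pthread_create calls that failed.
 */
static int churn_threads(int count)
{
    int failures = 0;

    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        pthread_t pth;

        if (pthread_create(&pth, NULL, idle_worker, NULL) != 0) {
            failures++;
            continue;
        }
        pthread_cancel(pth);
        /* Joining releases the cancelled thread's stack and ID. If this
         * call is skipped, a joinable thread keeps them until the
         * process exits. */
        pthread_join(pth, NULL);
    }
    return failures;
}
```

With the join in place, a call such as `churn_threads(330)` should be able to reuse the same resources on every iteration instead of accumulating them.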

mrbrdo commented 10 years ago

I am getting thread create failed on non-14.6 bins too. Something is wrong. :-/

mrbrdo commented 10 years ago

This time after 32 switches (5 GPUs, gpu-threads 2). So that is 320 threads. This is interesting: http://stackoverflow.com/questions/17062413/pthread-create-fails-with-eagain-at-291-cycle

If you create 290 threads, that's using nearly 3Gb of address space - the max for a 32 bit process.

Although my other rigs don't yet exhibit this problem for now... The thread that could not be created is different each time (4, 6 so far).
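The arithmetic behind that quote can be checked with a small sketch. The stack sizes used here (8 MiB and 10 MiB) and the 3 GiB address-space figure are assumptions for illustration; the thread does not state the actual per-thread stack size on these rigs, and the limit only applies if the process is 32-bit. Both function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* How many thread stacks of `stack_size` bytes fit in `address_space` bytes. */
static unsigned long long max_thread_stacks(unsigned long long address_space,
                                            unsigned long long stack_size)
{
    assert(stack_size > 0);
    return address_space / stack_size;
}

/* Default stack size the C library reports for new threads on this system,
 * or 0 if it cannot be queried. */
static size_t default_thread_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0)
        size = 0;
    pthread_attr_destroy(&attr);
    return size;
}
```

Under these assumptions, `max_thread_stacks(3ULL << 30, 10ULL << 20)` gives 307 and `max_thread_stacks(3ULL << 30, 8ULL << 20)` gives 384, so un-joined threads in the low hundreds could plausibly exhaust a 32-bit address space. The real limit also depends on everything else mapped into the process.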

mrbrdo commented 10 years ago

I think I may have found and fixed the thread create issue in https://github.com/sgminer-dev/sgminer/commit/e33590f37d060cebd7ef5d7c1f929926fbe21e29 The problem was that thr_info_cancel reset pth to 0, so the thread was not joined afterwards. The pthreads man page states that if a thread is not detached (as is the case here), it must be joined so that its resources can be freed.
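To illustrate the pattern (this is not the code from the commit), here is a sketch in which cancelling leaves the handle intact and only a successful join clears it. `struct worker_info` and the `worker_*` functions are hypothetical stand-ins, not sgminer's `thr_info` API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for a per-thread record; not sgminer's struct. */
struct worker_info {
    pthread_t pth;
    bool started;   /* true while pth refers to a thread not yet joined */
};

/* Hypothetical worker: blocks in sleep(), which is a cancellation point. */
static void *idle_worker(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);
}

static int worker_start(struct worker_info *w)
{
    int ret;

    assert(w != NULL);
    ret = pthread_create(&w->pth, NULL, idle_worker, NULL);
    w->started = (ret == 0);
    return ret;
}

/* Request cancellation but keep the handle so the thread can still be joined. */
static void worker_cancel(struct worker_info *w)
{
    if (w->started)
        pthread_cancel(w->pth);
}

/* Join the thread, then forget the handle. Returns 0 if there was nothing
 * to join or the join succeeded. */
static int worker_join(struct worker_info *w)
{
    int ret;

    if (!w->started)
        return 0;
    ret = pthread_join(w->pth, NULL);
    w->started = false;
    return ret;
}
```

Keeping the "is there something to join" state separate from the cancel step means a later cleanup pass cannot mistake a cancelled thread for one that was never created.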

In my mrbrdo_testing branch, I also made threads exit themselves when doing hard reset (just seems safer and faster), and also changed so hard reset is not always needed when switching algorithm. I will first test to see if the thread join fix helped then I will merge the rest.

troky commented 10 years ago

I am going to test mrbrdo_testing today on problematic rig @14.6RC. Other 3 14.6 rigs work fine for >48h.

FYI, I never got thread X create failed on Windows (7 x64). 4x290 with 8GB system RAM.

mrbrdo commented 10 years ago

I've been running stable on 4 rigs for almost 24 hours now, in valgrind. Even on the 14.6 bins. For the record, I am using valgrind 3.9.0 with this command:

valgrind --show-possibly-lost=no --undef-value-errors=no --show-reachable=no --error-limit=no --leak-check=no --freelist-vol=100000000 --log-file=valgrind.log -v ./sgminer -c my.conf 2>debug.log

Most of the errors are in strdup and getaddrinfo, but those are not real errors (just some optimization with memory alignment), and there is some stuff coming from fglrx; I guess they do not use traditional memory allocation.

mrbrdo commented 10 years ago

I still get freezes occasionally with these bins. Before I started using them I never had any freezes (but could be just coincidence).

mrbrdo commented 10 years ago

A log of a freeze: https://gist.github.com/mrbrdo/3eeba880038bea51d604

platinum4 commented 9 years ago

Since many people are still using 14.6RC2 for the neoscrypt & Lyra2RE algorithms, I'd say this issue has long been resolved, but was never closed.