Open mrbrdo opened 10 years ago
Can you replicate this?
I am occasionally experiencing a similar problem (frozen GUI...) on Windows, but it isn't related to pool (dis)connection. It happens randomly, and I still can't trace it.
Not sure what happened, but today 2 out of 4 of my rigs froze. They weren't defunct this time, but they were frozen and became defunct after I tried to kill -9 them. Unfortunately I had ncurses enabled, so my log is not that helpful, but here it is anyway. A 3rd miner also stopped working, but it shut down gracefully with "thread 2 create failed". Interestingly, the only rig that did not freeze or shut down was the one I am running in debug mode without ncurses. That rig also has an ASUS 280x instead of a Sapphire (but this probably does not matter).
Miner1:
sgminer 4.2.2-259-g855a - Started: [2014-07-05 04:53:32] - [0 days 05:38:51]
--------------------------------------------------------------------------------
(5s):5.493M (avg):11.29Mh/s | A:1 R:0 HW:0 WU:0.149/m
ST: 2 SS: 0 NB: 893 LW: 13 GF: 0 RF: 0
Connected to NiceHash_X15_multi (stratum) diff 0.100 as user 15bULC8snaKAMeFb3xBmmhbWj1xyTmBUfm
Block: 64331e33... Diff:280 Started: [10:32:24] Best share: 0.000
--------------------------------------------------------------------------------
[P]ool management [G]PU management [S]ettings [D]isplay options [Q]uit
GPU 0: 64.0C 1365RPM | 3.821M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
GPU 1: 59.0C 2233RPM | 3.852M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
GPU 2: 56.0C 2163RPM | 3.844M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
--------------------------------------------------------------------------------
[10:30:44] Accepted 2d5e65f9 Diff 0.022/0.019 GPU 1 at Trademybit_X11_multi
[10:30:48] Accepted 1b0fa7e1 Diff 0.037/0.019 GPU 0 at Trademybit_X11_multi
[10:30:48] Accepted 2d3ce846 Diff 0.022/0.019 GPU 0 at Trademybit_X11_multi
[10:30:56] Accepted 02d73453 Diff 0.352/0.019 GPU 2 at Trademybit_X11_multi
[10:31:02] Accepted 1c07cc0a Diff 0.036/0.019 GPU 1 at Trademybit_X11_multi
[10:31:06] Accepted 196ba149 Diff 0.039/0.019 GPU 1 at Trademybit_X11_multi
[10:31:13] Accepted 0915d3cb Diff 0.110/0.019 GPU 1 at Trademybit_X11_multi
[10:31:21] Accepted 2a28726e Diff 0.024/0.019 GPU 0 at Trademybit_X11_multi
[10:31:22] Accepted 347db203 Diff 0.019/0.019 GPU 0 at Trademybit_X11_multi
[10:31:23] Accepted 24b6cd30 Diff 0.027/0.019 GPU 1 at Trademybit_X11_multi
[10:31:32] Accepted 1c277cf8 Diff 0.036/0.019 GPU 1 at Trademybit_X11_multi
[10:31:35] Accepted 2744004f Diff 0.025/0.019 GPU 2 at Trademybit_X11_multi
[10:31:35] Accepted 1104fde8 Diff 0.059/0.019 GPU 1 at Trademybit_X11_multi
[10:31:40] Accepted 249e9fc4 Diff 0.027/0.019 GPU 0 at Trademybit_X11_multi
[10:31:57] Accepted 0d601e54 Diff 0.075/0.019 GPU 2 at Trademybit_X11_multi
[10:32:01] Accepted 3449f76e Diff 0.019/0.019 GPU 0 at Trademybit_X11_multi
[10:32:12] Stratum connection to Trademybit_X11_multi interrupted
[10:32:12] Trademybit_X11_multi not responding!
[10:32:12] Switching to Trademybit_X15_multi
[10:32:13] Trademybit_X15_multi not responding!
[10:32:13] Switching to NiceHash_X15_multi
[10:32:13] NiceHash_X15_multi difficulty changed to 0.100
[10:32:22] Applying pool settings for NiceHash_X15_multi...
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:24] NiceHash_X15_multi extranonce change requested
sgminer 4.2.2-259-g855a - Started: [2014-07-05 04:53:32] - [0 days 05:38:51]
--------------------------------------------------------------------------------
(5s):5.493M (avg):11.29Mh/s | A:1 R:0 HW:0 WU:0.149/m
ST: 2 SS: 0 NB: 893 LW: 13 GF: 0 RF: 0
Connected to NiceHash_X15_multi (stratum) diff 0.100 as user 15bULC8snaKAMeFb3xBmmhbWj1xyTmBUfm
Block: 64331e33... Diff:280 Started: [10:32:24] Best share: 0.000
--------------------------------------------------------------------------------
[P]ool management [G]PU management [S]ettings [D]isplay options [Q]uit
GPU 0: 64.0C 1365RPM | 3.821M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
GPU 1: 59.0C 2233RPM | 3.852M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
GPU 2: 56.0C 2163RPM | 3.844M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
--------------------------------------------------------------------------------
[10:32:13] Trademybit_X15_multi not responding!
[10:32:13] Switching to NiceHash_X15_multi
[10:32:13] NiceHash_X15_multi difficulty changed to 0.100
[10:32:22] Applying pool settings for NiceHash_X15_multi...
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:32:23] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[10:35:22] Trademybit_X15_multi alive, testing stability
[10:42:46] Trademybit_X11_multi alive, testing stability
[11:47:05] Trademybit_X13_multi alive, testing stability
[13:25:00] NiceHash_X11_multi alive, testing stability
[10:34:02] NiceHash_X15_multi extranonce change requested
[19:22:51] NiceHash_X15_multi difficulty changed to 0.008
[19:22:51] NiceHash_X15_multi difficulty changed to 0.044
Miner2:
sgminer 4.2.2-259-g855a - Started: [2014-07-05 04:54:06] - [0 days 04:20:49]
--------------------------------------------------------------------------------
(5s):0.000 (avg):0.000h/s | A:0 R:0 HW:0 WU:0.239/m
ST: 1 SS: 0 NB: 658 LW: 8 GF: 0 RF: 0
Connected to Trademybit_X11_multi (stratum) diff 0.005 as user mrbrdo.1
Block: aacaff46... Diff:329 Started: [09:14:41] Best share: 0.006
--------------------------------------------------------------------------------
[P]ool management [G]PU management [S]ettings [D]isplay options [Q]uit
GPU 0: 58.0C 2234RPM | 2.580M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
GPU 1: 55.0C 2176RPM | 2.560M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
GPU 2: 54.0C 2173RPM | 2.576M/ 0.000h/s | R: 0.0% HW:0 WU:0.239/m I:18
GPU 3: 56.0C 2086RPM | 2.573M/ 0.000h/s | R: 0.0% HW:0 WU:0.000/m I:18
--------------------------------------------------------------------------------
[09:14:16] Accepted 1b699e79 Diff 0.036/0.031 GPU 2 at Trademybit_X15_multi
[09:14:23] Trademybit_X11_multi alive, testing stability
[09:14:30] Accepted 0f17c00e Diff 0.066/0.031 GPU 1 at Trademybit_X15_multi
[09:14:37] Accepted 0e6eaa18 Diff 0.069/0.031 GPU 1 at Trademybit_X15_multi
[09:14:43] Stratum connection to Trademybit_X15_multi interrupted
[09:14:43] Trademybit_X15_multi not responding!
[09:14:43] Switching to Trademybit_X11_multi
[09:14:53] Applying pool settings for Trademybit_X11_multi...
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Initialising kernel darkcoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[09:14:54] Accepted a381ecee Diff 0.006/0.005 GPU 2 at Trademybit_X11_multi
[09:14:56] Accepted 6293bedd Diff 0.010/0.005 GPU 3 at Trademybit_X11_multi
[09:17:48] Stratum connection to Trademybit_X11_multi interrupted
[09:18:57] NiceHash_X11_multi extranonce change requested
[09:22:02] Stratum connection to NiceHash_X15_backup interrupted
[19:24:57] Shutdown signal received.
[09:30:28] Stratum connection to Trademybit_X15_multi interrupted
[09:42:14] Stratum connection to Trademybit_X13_multi interrupted
[09:28:53] NiceHash_X15_multi extranonce change requested
[19:24:57] Stratum connection to NiceHash_X11_multi interrupted
[19:24:57] Stratum connection to NiceHash_X15_multi interrupted
[19:24:57] Trademybit_X11_multi not responding!
[19:24:57] Switching to Trademybit_X15_multi
Miner3 (not frozen, just shut down itself):
[11:47:40] Trademybit_X13_multi alive, testing stability
[11:47:47] Stratum connection to Trademybit_X15_multi interrupted
[11:47:47] Trademybit_X15_multi not responding!
[11:47:47] Switching to Trademybit_X13_multi
[11:47:57] Applying pool settings for Trademybit_X13_multi...
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] Initialising kernel marucoin-mod.cl with bitalign, unpatched BFI, nfactor 0, n 0
[11:47:57] thread 2 create failed
[11:47:58]
Summary of runtime statistics:
Now I am not 100% sure, but possibly the last rig that kept working after that was not using the same .bin kernels as the other 3. I am using kernels compiled with the 14.6 beta drivers on Windows, while the rigs themselves run older drivers on Linux. So it is possible that the different bin files caused the freezes, which would actually make a lot of sense, since freezing is usually caused by the GPU driver.
I had a similar problem today. 2/4 rigs froze without any unusual log message. I reverted one rig from 14.6 RC to 13.12, deleted the .bins, and am monitoring...
I get them relatively often now... Sometimes only 1 card stops working. I had no such problems before the 14.6 bins, so it's quite possible they are causing it (although I also keep upgrading sgminer, but I don't think the jansson update could cause a freeze). Btw, did you actually upgrade your drivers to 14.6, or do you just use bins you got from somewhere? I do the latter. I tried building them myself on a 32-bit Ubuntu with 14.6 (may23), but the performance was exactly the same as on older drivers for some reason.
I did a clean install of 14.6 RC (june 23) on Win7 x64. I am using the 14.6 bins on 13.12 and so far no crashes. The other machines (14.6 RC) work fine so far. Since I have 4 almost identical rigs, I'll swap bins between a working and a problematic rig to see if that is the problem.
There's a june build of 14.6 too? I thought may23 was the latest. Could you put your bin files on mega or somewhere, if you're using 280x?
I just had another shutdown, same as before: "thread 2 create failed". I wonder why that happens. It shouldn't be OpenCL-related in this case; it seems pthread_create fails for some reason. I'm gonna add a log for the return value and keep running this setup on this rig.
[08:36:56] Trademybit_X11_multi not responding!
[08:36:56] Switching to Trademybit_X15_multi
[08:36:57] Trademybit_X15_multi difficulty changed to 0.002
[08:37:06] Applying pool settings for Trademybit_X15_multi...
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] Initialising kernel bitblock.cl with bitalign, unpatched BFI, nfactor 0, n 0
[08:37:07] thread 2 create failed
[08:37:07]
Summary of runtime statistics:
I use 290s only. It looks like the new 14.6 RC doesn't like overclocking. The rig runs smoothly on stock clocks (947/1250), but when I change the clocks a bit it crashes within a few minutes or hours. The other rigs are overclocked (1020/1500) and work without problems.
There are some 14.6 rc drivers available from guru3d. I can't recall if I am using original or modified guru3d drivers.
@troky did you try using older 13.12 or 14.4 drivers and simply have the 14.6 dlls from 14.6 in your sgminer folder? That's what I do to get 14.6 speed on my windows rigs. Although I don't have the picky R9 290s...
I know the above doesn't help @mrbrdo but if we change stuff to soft reset except for when gpu thread count changes then this might get around this issue. Just a thought.
@ystarnaud yeah we should always do soft reset if gpu-threads are not changed... I can test if my rigs still crash with soft reset.
I got a crash again with the 14.6 bins:
[16:47:58] thread 4 create failed
./start.sh: line 5: 23803 Segmentation fault
In this case pthread_create returned 11 (EAGAIN), which means not enough resources or thread limit (20000) reached. The 3 other rigs I am running on the normal bins (not 14.6) and they are still stable, so I am really sure the bins are causing the freezes.
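For anyone who wants to capture the same information, here is a minimal sketch of logging the pthread_create return value. The wrapper name create_thread_logged and the noop_thread body are made up for illustration and are not sgminer's actual code. The one thing to remember is that pthread_create returns the error number directly instead of setting errno.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper (not sgminer's code): create a thread and, on
 * failure, log the value returned by pthread_create, which is the error
 * number itself (errno is not set). */
static int create_thread_logged(pthread_t *pth, int thr_id,
                                void *(*start)(void *), void *arg)
{
    int ret = pthread_create(pth, NULL, start, arg);

    if (ret != 0) {
        fprintf(stderr, "thread %d create failed: %d (%s)%s\n",
                thr_id, ret, strerror(ret),
                ret == EAGAIN ? " - out of resources or thread limit hit" : "");
    }
    return ret;
}

/* Trivial thread body used only for demonstration: returns its argument. */
static void *noop_thread(void *arg)
{
    return arg;
}
```

On success the caller still owns the thread and has to join (or detach) it later.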
@ystarnaud not using dlls... full driver install only. Currently only one rig is problematic... today it didn't freeze but got 0 hashrate.
By the way, possibly related to my "thread create failed" messages: even if pthread_exit or pthread_cancel is called, the parent process still needs to call pthread_join to release the pthread ID, which will then become recyclable (unless you call pthread_detach()). We should check that we always do this. In the case when it failed I had 33 switches and 10 threads total, so 330 threads were created/killed.
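To illustrate the join requirement, here is a rough sketch that mimics repeated pool switches by creating and joining short-lived threads. The names short_lived and simulate_switches are hypothetical, and for simplicity the threads are created one at a time instead of 10 at once. With every thread joined, the failure count would be expected to stay at 0 even after 330 create/kill cycles.

```c
#include <pthread.h>

/* Thread body that exits immediately, standing in for a mining thread
 * that is torn down on each pool switch. */
static void *short_lived(void *arg)
{
    return arg;
}

/* Create and join `threads_per_switch` threads for each of `switches`
 * rounds. Returns the number of pthread_create/pthread_join calls that
 * failed. */
static int simulate_switches(int switches, int threads_per_switch)
{
    int failures = 0;

    for (int s = 0; s < switches; s++) {
        for (int t = 0; t < threads_per_switch; t++) {
            pthread_t pth;

            if (pthread_create(&pth, NULL, short_lived, NULL) != 0) {
                failures++;
                continue;
            }
            /* Joining lets the system release the thread's resources. */
            if (pthread_join(pth, NULL) != 0)
                failures++;
        }
    }
    return failures;
}
```

Calling simulate_switches(33, 10) corresponds to the 33 switches with 10 threads mentioned above.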
I am getting thread create failed on non-14.6 bins too. Something is wrong. :-/
This time after 32 switches (5 GPUs, gpu-threads 2). So that is 320 threads. This is interesting: http://stackoverflow.com/questions/17062413/pthread-create-fails-with-eagain-at-291-cycle
"If you create 290 threads, that's using nearly 3Gb of address space - the max for a 32 bit process."
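The arithmetic behind that quote can be sketched as a simple division. The 10 MiB stack size below is an assumption picked to roughly match the quoted figures (many Linux systems default to 8 MiB, and the real value depends on ulimit -s and thread attributes), and max_thread_stacks is a hypothetical helper, not something from sgminer.

```c
/* Rough upper bound on how many thread stacks fit into a given amount of
 * address space, ignoring everything else the process maps. */
static unsigned long max_thread_stacks(unsigned long long address_space_bytes,
                                       unsigned long long stack_bytes)
{
    if (stack_bytes == 0)
        return 0;
    return (unsigned long)(address_space_bytes / stack_bytes);
}
```

With 3 GiB of address space this gives 307 stacks at 10 MiB each and 384 at 8 MiB each, which is in the same range as the ~290 to ~330 threads at which creation started failing here, if the stacks of unjoined threads are never released.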
Although my other rigs don't yet exhibit this problem for now... The thread that could not be created is different each time (4, 6 so far).
I think maybe I have found and fixed the thread create issue in https://github.com/sgminer-dev/sgminer/commit/e33590f37d060cebd7ef5d7c1f929926fbe21e29. The problem was that thr_info_cancel reset pth to 0, so the thread was not joined afterwards. The pthreads man page states that if a thread is not detached (true in this case), it must be joined so that its resources can be freed.
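The general pattern behind that kind of fix looks roughly like the sketch below. This is not the code from the commit: struct worker and the worker_* functions are hypothetical stand-ins. The point is only that cancelling must not throw away the thread handle, because the handle is still needed for the join.

```c
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for a per-thread record; names are illustrative. */
struct worker {
    pthread_t pth;
    bool started;   /* true while pth refers to a thread not yet joined */
};

/* Thread body that runs until cancelled; sleep() is a cancellation point. */
static void *worker_loop(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);
}

static int worker_start(struct worker *w)
{
    int ret = pthread_create(&w->pth, NULL, worker_loop, NULL);

    w->started = (ret == 0);
    return ret;
}

/* Request cancellation but keep the handle so it can still be joined. */
static void worker_cancel(struct worker *w)
{
    if (w->started)
        pthread_cancel(w->pth);
}

/* Join the thread and only then forget the handle. Returns 0 on success. */
static int worker_join(struct worker *w)
{
    if (!w->started)
        return 0;

    int ret = pthread_join(w->pth, NULL);
    if (ret == 0)
        w->started = false;
    return ret;
}
```

The order is start, cancel, join. Clearing the handle inside the cancel step would make the later join a no-op and leave the thread's resources allocated.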
In my mrbrdo_testing branch, I also made threads exit themselves when doing hard reset (just seems safer and faster), and also changed so hard reset is not always needed when switching algorithm. I will first test to see if the thread join fix helped then I will merge the rest.
I am going to test mrbrdo_testing today on problematic rig @14.6RC. Other 3 14.6 rigs work fine for >48h.
FYI, I never got "thread X create failed" on Windows (7 x64). 4x290 with 8GB system RAM.
I've been running stable on 4 rigs for almost 24 hours now, in valgrind. Even on the 14.6 bins.
For the record I am using valgrind 3.9.0 with this command: valgrind --show-possibly-lost=no --undef-value-errors=no --show-reachable=no --error-limit=no --leak-check=no --freelist-vol=100000000 --log-file=valgrind.log -v ./sgminer -c my.conf 2>debug.log
Most of the errors are in strdup and getaddrinfo but those are not real errors (just some optimization with memory alignment), and there is some stuff coming from fglrx, I guess they do not use traditional memory allocation.
I still get freezes occasionally with these bins. Before I started using them I never had any freezes (but could be just coincidence).
A log of a freeze: https://gist.github.com/mrbrdo/3eeba880038bea51d604
Since many people are still using 14.6RC2 for the neoscrypt & Lyra2RE algorithms, I'd say this issue has long been resolved, but was never closed.
Froze after all pools were down for a bit. GUI completely unresponsive, API timed out. ps shows the process as defunct: [sgminer] <defunct>. Might be related to the 14.6 beta driver bin files.