sgminer-dev / sgminer

Scrypt GPU miner
GNU General Public License v3.0

Several segfaults with debug log #331

Closed: mrbrdo closed this issue 10 years ago

mrbrdo commented 10 years ago

I got segfaults on all 4 of my rigs within 60 seconds of each other. The first rig's crash seems to differ from the others'. Debug logs: https://gist.github.com/mrbrdo/12be345923aad9b189d7

The rigs are running different sgminer versions:

Not sure if it's relevant, but miner2-4 all logged "Failed to parse a \n terminated string in recv_line" shortly before the crash.

troky commented 10 years ago

I often see "Failed to parse a \n terminated string in recv_line" in my logs, but I don't think it's related, because I was seeing it long before the crashes started.

mrbrdo commented 10 years ago

https://github.com/sgminer-dev/sgminer/commit/e33590f37d060cebd7ef5d7c1f929926fbe21e29 might be to blame, since it is present in all 4 of the versions I was running and I didn't have these segfaults before it. But it could just be coincidence.

mrbrdo commented 10 years ago

I updated the implementation a bit so that it avoids allocation entirely. I will run one rig with it to see if that was causing the issue.

mrbrdo commented 10 years ago

Oh right, it couldn't have been that, because miner4's version did a soft threads reset, so kill_mining wasn't even called. Damn, it's something else. Running all my miners under gdb now.

troky commented 10 years ago

So far so good with the latest build. No crashes. All rigs @ 14.6 RC.

I am still not convinced this is an sgminer problem. Keep in mind that we are using beta/RC drivers in weird combinations on OC'ed cards :) For example, I don't have any issues with 13.12 at stock clocks and native kernels... but I just don't like the performance.

Running one rig in VS debugger, too...

mrbrdo commented 10 years ago

Well, since the segfaults are not coming from the AMD drivers, I think it is an sgminer issue. Since all my rigs crashed within roughly 60 seconds of each other, I think it was something the pools sent. Also, I am only using the 14.6 bins on half of my rigs, but all of them crashed.

troky commented 10 years ago

Oh, I didn't realize you had synchronized crashes. I suppose we have different types of crashes then, because mine are/were random.

mrbrdo commented 10 years ago

Also, running in gdb was a disaster: after a few switches some GPUs simply stopped hashing (all GPUs on some rigs, only some on others, even with non-14.6 bins). Trying to quit then left a defunct sgminer process, so I had to reboot. :/ I wonder what makes the GPUs hang like that. I can run it under valgrind, though.

mrbrdo commented 10 years ago

Found and fixed one: https://github.com/sgminer-dev/sgminer/issues/334

mrbrdo commented 10 years ago

So after a week or two, I only seem to be having issues on one of my four rigs; not sure if one of the cards is broken or something. It keeps freezing every few days.

troky commented 10 years ago

I moved my problematic rig to Linux to see if the problem persists... It looks like some of the cards don't like particular engine/memory clock combinations. After a few days of fine-tuning, I've managed to achieve what is (for me) good uptime... The watchdog still restarts the rig every few days because cards stop hashing. The weird thing, however, is that I get 10-15% lower hashrates than on Windows with the same config (and kernels).

mrbrdo commented 10 years ago

Just today, one of my rigs that had been running fine for a week or more got a frozen sgminer. I attempted to restart it as usual, but it seems it did not reboot. :/ I am not getting many segfaults anymore, but these freezes seem to be more common now. Not sure if they are caused by algo switching or by the implementation of the algos themselves, since I know there were some stability issues with X11-mod at the beginning.

EDIT: Today I restarted both of the frozen rigs manually and they are working fine again. I am waiting for the 14.6 final drivers; hopefully they will help with the freezing.

mrbrdo commented 10 years ago

Here is a debug log of a rig where sgminer froze: https://gist.github.com/mrbrdo/3eeba880038bea51d604

mrbrdo commented 10 years ago

I believe the segfaulting issue has been fixed; I am not getting segfaults anymore, only freezes. So I am closing this.