sgminer-dev / sgminer

Scrypt GPU miner
GNU General Public License v3.0

kernel building segfaults - reproducible #222

Status: Closed (mfiano closed this 10 years ago)

mfiano commented 10 years ago

I just spent the last 4 hours debugging this problem, and I think I've narrowed it down enough to give a bug report:

The problem: When building a kernel, sgminer segfaults if a kernel binary with a different nf value already exists. For example, if "zuikkisPitcairnglg2tc15505nf10w256l4.bin" already exists, then building the kernel for pool 2 with nfactor 11 crashes sgminer. The reverse is also true: if the nf11 binary exists, building the nf10 one crashes. It also crashes when mixing and matching kernels. For example, if the zuikkis binary with nf10 exists, building bufius with nf11 crashes.
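To make the two colliding file names concrete, here is a minimal sketch that splits a cached binary name into its fields and derives the nf11 sibling. The field meanings (lg as lookup gap, tc as thread concurrency, w as worksize) are inferred from the one file name quoted above and from the pool settings, not taken from sgminer's source, and `parse_kernel_bin` and `with_nfactor` are hypothetical helper names.

```python
import re

# Pattern inferred from the single file name in this report (an assumption,
# not sgminer's documented naming scheme). The leading prefix is left opaque.
KERNEL_BIN = re.compile(
    r"^(?P<prefix>.+?)lg(?P<lookup_gap>\d+)tc(?P<thread_concurrency>\d+)"
    r"nf(?P<nfactor>\d+)w(?P<worksize>\d+)l(?P<l>\d+)\.bin$"
)


def parse_kernel_bin(name):
    """Split a cached kernel binary name into its fields, or return None."""
    match = KERNEL_BIN.match(name)
    if match is None:
        return None
    # Keep the prefix as text; every other field is a decimal number.
    return {
        key: (value if key == "prefix" else int(value))
        for key, value in match.groupdict().items()
    }


def with_nfactor(name, nfactor):
    """Return the same file name with only the nf field changed."""
    return re.sub(r"nf\d+(?=w\d+l\d+\.bin$)", f"nf{nfactor}", name)


if __name__ == "__main__":
    existing = "zuikkisPitcairnglg2tc15505nf10w256l4.bin"
    print(parse_kernel_bin(existing))
    print(with_nfactor(existing, 11))
```

Under these assumptions the two binaries differ only in the nf field, which is the only part of the name that changes between pool 1 and pool 2.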

What I have discovered:

The setup: Using pool-specific GPU settings, I have 2 pools, 1 for Scrypt and 1 for Scrypt-N, as follows:

    {
      "name" : "Pool 1",
      "url" : "xxx",
      "user" : "xxx",
      "pass" : "xxx",
      "pool-nfactor" : "10",
      "pool-algorithm" : "zuikkis",
      "pool-intensity" : "13",
      "pool-gpu-engine" : "1070",
      "pool-thread-concurrency" : "15505"
    },
    {
      "name" : "Pool 2",
      "url" : "xxx",
      "user" : "xxx",
      "pass" : "xxx",
      "pool-nfactor" : "11",
      "pool-algorithm" : "zuikkis",
      "pool-intensity" : "18",
      "pool-gpu-engine" : "1100",
      "pool-thread-concurrency" : "15505"
    },
    {

mfiano commented 10 years ago

I forgot to mention: I'm using the latest commit from the v5_0 branch at the time of writing.

mrbrdo commented 10 years ago

Thanks for the detailed instructions! However, I cannot reproduce this. I added 2 pools just like you did, with the same pool-* settings (different values, of course).

Which cards do you have? I have an R9 280X. Also, from your settings I don't understand why there is such a big difference in intensity between Scrypt and Scrypt-N. I normally use the same value...

mrbrdo commented 10 years ago

I think I see one part of the problem. Try moving pool-nfactor so that it is always below pool-algorithm. However, we should still find out why it's segfaulting... Definitely try a lower intensity on Scrypt-N first.

mfiano commented 10 years ago

"This only occurs on my R9 270 rig (Pitcairn)...other rigs with other types of cards work fine."

They are Sapphire 270s. The intensities were typos; they are both 19 in my config. The problem is that it segfaults at "Building zuikkisPitcairn...". It even throws an error that the GPU was idle for 60 seconds and declares it SICK, but it hasn't even been running for 60 seconds. This happens as soon as it attempts to build the kernel (the file never even shows up).

mfiano commented 10 years ago

Moving pool-nfactor below pool-algorithm has no effect. As soon as it tries building the nf11 binary, it segfaults. This happens every time with my Pitcairn GPU rigs. The only fix is to use just one or the other: only Scrypt or only Scrypt-N. If you delete the nf10 binary, nf11 can be generated without a segfault. If you delete the nf11 binary, nf10 can be generated without a segfault. Both cannot exist on disk simultaneously.
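The workaround above (remove the other binary before building) can be sketched as a small check that lists cached binaries matching a target name in everything except the nf value. This is only an illustration: `sibling_binaries` is a hypothetical helper, the file-name pattern is inferred from the one name quoted in this report, and it covers the same-kernel case only, not the mixed zuikkis/bufius case. It lists files and deletes nothing.

```python
import os
import re

# Matches the nf field just before the trailing "w<digits>l<digits>.bin"
# (layout inferred from the file name in this report; an assumption).
NF = re.compile(r"nf(\d+)(?=w\d+l\d+\.bin$)")


def sibling_binaries(directory, target):
    """Return cached .bin files equal to `target` except for the nf value."""
    key = NF.sub("nf*", target)
    return sorted(
        name
        for name in os.listdir(directory)
        if name != target and NF.search(name) and NF.sub("nf*", name) == key
    )


if __name__ == "__main__":
    # Directory is an assumption: wherever sgminer writes its .bin files.
    for name in sibling_binaries(".", "zuikkisPitcairnglg2tc15505nf11w256l4.bin"):
        print("would conflict:", name)
```

Any name it prints is a candidate to move aside before starting sgminer with the other nfactor.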

sterlingpickens commented 10 years ago

The bigger nfactor asks for more memory to be allocated: 2032271360 bytes. Try lowering the TC and see if the problem goes away.
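For reference, that figure can be reproduced with a short calculation. This is a sketch under stated assumptions: the buffer is taken to be 128 bytes × (N / lookup gap) × thread concurrency with N = 2^nfactor, the lookup gap of 2 is read from "lg2" in the binary's file name, and the TC of 15505 comes from the pool settings. The formula is an assumption that happens to match the number quoted, not something confirmed from sgminer's source here.

```python
def scrypt_buffer_bytes(nfactor, lookup_gap, thread_concurrency):
    """Estimate the scrypt scratch buffer size in bytes (assumed formula)."""
    n = 1 << nfactor                      # N = 2**nfactor: 1024 for nf10, 2048 for nf11
    per_thread = -(-n // lookup_gap)      # ceil(N / lookup_gap) entries kept per thread
    return 128 * per_thread * thread_concurrency


if __name__ == "__main__":
    for nf in (10, 11):
        print(nf, scrypt_buffer_bytes(nf, lookup_gap=2, thread_concurrency=15505))
    # nf10 -> 1016135680 bytes, nf11 -> 2032271360 bytes
```

With the same TC, going from nfactor 10 to 11 doubles the estimate, which is why lowering the TC is the first thing to try.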

mrbrdo commented 10 years ago

@mfiano I have seen this before: "It even throws an error that the GPU was idle for 60 seconds and declaring SICK". I think this is the actual bug and what is causing the segfault. It tries to restart the GPU while it is not even initialized yet (still compiling the kernel), and that's why it crashes.

mfiano commented 10 years ago

TC has no effect. As mentioned, it builds the kernel fine if the other kernel doesn't exist on disk. Whether it's the bigger-memory or the smaller-memory kernel doesn't matter. If one of them exists already, boom.

mrbrdo commented 10 years ago

I am 80% sure that this "kernel with other n-factor already exists" has nothing to do with it.

mfiano commented 10 years ago

@mrbrdo It is not random whatsoever. It only occurs when the other binary exists. I can kill sgminer and start it again, and if one of them exists, it will segfault when it tries building the other one. This is reproducible 100% of the time (tried about 50 times). In the control case, where no file exists, either kernel builds fine.

mrbrdo commented 10 years ago

I can't reproduce it. But I am working on fixing the "declaring SICK" bug and you can try that after I'm done.

mrbrdo commented 10 years ago

Try the updated v5_0 branch. If it still crashes, please run with -T -D and show me the last 10-20 lines of output before the crash.
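One way to pull out that tail, assuming the output of the -T -D run was redirected to a file first (the name sgminer.log is hypothetical), is a small helper like this:

```python
from collections import deque


def last_lines(path, count=20):
    """Return the last `count` lines of a text file, without trailing newlines."""
    with open(path, "r", errors="replace") as handle:
        # deque with maxlen keeps only the most recent `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    # "sgminer.log" is an assumed file name for the captured output.
    for line in last_lines("sgminer.log", 20):
        print(line)
```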

mfiano commented 10 years ago

Ok will do

mfiano commented 10 years ago

@mrbrdo This didn't fix the problem. Here are the logs:

With debug: http://pastebin.com/7PqnLmeR
Without debug: http://pastebin.com/g5wjCiYn

mrbrdo commented 10 years ago

Can you show me your full config (not only pools)? You can remove user/password but leave everything else please.

mfiano commented 10 years ago

http://pastebin.com/z2xJzd4q

mrbrdo commented 10 years ago

Removing pool-gpu-threads doesn't change anything, right? I have an idea of something else I can check, and will do that now. If you can, hop on IRC (Freenode, #sgminer-dev); that will make this faster.

mfiano commented 10 years ago

Right, I only added that recently as a test.

mrbrdo commented 10 years ago

Sorry for asking, but is it possible to get a log from a run without pool-gpu-threads? In particular, I'm interested in whether it still becomes SICK after the last changes I made. Threads are restarted differently when pool-gpu-threads is set (a more "hardcore" restart), and I can see how the SICK state could happen in that case, but not in the other one.

mfiano commented 10 years ago

This issue has been fixed thanks to @mrbrdo. Changes committed.

mrbrdo commented 10 years ago

(Fixed with https://github.com/sgminer-dev/sgminer/commit/00d17d16fee1dc66719eec9b424f3682ddd168dd)