nanopool / Claymore-Dual-Miner

Claymore's Dual Ethereum+Decred_Siacoin_Lbry AMD+NVIDIA GPU Miner

Miner crashes randomly and will not hash correctly on Nvidia GT 1060 OC 6GB #59

Open tateconcepts opened 7 years ago

tateconcepts commented 7 years ago

Hello Claymore,

I found an odd issue after I switched from ZEC to your miner for ETH. This is a new install on Windows 10 Enterprise, and all GPUs would only report 2-3 Mh/s with Nvidia driver 384.76 (I initially started with a 37x.xx driver from the end of 2016). In this case, I started with a fresh image with drivers only, plus Corsair and MSI Afterburner software - same result, with an occasional crash after which the miner would not restart itself. I then used the same rig and downgraded to Windows 8.1 with the same driver just to see what would occur. In this case, your miner reports the correct hash rate but again crashes randomly. I have all my environment variables set and a 16GB swap file. Do you have any suggestions on what to check here? Are there any debugging options I can set (and output to log files, as I asked in a different topic)?

This is problematic because my 7th generation Core i7 will only run on Windows 10 not to mention any security risks. I can upgrade this semi-working rig now if you wish but I need to be able to log consistency and errors. Any feedback you can provide would be awesome! Thank you!

-BT

aenciso commented 7 years ago

Do you have any error messages? Are you overclocking?

costadelsol commented 7 years ago

I have seen such issues on the same card (MSI GTX 1060 6GB Gaming X) and can confirm that Afterburner is the one causing the issue. Try to completely uninstall Afterburner, and use Windows 10 and the latest Claymore. You should get 19 MH/s for each card. Then install Afterburner and try my config:

power 50% (yes 50) core +92 memory +925

I have an 8-GPU rig with that configuration, dual mining at 197 MH/s ETH and around 2000 Sia. If this configuration doesn't work for you, remember that mileage may vary. Try to lower the memory clock until you find a stable config.

Also check your stock memory clock, as the normal 1060s are rated at 8000 MHz but there are some OC versions rated at 9000. If this is the case, the memory overclock in Afterburner doesn't seem to allow for much improvement.

tateconcepts commented 7 years ago

Let's forget about the crashing; stability is nothing if one can't even get hashing rates where they should be. I have two or three good screen captures of system/miner/GPU details on Windows 8.1 before and after with Windows 10 Enterprise. At first, this was attempted on Windows 10 Pro and we shall do so again. What I'd really like to know is: where the hell are the log files for me to debug? The errors or crashing I will post later, because that seems to be related to the program's watchdog essentially not restarting the GPU.

When I did have Afterburner and Corsair Link installed, I could see the GPU drop to nothing in power when the miner was not responding properly. Having my wife randomly press the spacebar on an open PowerShell window usually seems to get the thread moving again (on any miner, for ETH or ZEC). ZEC output is not affected; I can mine ZEC with excellent Sol/s, and the GPUs are properly detected, as well as the CPU.

I started with a fresh install of Windows 10 Ent 1511 - with nothing but the required Intel chipset/OpenCL drivers (there is onboard video, as I use it to drive the display, not the Nvidia GPUs). I cannot see a means to disable it in the UEFI of the Asus PRIME Z270M-PLUS (I'm going to trash this board soon for an X99 when I get this GPU mess resolved). So I can start with a fresh Windows 8.1 or Windows 10 Enterprise. It's a PITA to upgrade that OS because you cannot keep files and settings with that version of Windows, so I'm going to start once again!

In this case, Windows 10 Pro (aka 10.0 or 10240) RTM, then upgrade to 1511/1607. I will NOT upgrade to the Creators Update due to deferment of updates closing to only 7 days. I will then install ONLY the basic Nvidia driver tested up to last December, just after that build was released (12/16/16, I think). As you will see, there will be NO MSI Afterburner and NO Corsair Link - this is just Windows, GPU and drivers. I'm not sure how the statement above confirmed or determined that Afterburner is my issue, especially when this issue occurs without it being installed. I have read somewhere that Windows 10 1511 itself has bugs with proper GPU memory allocation (which would make sense with the 2.XX Mh/s nonsense I am getting from each GPU). I should get close to 50 Mh/s when running on Windows 8.1 with no hangs. I would have stayed there except the processor for this rig is a 7th generation Core i7 7700.

I will post the referenced images here from my last two days of attempts at this, and again this evening if I can get Windows 10 Pro 1607 actually doing what it is supposed to. If there is a means to access any logging from the miner, I would like to know (it sounds absurd that this miner cannot output a log file; the JSON API from pools is delayed and unreliable if I need to address something when I get a Splunk Enterprise alert).

aenciso commented 7 years ago

It's weird you don't have any logs. Mine are stored directly where EthDcrMiner64.exe is and are created every time you start mining.

costadelsol commented 7 years ago

First of all don't blame claymore devs or any other developer as the problem is yours.

Claymore is an executive process that sends orders to the cards, and if your cards make 2 MH/s, Claymore has no idea that is not right, as this is the potential of a GTX 980M, so it could be fine. Why throw an error when a card is reporting hashes? Claymore only resets the miner when there are no hashes from the cards, meaning there is something wrong.

Mining is somehow an art and finding the origin of problems is your business not claymore's. Blame only claymore when you are sure that it's claymore's fault. Claymore works fine for most people and that's ok.

As I've said, your problem seems to be related to overclocking. If you don't use Afterburner but your cards are overclocked at stock to 9000 MHz, this could be the issue. I have no idea what your stock memory MHz is, but your hash rate is typical of an overclock issue.

My cards are rated at 8000mhz and my max stable setup is +925 so 8925mhz.

Some cards like the MSI GTX 1060 Gaming X+ are overclocked to 9000, so maybe this card needs an underclock to be stable for mining, as the memory type might be the same.

My recommendation:

Install the latest Nvidia driver. Install MSI Afterburner.

Lower the memory clock of your cards by at least 200 MHz. If this works, try lowering it by less until you get a stable config.

Tell me how it was.


tateconcepts commented 7 years ago

@costadelsol Thank you, but I think people are misunderstanding me. I desire to blame nobody here, probably not even the miner - but it is something related to Windows 10 and the use of these miners. I was hoping others have run into this - as far as the miner hanging, I'll check for a log in the Claymore directory where that binary is, if I recall.

I know that the cards need to be tuned and so on. I tuned them before for both ETH and ZEC. I know their sweet spots. What I am trying to convey here is that I can't even get these cards to push defaults out of the box on ETH, just ZEC.

It's a fresh OS and I just did a clean install of Windows 10 Pro now to test on. I'll keep you posted and thanks to all who have replied. I can't say thank you enough!

tateconcepts commented 7 years ago

FYI on the note of the crashing - it seems to be intermittent and not related to the hashrate. As I mentioned, it hangs on Windows 8.1 but performs great otherwise - (I now have a 7th gen CPU, not 5th, and must upgrade to Windows 10) hence this problem. I think it would be foolish to troubleshoot a hanging process if you can't get the GPUs performing right... nevertheless, I'll defer that to others who have experienced this. I came across this today but I've done all this before, I think. https://steemit.com/mining/@fidasx/fixing-windows-10-ethereum-mining

tateconcepts commented 7 years ago

@costadelsol Oh yeah, I don't have that card. I have MSI GTX 1060 6GB Aero https://www.amazon.com/MSI-GTX-1060-AERO-ITX/dp/B06XBGGTVG/ref=sr_1_1?ie=UTF8&qid=1501192953&sr=8-1&keywords=msi+aero+itx+gtx+1060

Each is at 23 Mh/s when tuned in Windows 8.1 Pro/Ent with Afterburner, at around 60-65C, and the fans only kick on every now and then. I can see the watts/amps of each card from the Corsair Link software. If it wasn't for MSI Afterburner and Corsair Link, I'd can this for RHEL in a heartbeat.

tateconcepts commented 7 years ago

Well damn, what an ordeal. I finally have hashrates at 38 combined stock on Win 10 Pro 1703. Latest drivers are 22.21.13.8494 18-JUL-2017. I also did see both the previous logs with a 1500429420_log naming convention that have details I am seeking. Not sure how these roll over though, because I'd like a new one created each hour. I have two options right now

-r -1 60 (as I want the miner to restart every hour, if that would make a new log) and I also have -logfile C:\Program Files\Claymore\Logs\%DATE%_Hourly_Logs

But the 60 is not an option (I thought I read somewhere that it was). Also, the -logfile switch does not work, as it seems %DATE% expands the path to C:\Program Files\Claymore\Logs\07/21/2017_Hourly_Logs - and the date variable seems to throw a "cannot find option" error. I wonder how I can achieve this?

Thank you again all for your assistance. It seems all of this is entirely related to having both the correct driver AND the correct feature build of Windows. I suppose the 1511/1607 anniversary builds that allowed for more flexibility on unplanned interruptions for forced updates cannot be used any longer. So for those facing this oddity, grab the manual Windows Update utility - update - remove the drivers and download whatever the latest is (although performance with drivers is another matter).

Again, thank you all for great support and really helping out. I'll keep you posted should the miner crash now that we at least can get the GPU's to hash correctly at base. I'll also set optimal settings with Afterburner and have those on hand should I have any further miner hangs. If anyone has comments about how I would achieve this hourly logging, that would be totally awesome as I am sending those via Splunk Forwarder to ES so I am alerted when the miner is down ASAP vs waiting on JSON API from the pool which is delayed.
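On the hourly log name asked about above: the date in the quoted path expands with `/` characters, which Windows treats as path separators, so one workaround is to build the name with a slash-free timestamp before launching the miner. The following is only a sketch under assumptions: the helper name `hourly_log_name` and the `.txt` suffix are made up for illustration, and whether the miner accepts the resulting path for `-logfile`, or starts a new file on restart, is not confirmed anywhere in this thread.

```python
from datetime import datetime
from pathlib import PureWindowsPath

# Log directory copied from the path quoted in the post.
LOG_DIR = PureWindowsPath(r"C:\Program Files\Claymore\Logs")


def hourly_log_name(now: datetime, log_dir: PureWindowsPath = LOG_DIR) -> str:
    """Build a log path whose file name contains no '/' or ':' characters.

    Hypothetical helper: the timestamp is truncated to the hour so that each
    hourly restart would get its own file name.
    """
    stamp = now.strftime("%Y-%m-%d_%H00")
    return str(log_dir / f"{stamp}_Hourly_Logs.txt")


if __name__ == "__main__":
    # A launcher script could pass this value (quoted, because of the space
    # in "Program Files") to the -logfile switch.
    print(hourly_log_name(datetime.now()))
```

Because the directory contains a space, the value would also need to be wrapped in double quotes on the command line.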

tateconcepts commented 7 years ago

@costadelsol I spoke too soon and received the first hang after 20-30 minutes.

This is what I saw prior to noticing the delay. I pressed the spacebar and instantly received the 0 Mh/s messages and the "socket send failed, disconnect" error with the watchdog message below

ETH: 07/27/17-20:29:23 - New job from us1.ethermine.org:4444
ETH - Total Speed: 46.533 Mh/s, Total Shares: 12, Rejected: 0, Time: 00:26
ETH: GPU0 23.176 Mh/s, GPU1 23.356 Mh/s
ETH: 07/27/17-20:29:33 - New job from us1.ethermine.org:4444
ETH - Total Speed: 47.034 Mh/s, Total Shares: 12, Rejected: 0, Time: 00:26
ETH: GPU0 23.422 Mh/s, GPU1 23.611 Mh/s
ETH: 07/27/17-20:29:37 - New job from us1.ethermine.org:4444
ETH - Total Speed: 46.651 Mh/s, Total Shares: 12, Rejected: 0, Time: 00:27
ETH: GPU0 23.214 Mh/s, GPU1 23.437 Mh/s
GPU0 t=66C fan=63%, GPU1 t=61C fan=53%
GPU0 t=66C fan=62%, GPU1 t=61C fan=53%
ETH: 07/27/17-20:30:23 - New job from us1.ethermine.org:4444
ETH - Total Speed: 46.942 Mh/s, Total Shares: 12, Rejected: 0, Time: 00:27
ETH: GPU0 23.438 Mh/s, GPU1 23.505 Mh/s
ETH: 07/27/17-20:30:43 - New job from us1.ethermine.org:4444
ETH - Total Speed: 0.000 Mh/s, Total Shares: 12, Rejected: 0, Time: 00:39
ETH: GPU0 0.000 Mh/s, GPU1 0.000 Mh/s
GPU0 t=66C fan=61%, GPU1 t=61C fan=52%
ETH: Stratum - socket send failed 10038, disconnect
ETH: Connection lost
WATCHDOG: GPU 1 hangs in OpenCL call, exit

C:\Program Files\Claymore>pause
Press any key to continue . . .

GPU0/1 are 65% power with +99 core and +935 mem clock

What should I look for in the logs that would cause this? Pressing the spacebar every few hours is the only thing I can do to keep the miner running on any Windows OS before this message occurs.
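As a starting point for that question, the output above contains three distinctive markers just before the exit: a total speed of 0.000 Mh/s, a "socket send failed" line, and a WATCHDOG line. The sketch below scans log lines for those markers. It assumes the log file holds the same text as the console output quoted here; the function name and event labels are made up for illustration.

```python
import re

# Patterns copied from the console output quoted in this thread.
TOTAL_SPEED = re.compile(r"Total Speed: (\d+\.\d+) Mh/s")
SOCKET_FAIL = re.compile(r"socket send failed (\d+)")


def find_hang_events(lines):
    """Return (line_number, label) pairs for lines that look like a hang."""
    events = []
    for number, line in enumerate(lines, start=1):
        speed = TOTAL_SPEED.search(line)
        if speed and float(speed.group(1)) == 0.0:
            events.append((number, "zero_hashrate"))
        fail = SOCKET_FAIL.search(line)
        if fail:
            events.append((number, f"socket_error_{fail.group(1)}"))
        if line.startswith("WATCHDOG:"):
            events.append((number, "watchdog"))
    return events


# Sample lines taken from the output above.
SAMPLE = [
    "ETH - Total Speed: 46.942 Mh/s, Total Shares: 12, Rejected: 0, Time: 00:27",
    "ETH - Total Speed: 0.000 Mh/s, Total Shares: 12, Rejected: 0, Time: 00:39",
    "ETH: Stratum - socket send failed 10038, disconnect",
    "ETH: Connection lost",
    "WATCHDOG: GPU 1 hangs in OpenCL call, exit",
]

if __name__ == "__main__":
    for number, label in find_hang_events(SAMPLE):
        print(number, label)
```

A zero reading can also be a normal part of startup, so a monitor built on this would probably want to ignore zeros for a short period after each start.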

tateconcepts commented 7 years ago

@costadelsol @aenciso Current settings in MSI Afterburner upon restart show 1645Mhz/4738Mhz at 0mV/63C BTW

tateconcepts commented 7 years ago

Hit the spacebar again and same message, except that the miner restarted this time and didn't hang

ETH: Authorized
Setting DAG epoch #136...
Setting DAG epoch #136 for GPU #1
Setting DAG epoch #136 for GPU #0
Create GPU buffer for GPU #0
Create GPU buffer for GPU #1

GPU #0: GeForce GTX 1060 6GB, 6144 MB available, 10 compute units, capability: 6.1

GPU #1: GeForce GTX 1060 6GB, 6144 MB available, 10 compute units, capability: 6.1

ETH - Total Speed: 0.000 Mh/s, Total Shares: 0, Rejected: 0, Time: 00:00
ETH: GPU0 0.000 Mh/s, GPU1 0.000 Mh/s
Incorrect ETH shares: none
Pool switches: ETH - 0, DCR - 0
Current ETH share target: 0x0000000112e0be82 (diff: 4000MH), epoch #136
GPU0 t=53C fan=38%, GPU1 t=51C fan=37%

GPU 1 DAG creation time - 7140 ms
Setting DAG epoch #136 for GPU #1 done
GPU 0 DAG creation time - 7183 ms
Setting DAG epoch #136 for GPU #0 done
ETH: 07/27/17-20:51:08 - New job from us1.ethermine.org:4444
ETH - Total Speed: 47.445 Mh/s, Total Shares: 0, Rejected: 0, Time: 00:00
ETH: GPU0 23.717 Mh/s, GPU1 23.727 Mh/s
GPU0 t=56C fan=38%, GPU1 t=53C fan=37%
ETH: 07/27/17-20:51:26 - New job from us1.ethermine.org:4444
ETH - Total Speed: 47.261 Mh/s, Total Shares: 0, Rejected: 0, Time: 00:00
ETH: GPU0 23.601 Mh/s, GPU1 23.660 Mh/s
ETH: 07/27/17-20:51:49 - New job from us1.ethermine.org:4444
ETH - Total Speed: 47.186 Mh/s, Total Shares: 0, Rejected: 0, Time: 00:03
ETH: GPU0 23.501 Mh/s, GPU1 23.685 Mh/s
GPU0 t=58C fan=42%, GPU1 t=56C fan=38%
ETH: 07/27/17-20:54:11 - New job from us1.ethermine.org:4444
ETH - Total Speed: 47.276 Mh/s, Total Shares: 0, Rejected: 0, Time: 00:03
ETH: GPU0 23.604 Mh/s, GPU1 23.662 Mh/s
ETH: 07/27/17-20:54:11 - SHARE FOUND - (GPU 1)
ETH: Share rejected (78 ms)!
ETH: Stratum - socket send failed 10053, disconnect
ETH: Connection lost
GPU0 t=65C fan=52%, GPU1 t=60C fan=44%
ETH: Stratum - connecting to 'us1.ethermine.org' <149.56.26.221> port 4444
ETH: Stratum - Connected (us1.ethermine.org:4444)
ETH: Authorized
ETH: 07/27/17-20:54:45 - SHARE FOUND - (GPU 0)
ETH: Share accepted (78 ms)!
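The numbers after "socket send failed" in the two outputs (10038 and 10053) match values in the Windows Sockets error code list, where 10038 is WSAENOTSOCK and 10053 is WSAECONNABORTED. Whether the miner is printing the raw Winsock error is an assumption; nothing in this thread confirms it. Under that assumption, a small lookup like the sketch below could annotate the log lines. The function name is made up for illustration.

```python
import re

# Names and meanings from the Windows Sockets error code list. That the miner
# reports these exact codes is an assumption, not something the thread confirms.
WINSOCK_CODES = {
    10038: ("WSAENOTSOCK", "socket operation on a non-socket"),
    10053: ("WSAECONNABORTED", "software caused connection abort"),
}

SOCKET_FAIL = re.compile(r"socket send failed (\d+)")


def describe_socket_error(line):
    """Return a readable description for a 'socket send failed' line, else None."""
    match = SOCKET_FAIL.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name, meaning = WINSOCK_CODES.get(code, ("unknown", "not in this table"))
    return f"{code} {name}: {meaning}"


if __name__ == "__main__":
    print(describe_socket_error("ETH: Stratum - socket send failed 10038, disconnect"))
    print(describe_socket_error("ETH: Stratum - socket send failed 10053, disconnect"))
```

If the assumption holds, the two codes point at different situations: the first suggests the send was attempted on a handle that was no longer a valid socket, the second that the connection was aborted locally.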

yodathegrey commented 7 years ago

Can mods please close this thread? Pebkac problems do not require github error tickets.

tateconcepts commented 7 years ago

Hmm, I appreciate your comment. I am not the error, nor do my Windows event logs show any errors. I have no other issues now but the following

ETH: Stratum - socket send failed 10038, disconnect
ETH: Connection lost
WATCHDOG: GPU 1 hangs in OpenCL call, exit

That is not a problem between myself and the keyboard. I am fairly confident mining software is becoming more competitive and will likely become a SaaS-like solution in the near future. There are other miners to choose from, and I pay the fee just like everyone. My GPU does not hang with any other miner - that's the issue here.

I would have preferred to open my thread with only this: why is there a GPU hanging only after it attempts to make a stratum connection? I don't have temps going through the roof, I am not too far overclocked, and the OS does not freeze. There are no risers, no voltage issues, and GPU0 gets warmer than GPU1. PEBKAC - I guess this is why I stopped working help desk after college.

tateconcepts commented 7 years ago

All, you can close this issue. I also tried these on different model MSI cards (1060/1070) and currently have them fairly stable. If I recall, I think there was a parameter or two that the cards did not like, and therefore the miner would not attempt to restart at all. Once I removed it, the problem was resolved and I haven't messed with it since. At this moment, I have everything fairly stable and undervolted with OC, but thank you again for the reply.