mozkomor / GrinGoldMiner

Grin open-source miner
GNU General Public License v3.0

A single GPU goes offline after random period of time #50

Closed theREALeson closed 5 years ago

theREALeson commented 5 years ago

I've tried with no overclock as well; one of six GTX 1070s stops mining after a relatively short, random period of time. Rig runs stable on other coins. CPU is a Celeron G3900. Any suggestions? Have tried both RC3 and RC4.

nymd commented 5 years ago

Same issue just happened on RC4: two GPUs, and GPU #0 randomly went offline after multiple hours online.

nymd commented 5 years ago

Would be great if it could either: add a line to the log saying that a GPU is offline or have the script kill itself when one goes offline. With either addition, you could have it monitored and restarted automatically.
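
The monitoring idea above can be sketched as a small log scanner. This is only an illustration: the thread does not show what GrinGoldMiner actually prints when a card drops, so the pattern below (a line naming a GPU index and the word "offline") and the names `OFFLINE_PATTERN` and `find_offline_gpus` are assumptions to adapt to the real log.

```python
import re

# ASSUMPTION: an offline card shows up as a line containing "GPU <n>" and "offline".
# The real GrinGoldMiner wording may differ; adjust the pattern to match your log.
OFFLINE_PATTERN = re.compile(r"gpu\s*#?\s*(\d+)\b.*\boffline\b", re.IGNORECASE)


def find_offline_gpus(lines):
    """Return the sorted GPU indices named on lines that also say 'offline'."""
    found = set()
    for line in lines:
        match = OFFLINE_PATTERN.search(line)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


sample = ["GPU 0: online", "GPU 2: OFFLINE", "gpu #1 went offline"]
print(find_offline_gpus(sample))  # [1, 2]
```

A monitoring script could feed this function the tail of the log file and trigger a restart whenever the returned list is non-empty.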

TheDarkAngel666 commented 5 years ago

Same here with 5x Vega 64 and an AMD 2200G (4/4): "CPU too busy" or something like that when GPUs go offline.

nerdatwork commented 5 years ago

Try increasing your virtual memory to (RAM per GPU, 8 or 4 GB) × number of GPUs.
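
As a quick worked example of that rule of thumb (the function name is made up for illustration, and the rule itself is the commenter's suggestion, not a documented requirement):

```python
def suggested_virtual_memory_gb(vram_per_gpu_gb, gpu_count):
    """Rule of thumb from the comment above: VRAM per card times number of cards."""
    return vram_per_gpu_gb * gpu_count


# Six GTX 1070 cards with 8 GB each, as in the original report.
print(suggested_virtual_memory_gb(8, 6))  # 48
```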

TheDarkAngel666 commented 5 years ago

I already have 60 GB of virtual memory.

phooton commented 5 years ago

So it happens to both the NVIDIA and AMD solvers, which use completely different code? That could narrow it down. It seems to be the last big issue remaining.

nymd commented 5 years ago

I've seen it happen on my GPU-0 while I'm using the system and the card was never actually offline. (Ubuntu 18.04, 2x1080ti)

theREALeson commented 5 years ago

It does happen on both NVIDIA and AMD, based on DarkAngel's rig. It happens on my NVIDIA rigs. I definitely have enough virtual memory. I was thinking an overstrained CPU, but I disabled one of the six cards, ran with 5 GPUs, and one still went offline. CPU optimizations may still help.

theREALeson commented 5 years ago

I'm on Windows 10. Looks like nymd is on Linux. So both operating systems too.

phooton commented 5 years ago

Guys, I need to know one critical piece of information... does the affected CudaSolver.exe process exit or hang? Confirm by counting the number of running solvers vs. active GPUs. Thanks.
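
The check requested above boils down to comparing three counts. A minimal sketch, assuming one solver process per GPU (the thread suggests this but does not state it outright); `diagnose` and its return labels are hypothetical names:

```python
def diagnose(running_solvers, total_gpus, online_gpus):
    """Classify the failure from counts read off the process list and the miner UI.

    ASSUMPTION: the miner starts exactly one solver process per GPU.
    """
    if online_gpus >= total_gpus:
        return "healthy"
    if running_solvers < total_gpus:
        return "exited"  # a solver process is gone
    return "hung"  # every solver is still alive, yet a GPU shows offline


print(diagnose(running_solvers=5, total_gpus=5, online_gpus=4))  # hung
```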

nymd commented 5 years ago

I'll let you know if/when I see it again but someone with more GPUs will probably see it first.

robertdavis1 commented 5 years ago

Just hit this same issue. Cudasolver is running for all 5 GPUs on my rig. GPU two is showing offline while 0,1,3,4 all show online. One Cudasolver is using 0% CPU (assumed to be the one tied to the GPU showing offline).

robertdavis1 commented 5 years ago

found these errors in the log at same time it appears to go offline:
2019-01-17T02:37:19Z ERROR, Listen errorUnable to find assembly 'ManagedCuda, Version=9.1.300.0, Culture=neutral, PublicKeyToken=242d898828717aa0'.
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 06-07-00-00-00-19-4D-61-6E-61-67-65-64-43-75-64-61 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 2E-43-75-64-61-45-78-63-65-70-74-69-6F-6E-06-08-00 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 00-00-AD-02-45-72-72-6F-72-49-6C-6C-65-67-61-6C-41 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 64-64-72-65-73-73-3A-20-57-68-69-6C-65-20-65-78-65 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 74-68-65-20-64-65-76-69-63-65-20-65-6E-63-6F-75-6E ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 74-65-72-65-64-20-61-20-6C-6F-61-64-20-6F-72-20-73 ...
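
The byte dumps in those log lines are plain ASCII and can be read directly. The sketch below decodes them, showing non-printable bytes as dots; `decode_dump` is a helper written for this illustration, and the dumps are copied verbatim from the log above (each one is truncated in the log, so the decoded text is incomplete too).

```python
def decode_dump(hex_dump):
    """Turn a '4D-61-6E' style byte dump into text, showing non-printable bytes as '.'."""
    raw = bytes.fromhex(hex_dump.replace("-", ""))
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


dumps = [
    "06-07-00-00-00-19-4D-61-6E-61-67-65-64-43-75-64-61",
    "2E-43-75-64-61-45-78-63-65-70-74-69-6F-6E-06-08-00",
    "00-00-AD-02-45-72-72-6F-72-49-6C-6C-65-67-61-6C-41",
    "64-64-72-65-73-73-3A-20-57-68-69-6C-65-20-65-78-65",
    "63-75-74-69-6E-67-20-61-20-6B-65-72-6E-65-6C-2C-20",
    "74-68-65-20-64-65-76-69-63-65-20-65-6E-63-6F-75-6E",
    "74-65-72-65-64-20-61-20-6C-6F-61-64-20-6F-72-20-73",
]

# The first two fragments name the exception type; the rest carry its message.
print(decode_dump(dumps[0]) + decode_dump(dumps[1]))
print("".join(decode_dump(d) for d in dumps[2:]))
```

The second line printed is `....ErrorIllegalAddress: While executing a kernel, the device encountered a load or s`. That reads like a serialized ManagedCuda `CudaException` reporting an illegal address during kernel execution, which would be consistent with the "illegal memory access in GPU code" explanation given later in the thread.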

phooton commented 5 years ago

Thanks, I think I know what the problem might be. Will try to have a fix soon.

HudsonProdigy commented 5 years ago

same issue, thanks for quick response @phooton.

HudsonProdigy commented 5 years ago

Any progress on this? It's very frustrating to constantly monitor to make sure a GPU has not dropped offline. If it's something that's going to take a while to be fixed, I will just write a program to monitor stdout for an offline GPU, then quit and restart the miner. Would obviously prefer not to waste time on this, though.
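
A stopgap along those lines could look like the sketch below. It is not part of GrinGoldMiner: the function names are hypothetical, the "offline" marker text is an assumption about the miner's output, and `DEMO_CMD` is a stand-in that only prints one line so the example can run without a miner installed.

```python
import subprocess
import sys

# Stand-in for the real miner command line (ASSUMPTION: replace with your own).
DEMO_CMD = [sys.executable, "-c", "print('GPU 0: OFFLINE')"]


def run_until_offline(cmd, marker="offline"):
    """Run cmd and stop it at the first stdout line containing marker.

    Returns True if the marker was seen (the caller should restart) and
    False if the process ended on its own.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        for line in proc.stdout:
            if marker in line.lower():
                return True
        return False
    finally:
        # Make sure the child is gone before a new one is started.
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
        proc.stdout.close()


def supervise(cmd, max_restarts=3):
    """Restart cmd each time it reports an offline GPU; return the restart count."""
    restarts = 0
    while run_until_offline(cmd) and restarts < max_restarts:
        restarts += 1
    return restarts


print(supervise(DEMO_CMD, max_restarts=2))  # 2
```

A wrapper like this only helps when the miner and its solver processes can actually be terminated; it cannot recover a solver that is stuck in a state the operating system will not kill.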

phooton commented 5 years ago

A new test build will appear in the release 2.7 section in half an hour that should either resolve the issue or identify the cause. It is based on the latest master branch.

HudsonProdigy commented 5 years ago

@phooton is there a beta for Linux?

theREALeson commented 5 years ago

Oddly enough...I have a different rig that now gets this issue with RC7 and 8-1, but doesn't have the issue with RC4...which I'm not supposed to use on grinmint fml

theREALeson commented 5 years ago

Please do a version of 2.8 stable, without the fix for this issue...whatever was changed is causing more GPUs to go offline. I've managed to get low quality CPU rigs to not drop GPUs by turning off antivirus, afterburner, etc.

nymd commented 5 years ago

Not to argue with your experience, but I was getting the issue and I only have two cards and the latest Ryzen; not an underpowered CPU situation.

theREALeson commented 5 years ago

Have the new versions made it better for you? I've got rigs that didn't have the issue and now have it.

phooton commented 5 years ago

There is no fix to revert. The cause is either a HW error or an incredibly rare illegal memory access in GPU code. It bricks the whole worker process. I'm both researching the cause and implementing a fix that will monitor faulty GPUs and restart them. It will appear first in GrinPro, the closed-source GGM variant where a GPU watchdog is already present, and if that works it will be backported to GGM. I'll need to gather some logs from multiple rigs. The GPU worker must exit cleanly and be relaunched successfully for this to work. I'm mining on 6 cards and I have yet to see this once, so I'm puzzled.
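
For readers unfamiliar with the watchdog idea mentioned above, here is a generic heartbeat sketch. It is not GrinPro's or GGM's implementation (neither is shown in this thread); the class name, the 60-second timeout, and the notion that each worker reports a heartbeat when it finishes a job are all assumptions made for illustration.

```python
import time


class GpuWatchdog:
    """Flags GPUs that have not reported work within `timeout` seconds."""

    def __init__(self, gpu_ids, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be tested without waiting
        now = clock()
        self.last_seen = {gpu: now for gpu in gpu_ids}

    def heartbeat(self, gpu_id):
        """Call whenever the worker for gpu_id reports a finished job."""
        self.last_seen[gpu_id] = self.clock()

    def stalled(self):
        """Return the GPUs whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            gpu for gpu, seen in self.last_seen.items() if now - seen > self.timeout
        )


# Simulated run with a fake clock: GPU 2 stops reporting.
fake_time = [0.0]
watchdog = GpuWatchdog([0, 1, 2], timeout=60, clock=lambda: fake_time[0])
fake_time[0] = 50.0
watchdog.heartbeat(0)
watchdog.heartbeat(1)
fake_time[0] = 100.0
print(watchdog.stalled())  # [2]
```

The supervising process would then kill and relaunch the worker for each stalled GPU, which is where the requirement that the worker exits cleanly comes in.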

TheDarkAngel666 commented 5 years ago

With the latest version, after ~15-45 minutes one GPU goes offline and openclsolver hangs completely (it can't be terminated with Task Manager). The only way to mine again with GGM or another miner is to restart Windows; resetting the Vegas hangs Windows.

phooton commented 5 years ago

That sucks... grin-miner is running my OpenCL code so if the code is the problem, it would be hanging cards as well. Clearly only some rigs appear to be affected.

TheDarkAngel666 commented 5 years ago

I'm using 18.6.1, large pages on, HBCC off, and it has been stable with sbrminer for weeks.

bruslie commented 5 years ago

Still no fix? One of my 3 rigs has the same problem :(

Anycubic commented 5 years ago

I also have this issue. Windows 10 + 8 AMD cards (8 GB) + 64 GB of virtual memory. At random times, it is always GPU 0 that goes offline (OCLsolver crashes and I need to force-close it to restart the miner).

bruslie commented 5 years ago

@Anycubic did you find the problem? I still have exactly the same rigs and problem as you.

Anycubic commented 5 years ago

@bruslie I think it could be related to core overclocking; try to overclock less and see what happens. I was having rig shutdowns even with stock clocks/voltages.

bruslie commented 5 years ago

It happens if I set P3 to 950 or 1100 :( My other rigs run at 1120 with no problems.

paulwriter205 commented 3 years ago

Hi, has a solution to this problem been found? My rig used to be able to run a GTX 1050 Ti and a GTX 1080. They were both working. The first to go was the 1050 Ti. Now both are no longer working. I thought the 1050 Ti stopped working because of a memory problem, so I just let the 1080 work on its own. Then one day I got an error message saying my account has no connected rig. I rolled back my GPU driver to an older one and now it says my 1080 is there but offline. Any suggestion to solve this problem will be appreciated. Thanks.