theREALeson closed this issue 5 years ago
Same issue just happened on RC4; two GPUs, and GPU #0 randomly goes offline after multiple hours online.
Would be great if it could either: add a line to the log saying that a GPU is offline or have the script kill itself when one goes offline. With either addition, you could have it monitored and restarted automatically.
Same here with 5x Vega 64 and an AMD 2200G (4/4). I get "CPU too busy" or something like that when GPUs go offline.
Try increasing your virtual memory to (RAM per GPU, 8 or 4 GB) * number of GPUs.
I already have 60 GB of virtual RAM.
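As a worked example of the sizing rule above, here is a minimal sketch. The function name is made up for illustration, and the rule itself is only the suggestion from this thread, not an official requirement.

```python
def suggested_virtual_memory_gb(gpu_memory_gb: int, gpu_count: int) -> int:
    """Rule of thumb from this thread: per-GPU memory times the number of GPUs."""
    if gpu_memory_gb <= 0 or gpu_count <= 0:
        raise ValueError("gpu_memory_gb and gpu_count must be positive")
    return gpu_memory_gb * gpu_count


# Five 8 GB cards and six 4 GB cards, as two illustrative rigs.
five_by_eight = suggested_virtual_memory_gb(8, 5)
six_by_four = suggested_virtual_memory_gb(4, 6)
```

By this rule a rig with five 8 GB cards would want about 40 GB, so a 60 GB page file is already above the suggested size.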
So it happens to both nvidia and AMD solvers that use different code completely? That could narrow it down, seems to be the last big issue remaining.
I've seen it happen on my GPU-0 while I'm using the system and the card was never actually offline. (Ubuntu 18.04, 2x1080ti)
It does happen on both nvidia and AMD based on DarkAngels rig. It happens on my nvidia rigs. I definitely have enough virtual memory. I was thinking it was an overstrained CPU, but I disabled 1 of 6 cards and ran with 5 GPUs, and one still went offline. CPU optimizations may still help.
I'm on Windows 10. Looks like nymd is on Linux. So both operating systems too.
Guys, I need to know one critical piece of information... does the affected CudaSolver.exe process exit or hang? Confirm by counting the number of running solvers vs. active GPUs. Thanks.
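One possible way to do that count on Windows is to save the output of `tasklist /FO CSV /NH` and count the solver rows. This is only a sketch: the function names are hypothetical, and the image name `CudaSolver.exe` is taken from this thread and may differ on your rig.

```python
import csv
import io


def count_solver_processes(tasklist_csv: str, image_name: str = "CudaSolver.exe") -> int:
    """Count rows whose first column (the image name) matches, ignoring case."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return sum(1 for row in rows if row and row[0].lower() == image_name.lower())


def solver_state(running_solvers: int, configured_gpus: int) -> str:
    """Fewer solvers than GPUs suggests an exit; an equal count suggests a hang."""
    if running_solvers < configured_gpus:
        return "a solver process exited"
    return "all solver processes are still running (the offline one may be hung)"


# Made-up sample of `tasklist /FO CSV /NH` output, for illustration only.
SAMPLE = (
    '"CudaSolver.exe","4321","Console","1","512,000 K"\n'
    '"CudaSolver.exe","4322","Console","1","498,112 K"\n'
    '"explorer.exe","1200","Console","1","90,000 K"\n'
)

if __name__ == "__main__":
    print(solver_state(count_solver_processes(SAMPLE), configured_gpus=2))
```

If the count matches the number of GPUs while one card shows offline, the solver is most likely hung rather than gone.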
I'll let you know if/when I see it again but someone with more GPUs will probably see it first.
Just hit this same issue. CudaSolver is running for all 5 GPUs on my rig. GPU 2 is showing offline while 0, 1, 3, 4 all show online. One CudaSolver is using 0% CPU (assumed to be the one tied to the GPU showing offline).
found these errors in the log at same time it appears to go offline:
2019-01-17T02:37:19Z ERROR, Listen errorUnable to find assembly 'ManagedCuda, Version=9.1.300.0, Culture=neutral, PublicKeyToken=242d898828717aa0'.
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 06-07-00-00-00-19-4D-61-6E-61-67-65-64-43-75-64-61 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 2E-43-75-64-61-45-78-63-65-70-74-69-6F-6E-06-08-00 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 00-00-AD-02-45-72-72-6F-72-49-6C-6C-65-67-61-6C-41 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 64-64-72-65-73-73-3A-20-57-68-69-6C-65-20-65-78-65 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 63-75-74-69-6E-67-20-61-20-6B-65-72-6E-65-6C-2C-20 ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 74-68-65-20-64-65-76-69-63-65-20-65-6E-63-6F-75-6E ...
2019-01-17T02:37:19Z ERROR, Listen errorThe input stream is not a valid binary format. The starting contents (in bytes) are: 74-65-72-65-64-20-61-20-6C-6F-61-64-20-6F-72-20-73 ...
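The byte dumps in those log lines are mostly plain ASCII, so they can be decoded to see what the solver was trying to report. A minimal sketch follows (the helper name is made up). Non-printable bytes are shown as dots, and the dumps are truncated in the log, so the decoded text is only a fragment.

```python
def decode_hex_dump(dump: str) -> str:
    """Turn a dash-separated hex dump into text, replacing non-printable bytes with '.'."""
    raw = bytes.fromhex(dump.replace("-", ""))
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# The seven dumps from the log above, in order, without the trailing "...".
DUMPS = [
    "06-07-00-00-00-19-4D-61-6E-61-67-65-64-43-75-64-61",
    "2E-43-75-64-61-45-78-63-65-70-74-69-6F-6E-06-08-00",
    "00-00-AD-02-45-72-72-6F-72-49-6C-6C-65-67-61-6C-41",
    "64-64-72-65-73-73-3A-20-57-68-69-6C-65-20-65-78-65",
    "63-75-74-69-6E-67-20-61-20-6B-65-72-6E-65-6C-2C-20",
    "74-68-65-20-64-65-76-69-63-65-20-65-6E-63-6F-75-6E",
    "74-65-72-65-64-20-61-20-6C-6F-61-64-20-6F-72-20-73",
]

decoded = [decode_hex_dump(d) for d in DUMPS]

if __name__ == "__main__":
    for fragment in decoded:
        print(repr(fragment))
```

Read in order, the fragments spell out "ManagedCuda", ".CudaException", "ErrorIllegalA", "ddress: While exe", "cuting a kernel, ", "the device encoun", "tered a load or s". That looks like a serialized ManagedCuda `CudaException` for an illegal address error that the listener could not deserialize, which would make the "not a valid binary format" lines a symptom rather than the root cause.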
Thanks, I think I know what the problem might be. Will try to have a fix soon.
same issue, thanks for quick response @phooton.
Any progress on this? Very frustrating to constantly monitor to make sure a GPU has not dropped offline. If it's something that's going to take a while to be fixed, I will just write a program to monitor stdout for an offline GPU, then quit and restart the miner. Would obviously prefer not to waste time on this though.
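A rough sketch of the kind of stdout watchdog described above. The marker text `offline` is an assumption (check what your miner version actually prints), the function names are hypothetical, and the miner command is whatever you normally launch.

```python
import subprocess
import sys
import time

OFFLINE_MARKER = "offline"  # assumption: the miner prints this word when a GPU drops


def is_offline_line(line: str, marker: str = OFFLINE_MARKER) -> bool:
    """Case-insensitive check for the offline marker in one output line."""
    return marker.lower() in line.lower()


def run_until_offline(cmd, marker: str = OFFLINE_MARKER):
    """Run cmd; return the first output line containing marker, or None if it exits first."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        for line in proc.stdout:
            if is_offline_line(line, marker):
                return line.rstrip()
        return None
    finally:
        if proc.poll() is None:
            proc.kill()  # kills only the launched process, not necessarily its children
        proc.wait()
        proc.stdout.close()


def supervise(cmd, max_restarts: int = 3, delay_s: float = 5.0) -> int:
    """Restart cmd each time a GPU is reported offline; return the number of restarts."""
    restarts = 0
    while restarts < max_restarts and run_until_offline(cmd) is not None:
        restarts += 1
        time.sleep(delay_s)
    return restarts


if __name__ == "__main__" and len(sys.argv) > 1:
    supervise(sys.argv[1:])
```

This only helps if the miner and its solver processes can actually be killed. Solver processes started by the miner may outlive it, and a solver that hangs hard enough to survive Task Manager would survive this too.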
A new test build will appear in release 2.7 section in half an hour that should either resolve the issue or identify the cause. Based on latest master branch.
@phooton is there a beta for Linux?
Oddly enough...I have a different rig that now gets this issue with RC7 and 8-1, but doesn't have the issue with RC4...which I'm not supposed to use on grinmint fml
Please do a version of 2.8 stable, without the fix for this issue...whatever was changed is causing more GPUs to go offline. I've managed to get low quality CPU rigs to not drop GPUs by turning off antivirus, afterburner, etc.
Not to argue with your experience, but I was getting the issue and I only have two cards and the latest Ryzen; not an underpowered CPU situation.
Have the new versions made it better for you? I've got rigs that didn't have the issue and are now having it.
There is no fix to revert. The cause is either HW error or incredibly rare illegal memory access in GPU code. It bricks the whole worker process. I'm both researching the cause and implementing a fix that will monitor faulty GPUs and restart - it will appear first in GrinPro closed source GGM variant where GPU watchdog is already present and if that works it will be back ported to GGM. I'll need to gather some logs from multiple rigs. GPU worker must exit cleanly and be relaunched successfully for this to work. I'm mining on 6 cards and I have yet to see this once so I'm puzzled.
With the latest version, after ~15-45 minutes one GPU goes offline and OpenCLSolver hangs completely (it can't be terminated with Task Manager). The only way to mine again with GGM or another miner is to restart Windows; resetting the Vegas hangs Windows.
That sucks... grin-miner is running my OpenCL code so if the code is the problem, it would be hanging cards as well. Clearly only some rigs appear to be affected.
I'm using 18.6.1, large pages on, hbcc off and it's stable with sbrminer for weeks
Still no fix? One of my 3 rigs has the same problem :(
I also have this issue. Windows 10 + 8 AMD cards (8 GB each) + 64 GB of virtual memory. At random times GPU 0 goes offline; it is always GPU 0 (OCLsolver crashes and I need to force close it to restart the miner).
@Anycubic did you find the problem? I still have exactly the same rigs and the same problem as you.
@bruslie I think it could be related to core overclocking, try to overclock less and see what happens. I was having rig shutdown even with stock clock/voltages
It happens if I set P3 to 950 or 1100 :( My other rigs run with 1120 with no problems.
Hi, has a solution to this problem been found? My rig used to be able to run a GTX 1050 Ti and a GTX 1080. They were both working. The first to go was the 1050 Ti. Now both are no longer working. I thought the 1050 Ti stopped working because of a memory problem, so I just let the 1080 work on its own. Then one day I got an error message saying my account has no connected rig. I rolled back my GPU driver to an older one and now it says my 1080 is there but offline. Any suggestions to solve this problem will be appreciated. Thanks.
I've tried with no overclock as well, one of six GTX 1070 stops mining after a relatively short, random period of time. Rig runs stable on other coins. CPU is a celeron G3900. Any suggestions? Have tried both RC3 and RC4.