Open krtschmr opened 7 years ago
Sorry, I got busy with another project. I made the change just now and will commit it within an hour.
Also, accessing the panel checks whether the internet is working. If the internet is down, there is no need for repeated reboots every 8 min. Obviously there are other ways to check for internet; I thought checking for panel info was not a bad approach.
in case the panel has false stats, no updates (which happens sometimes for a few rigs if they go zombie-load!) or for any other reason we can't get the data, we then actually don't get any current data.
i sometimes have a rig that goes high load but hashes like a champ. it simply is in high load, so i can't ssh into it and it can't update the panel (but it hashes!). if a card fails, we will never see it via the panel but via claymore-ethminer.exe or if we check locally. i think that's the better approach. the data is local, why gather it remotely?
checking whether we have internet is a nice thing tho
It happened to me once while mining xmr. Instead of gpu mining, it probably started cpu mining.
I think if the panel is not getting refreshed even once in 8 min, there is some problem which should be looked at.
I've pushed another change to handle above situation.
invalid url @ 2017-09-11 03:26:50.628405
invalid url @ 2017-09-11 03:30:50.651621
seems like something is wrong here
Please change the rig name and panel address to xxx if required and show me the output of "/home/ethos/gpu_crash.log".
i did send you an email
trying a workaround
i wonder how this can actually happen since the url should be always in the files.
maybe the best is to dump the json and switch to local reads, then we avoid this source of error.
how many rigs do you have to try out?
I've made rig name and panel url as optional arguments so that you can use them on those rigs where you are running into error.
today from ethosdistro channel "Configmaker / Stats panel temporarily down. Will update with ETA when available. website/update/update2/get/paste are online"
so, i will fork and make everything locally :P
I'm also facing some issue.
miner_hashes = map( float, commands.getstatusoutput("cat /var/run/ethos/miner_hashes.file")[1].split("\n")[-1].split() )
numGpus = int(commands.getstatusoutput("cat /var/run/ethos/gpucount.file")[1])
numRunningGpus = len(filter(lambda a: a > 0, miner_hashes))
we can use these and everything should work?
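A minimal sketch of the local-read idea, written for Python 3 instead of the Python 2 `commands` module used in the lines above. The two file paths are the ones quoted in this thread; the helper names (`parse_miner_hashes`, `count_running`, `read_local_stats`) are made up for illustration, and the file format is assumed to be whitespace-separated per-GPU hashrates on the last line.

```python
def parse_miner_hashes(text):
    """Return per-GPU hashrates from the last non-empty line of the file content."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    return [float(value) for value in lines[-1].split()]


def count_running(hashes):
    """Count GPUs that report a hashrate above zero."""
    return sum(1 for h in hashes if h > 0)


def read_local_stats(hash_path="/var/run/ethos/miner_hashes.file",
                     count_path="/var/run/ethos/gpucount.file"):
    """Read GPU count and hashrates from the local files (paths as quoted in this thread)."""
    with open(hash_path) as f:
        hashes = parse_miner_hashes(f.read())
    with open(count_path) as f:
        num_gpus = int(f.read().strip())
    return num_gpus, count_running(hashes), hashes
```

Keeping the parsing separate from the file access makes it possible to try the logic on a pasted line of hashrates without being on a rig.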
idea: this shorty does kinda the same?
https://pastebin.com/s4VewKJB edit: this even better: https://pastebin.com/8Zu5G5rA
Thanks.
I'll have a look later.
https://github.com/krtschmr/ethos_monitor/blob/master/check_crash.py
this works perfectly now, including autoupdate before it reboots, in case we changed anything. i'll run this version for my farm now (but somehow my farm has been stable since then. weird ;) )
Sorry, I'll be too busy in next 2 days to review this change.
@krtschmr does it work on ethos 1.2.9?
@LazyScream absolutely. However 1.2.9 wasn't stable for my farm so i kept them at 1.2.7. The script itself will work forever until they make major changes to the GPU statistics.
@krtschmr I found you do not need "rigname" and "ethosdistro.com/?json=yes" in your release. So I just put check_crash.py under /home/ethos and add "@reboot /home/ethos/ethos_monitor/check_crash.py" to crontab -e, and your script will run automatically, right?
almost :-)
wget https://raw.githubusercontent.com/krtschmr/ethos_monitor/master/check_crash.py
crontab -e
@reboot /home/ethos/check_crash.py
ctrl+o
python check_crash.py & # or you can run "r" for reboot
@krtschmr ok! thx all! and do you have any ideas for adding 「Pushover」 to this script?
ya, google knows
import http.client, urllib
conn = http.client.HTTPSConnection("api.pushover.net:443")
conn.request("POST", "/1/messages.json",
urllib.parse.urlencode({
"token": "APP_TOKEN",
"user": "USER_KEY",
"message": "RIG OFFLINE!!! OMG, we are boke!",
}), { "Content-type": "application/x-www-form-urlencoded" })
conn.getresponse()
Can I copy the code to any place in the script?
Do APP_TOKEN and USER_KEY need to be replaced?
i got some error
File "./check_crash.py", line 22, in
i really can't help with that, i'm not a specialist in python. obviously you need to bundle the http package first.
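The traceback above is cut off, so the actual cause is not known. One possibility, offered only as an assumption, is that the script runs under Python 2 (it uses the Python 2 `commands` module), where `http.client` and `urllib.parse` do not exist; their Python 2 counterparts are `httplib` and `urllib.urlencode`. A sketch that imports whichever is available, with hypothetical helper names `build_pushover_body` and `send_pushover`:

```python
try:  # Python 3 module names
    from http.client import HTTPSConnection
    from urllib.parse import urlencode
except ImportError:  # Python 2 fallbacks
    from httplib import HTTPSConnection
    from urllib import urlencode


def build_pushover_body(token, user, message):
    """Encode the form fields; a list of pairs keeps the field order fixed."""
    return urlencode([("token", token), ("user", user), ("message", message)])


def send_pushover(token, user, message):
    """POST the message to the endpoint used in the snippet above; returns the HTTP status."""
    conn = HTTPSConnection("api.pushover.net", 443)
    conn.request("POST", "/1/messages.json",
                 build_pushover_body(token, user, message),
                 {"Content-type": "application/x-www-form-urlencoded"})
    return conn.getresponse().status
```

`APP_TOKEN` and `USER_KEY` in the earlier snippet are placeholders; the real values from the Pushover account would be passed as `token` and `user`.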
i made a reboot function with telegram warning
import commands
import os
from urllib import urlopen
from urllib import quote

def RebootRig():
    # DumpActivity and miner_hashes come from the main check_crash.py script
    DumpActivity("Rebooting (" + str(miner_hashes) + ")")
    # first field of /proc/uptime is the uptime in seconds
    uptime = float(commands.getstatusoutput("cat /proc/uptime")[1].split()[0])
    m, s = divmod(uptime, 60)
    h, m = divmod(m, 60)
    msg = quote("Rig1 Reboot uptime " + str(h) + ":" + str(m) + ":" + str(s))
    # XXX:APIKEY and ID are placeholders for the bot token and chat id
    urlopen("https://api.telegram.org/botXXX:APIKEY/sendmessage?chat_id=ID&text=" + msg).read()
    os.system("sudo hard-reboot")
    os.system("sudo reboot")
and now i'm using @krtschmr's version. now i only need to test whether the uptime var is working ^_^ (just use telegram botfather to make a new bot and get the api key)
i think i found a bug in @krtschmr's version... in the disconnectCount part the script checks 12 times (without waiting), then triggers the break, and the script stops. i think you need to place a reboot or a continue or something else there, and a time.sleep too. i changed it this way:
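Since `/proc/uptime` reports fractional seconds, `divmod` on the raw float produces values such as `1.0`, which end up in the message text. A small sketch that truncates to whole seconds first; the function names `format_uptime` and `read_uptime_seconds` are hypothetical:

```python
def format_uptime(seconds):
    """Format a number of seconds as h:mm:ss, dropping the fractional part."""
    total = int(seconds)
    m, s = divmod(total, 60)
    h, m = divmod(m, 60)
    return "%d:%02d:%02d" % (h, m, s)


def read_uptime_seconds(path="/proc/uptime"):
    """Return the first field of /proc/uptime (seconds since boot) as a float."""
    with open(path) as f:
        return float(f.read().split()[0])
```

On a rig, `format_uptime(read_uptime_seconds())` gives the string to put into the Telegram message.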
if (numRunningGpus != numGpus or numGpus != 13):
    if (waitForReconnect == 1 and numRunningGpus == 0):
        # all GPUs dead. probably TCP disconnect / pool issue
        # we wait 12 times to resolve these issues. this equals 3 minutes. most likely appears with nicehash.
        disconnectCount += 1
        if (disconnectCount > 12):
            DumpActivity("Waiting for hashes back: " + str(disconnectCount))
            RebootRig()
            break
    else:
        disconnectCount = 0
        RebootRig()
        break
time.sleep(15)
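The same decision can be sketched as a small pure function, which makes the waiting logic easy to check without a rig. This is a Python 3 illustration under assumptions, not the code from either repository: the name `decide`, the action strings, and the reset of the counter when all GPUs hash are all made up here, and the expected GPU count is passed in instead of being hard-coded.

```python
MAX_DISCONNECT_CHECKS = 12  # 12 checks x 15 s sleep = 3 minutes of waiting


def decide(num_gpus, num_running, disconnect_count,
           max_checks=MAX_DISCONNECT_CHECKS):
    """Return (action, new_disconnect_count); action is 'ok', 'wait' or 'reboot'."""
    if num_running == num_gpus:
        # every GPU is hashing: nothing to do, forget earlier disconnects
        return "ok", 0
    if num_running == 0:
        # all GPUs at zero: probably a pool/TCP disconnect, so wait a while
        disconnect_count += 1
        if disconnect_count > max_checks:
            return "reboot", disconnect_count
        return "wait", disconnect_count
    # only some GPUs are dead: waiting will not help, reboot right away
    return "reboot", 0
```

A caller would run this once per loop iteration, sleep 15 seconds on `'ok'` or `'wait'`, and call the reboot function on `'reboot'`.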
@krtschmr is what @Trigun87 is saying true?
i don't know yet, had no time to look into, still trying to get new 600 gpu farm stable....
i can fix it later
600 gpu? 😮
ok i fixed the check for disconnect (the var waitForReconnect was useless since it was always 1)
https://github.com/Trigun87/ethos_monitor
i just forked ^_^ i use a new file for the telegram warning (disabled by default) and for the number of gpus on the rig (if it starts with fewer gpus it will reboot)
@Trigun87 wanna merge into mine?
@krtschmr if u like my version ^_^ (btw is it something u should do or something i should do? never merged anything :-P)
@jmverges how to work in this ethOS FRiends group? i cant create repositories or do anything...
@Trigun87 check gist:
so, my problem is that nicehash terminates the connections sometimes, and/or i don't have work. if i reboot, then they are hashing. sometimes 3/4 of the farm is dead overnight. the issue is the reboot script. ethos 1.2.7 (all <1.2.9) has issues with claymore then, still reporting SOME hashrate, even tho it's zero. i can't upgrade to 1.3.0 since powerplay messes up and we would use 8% more electricity
this should fix it. maybe useful for anybody? https://gist.github.com/krtschmr/a915ee7fa9c9c42961a2376dfebf208b
panel can be totally ignored since we have the data in all local files available (those are the data that gets reported anyways).
we can simply run the script, no need for any configurations. makes it way easier.