transeos / ethos_monitor


use local files instead of arguments #1

Open krtschmr opened 7 years ago

krtschmr commented 7 years ago

the panel can be totally ignored since all the data is available in local files (that's the data that gets reported anyway).

we can simply run the script, no need for any configuration. makes it way easier.

transeos commented 7 years ago

Sorry, I got busy with another project. I made the change just now. I'll commit it within an hour.

Also, accessing the panel checks whether the internet is working. If the internet is down, there is no need for repeated reboots every 8 min. Obviously there are other ways to check for internet; I thought checking for panel info was not a bad approach.

krtschmr commented 7 years ago

if the panel has false stats or no updates (which happens sometimes for a few rigs if they go zombie-load!), or if for any other reason we can't get the data, then we don't actually have any current data.

i sometimes have a rig that goes high load but hashes like a champ. it's simply in high load, so i can't ssh into it and it can't update (but it hashes!). if a card fails, we will never see it via the panel, only via claymore-ethminer.exe or if we check locally. i think that's the better approach. the data is local, why gather it remotely?

checking whether we have internet is a nice thing tho
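
A connectivity check does not have to go through the panel. The following is a minimal Python 3 sketch, not code from either script in this thread. The function name `has_internet` and the default target (`8.8.8.8` port 53, a public DNS server) are assumptions chosen for illustration; any reliably reachable host would do.

```python
import socket


def has_internet(host="8.8.8.8", port=53, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    The default host/port is an assumed, commonly reachable target;
    swap in whatever endpoint makes sense for the rig's network.
    """
    try:
        # create_connection raises OSError (timeout, refused, unreachable)
        # when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A monitor loop could call this before deciding to reboot and skip the reboot while it returns False, since rebooting does not help when the uplink itself is down.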

transeos commented 7 years ago

It happened to me once while mining xmr. Instead of gpu mining, it probably started cpu mining.

I think if the panel is not getting refreshed even once in 8 min, there is some problem which should be looked at.

transeos commented 7 years ago

I've pushed another change to handle the above situation.

krtschmr commented 7 years ago

invalid url @ 2017-09-11 03:26:50.628405
invalid url @ 2017-09-11 03:30:50.651621

seems like something is wrong here

transeos commented 7 years ago

Please change the rig name and panel address to xxx if required and show me the output of "/home/ethos/gpu_crash.log".

krtschmr commented 7 years ago

i did send you an email

transeos commented 7 years ago

trying a workaround

krtschmr commented 7 years ago

i wonder how this can actually happen since the url should be always in the files.

maybe the best is to dump the json and switch to local reads, then we avoid this source of error.

how many rigs do you have to try out?

transeos commented 7 years ago

I've made the rig name and panel url optional arguments so that you can use them on those rigs where you are running into errors.

krtschmr commented 7 years ago

today from ethosdistro channel "Configmaker / Stats panel temporarily down. Will update with ETA when available. website/update/update2/get/paste are online"

so, i will fork and make everything locally :P

transeos commented 7 years ago

I'm also facing some issue.

krtschmr commented 7 years ago

    import commands  # Python 2 only
    # per-GPU hashrates, taken from the last line of the local stats file
    miner_hashes = map( float, commands.getstatusoutput("cat /var/run/ethos/miner_hashes.file")[1].split("\n")[-1].split() )
    numGpus = int(commands.getstatusoutput("cat /var/run/ethos/gpucount.file")[1])
    # a GPU counts as running if it reports a hashrate above zero
    numRunningGpus = len(filter(lambda a: a > 0, miner_hashes))

can we use these, and everything should work?
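
For comparison, the same local reads can be done without shelling out to `cat`. This is a hedged Python 3 sketch, not the code that ended up in either repository. The two file paths come from the snippet above; the function names are made up for illustration, and the file format (whitespace-separated per-GPU hashrates on the last line) is assumed from how that snippet parses it.

```python
from pathlib import Path

# Paths taken from the snippet in this thread.
HASHES_FILE = Path("/var/run/ethos/miner_hashes.file")
GPUCOUNT_FILE = Path("/var/run/ethos/gpucount.file")


def parse_hashes(text):
    """Parse per-GPU hashrates from the last non-empty line of the file text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    return [float(value) for value in lines[-1].split()]


def count_running(hashes):
    """Count GPUs that report a hashrate above zero."""
    return sum(1 for h in hashes if h > 0)


def read_gpu_state(hashes_file=HASHES_FILE, gpucount_file=GPUCOUNT_FILE):
    """Return (total_gpus, running_gpus) read from the local ethOS files."""
    hashes = parse_hashes(hashes_file.read_text())
    total = int(gpucount_file.read_text().strip())
    return total, count_running(hashes)
```

Keeping the parsing separate from the file access means the parsing can be checked against sample text on a machine that is not a rig.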

krtschmr commented 7 years ago

idea: this shorty does kinda the same?

https://pastebin.com/s4VewKJB edit: this is even better: https://pastebin.com/8Zu5G5rA

transeos commented 7 years ago

Thanks.

I'll have a look later.

krtschmr commented 7 years ago

https://github.com/krtschmr/ethos_monitor/blob/master/check_crash.py

this works perfectly now, including autoupdate before it reboots, in case we changed anything. i'll run this version on my farm now (but somehow my farm has been stable since then. weird ;) )

transeos commented 7 years ago

Sorry, I'll be too busy for the next 2 days to review this change.

ghost commented 6 years ago

@krtschmr does it work on ethos 1.2.9?

krtschmr commented 6 years ago

@LazyScream absolutely. However, 1.2.9 wasn't stable for my farm so i kept my rigs at 1.2.7. The script itself will work forever, until they make major changes to the GPU statistics.

ghost commented 6 years ago

@krtschmr I found that you do not need "rigname" and "ethosdistro.com/?json=yes" in your release. So I just put check_crash.py under /home/ethos and add "@reboot /home/ethos/ethos_monitor/check_crash.py" to crontab -e, and your script will run automatically, right?

krtschmr commented 6 years ago

almost :-)

wget https://raw.githubusercontent.com/krtschmr/ethos_monitor/master/check_crash.py
crontab -e
@reboot /home/ethos/check_crash.py
ctrl+o
python check_crash.py & # or you can run "r" for reboot

ghost commented 6 years ago

@krtschmr ok! thx all! and do you have any ideas for adding 「Pushover」 to this script?

krtschmr commented 6 years ago

ya, google knows


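# Python 3 only: http.client and urllib.parse do not exist in Python 2
import http.client, urllib.parse
conn = http.client.HTTPSConnection("api.pushover.net:443")
conn.request("POST", "/1/messages.json",
  urllib.parse.urlencode({
    "token": "APP_TOKEN",  # replace with your Pushover application token
    "user": "USER_KEY",    # replace with your Pushover user key
    "message": "RIG OFFLINE!!! OMG, we are broke!",
  }), { "Content-type": "application/x-www-form-urlencoded" })
conn.getresponse()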

ghost commented 6 years ago

Can I paste this code anywhere in the script? Do APP_TOKEN and USER_KEY need to be replaced? //// i got an error:

    File "./check_crash.py", line 22, in <module>
        import http.client
    ImportError: No module named http.client

krtschmr commented 6 years ago

i really can't help with that, i'm not a specialist in python. obviously you need to bundle the http package first.
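
The ImportError above is what Python 2 raises for this import: `http.client` exists only in Python 3, and the Python 2 equivalents are `httplib` and `urllib.urlencode`. Below is a minimal sketch that imports whichever is available. The host, path, and form fields are the ones from the snippet in this thread; the function names `build_pushover_body` and `send_pushover` are made up for illustration.

```python
try:  # Python 3
    from http.client import HTTPSConnection
    from urllib.parse import urlencode
except ImportError:  # Python 2
    from httplib import HTTPSConnection
    from urllib import urlencode


def build_pushover_body(token, user, message):
    """Build the form-encoded POST body for a Pushover message."""
    return urlencode({"token": token, "user": user, "message": message})


def send_pushover(token, user, message):
    """POST the message to Pushover and return the HTTP status code."""
    conn = HTTPSConnection("api.pushover.net", 443)
    conn.request(
        "POST",
        "/1/messages.json",
        build_pushover_body(token, user, message),
        {"Content-type": "application/x-www-form-urlencoded"},
    )
    return conn.getresponse().status
```

`APP_TOKEN` and `USER_KEY` in the original snippet are placeholders; the real values from the Pushover account would be passed as `token` and `user`.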

Trigun87 commented 6 years ago

i made a reboot function with a telegram warning

import commands, os  # Python 2 only
from urllib import urlopen, quote

def RebootRig():
  DumpActivity("Rebooting (" + str(miner_hashes) + ")")
  # first field of /proc/uptime is the uptime in seconds
  uptime = float(commands.getstatusoutput("cat /proc/uptime")[1].split()[0])
  m, s = divmod(uptime, 60)
  h, m = divmod(m, 60)
  msg = quote("Rig1 Reboot uptime " + str(h) + ":" + str(m) + ":" + str(s))
  # send the warning before rebooting
  urlopen("https://api.telegram.org/botXXX:APIKEY/sendmessage?chat_id=ID&text=" + msg).read()
  os.system("sudo hard-reboot")
  os.system("sudo reboot")

and now i'm using @krtschmr's version. now i only need to test whether the uptime var is working ^_^ (just use telegram botfather to make a new bot and get the api key)
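
The function above is Python 2 only (`from urllib import urlopen, quote` fails on Python 3), and `str()` on the float results of `divmod` yields values like `1.0:2.0:5.4`. Below is a small Python 3 sketch of just the message-building part. The names `format_uptime` and `build_telegram_url` are hypothetical, and the bot token and chat id stay placeholders exactly as in the post.

```python
from urllib.parse import quote


def format_uptime(seconds):
    """Format an uptime in seconds as H:MM:SS using whole seconds."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return "%d:%02d:%02d" % (hours, minutes, secs)


def build_telegram_url(bot_token, chat_id, text):
    """Build the Bot API sendMessage URL with a URL-encoded message text."""
    return "https://api.telegram.org/bot%s/sendMessage?chat_id=%s&text=%s" % (
        bot_token,
        quote(str(chat_id)),
        quote(text),
    )
```

The resulting URL can be fetched with `urllib.request.urlopen` on Python 3, which is the counterpart of the `urlopen` call used above.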

Trigun87 commented 6 years ago

i think i found a bug in @krtschmr's version... in the disconnectCount part the script checks 12 times (without waiting), and after that it triggers the break and the script stops. i think you need to place a reboot or a continue or something else there, and a time.sleep too. i changed it this way:

  if (numRunningGpus != numGpus or numGpus != 13):

    if (waitForReconnect == 1 and numRunningGpus == 0):
      # all GPUs dead. probably TCP disconnect / pool issue
      # we wait 12 times to resolve these issues. this equals 3 minutes. most likely appears with nicehash.
      disconnectCount += 1
      if (disconnectCount > 12):
        DumpActivity("Waiting for hashes back: " + str(disconnectCount))
        RebootRig()
        break
    else:
      disconnectCount = 0

    RebootRig()
    break
  time.sleep(15)
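
One possible reading of the intended behaviour, going by the comments in the fragment, is: wait up to 12 polls while all GPUs report zero (a likely pool disconnect), but reboot right away when only some GPUs are down. The sketch below expresses that reading as a pure function. It is an assumption about the intent, not the code from either fork; `next_action` and `max_waits` are hypothetical names, and the check against a fixed expected GPU count is left out.

```python
def next_action(num_gpus, num_running, disconnect_count, max_waits=12):
    """Decide what the monitor loop should do on this poll.

    Returns (action, new_disconnect_count), where action is one of
    "ok", "wait" or "reboot".
    """
    if num_running == num_gpus:
        # Everything is hashing: reset the grace-period counter.
        return "ok", 0
    if num_running == 0:
        # All GPUs at zero looks like a pool/TCP disconnect: give it time.
        disconnect_count += 1
        if disconnect_count > max_waits:
            return "reboot", disconnect_count
        return "wait", disconnect_count
    # Only some GPUs are down: treat it as a crash and reboot right away.
    return "reboot", 0
```

In a loop that sleeps 15 seconds between polls, "wait" and "ok" would both fall through to the sleep, and only "reboot" would call the reboot function and leave the loop.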

jmverges commented 6 years ago

@krtschmr is what @Trigun87 is saying true?

krtschmr commented 6 years ago

i don't know yet, had no time to look into it, still trying to get a new 600 gpu farm stable....

i can fix it later

jmverges commented 6 years ago

600 gpu? 😮

Trigun87 commented 6 years ago

ok i fixed the check for disconnect (the var waitForReconnect was useless since it was always 1)

https://github.com/Trigun87/ethos_monitor

i just forked ^_^ i use a new file for the telegram warning (disabled by default) and the number of gpus on the rig (if it starts with fewer gpus it will reboot)

krtschmr commented 6 years ago

@Trigun87 wanna merge into mine?

Trigun87 commented 6 years ago

@krtschmr if u like my version ^_^ (btw is it something u should do or something i should do? never merged anything :-P)

krtschmr commented 6 years ago

@jmverges how do i work in this ethOS FRiends group? i can't create repositories or do anything...

@Trigun87 check gist:

so, my problem is that nicehash terminates the connections sometimes, and/or i don't get work. if i reboot, the rigs are hashing again. sometimes 3/4 of the farm is dead overnight. the issue is the reboot script. ethos 1.2.7 (all < 1.2.9) then has issues with claymore, which still reports SOME hashrate even though it's zero. i can't upgrade to 1.3.0 since powerplay messes up and we would use 8% more electricity

this should fix it. maybe useful for somebody? https://gist.github.com/krtschmr/a915ee7fa9c9c42961a2376dfebf208b