martomi / chiadog

A watchdog providing peace of mind that your Chia farm is running smoothly 24/7.
MIT License
458 stars, 120 forks

ChiaDog is not recovering from a remote harvester being down #283

Open Jacek-ghub opened 3 years ago

Jacek-ghub commented 3 years ago

Hi, I have ChiaDog running on a CentOS box. I mapped my harvesters to local folders. Works great.

However, when a harvester box is restarted, ChiaDog is stuck on not seeing that log file anymore, until I restart ChiaDog for that harvester. Maybe when ChiaDog is detecting harvester down (no access to the file), it should try to check whether the file access has been restored?

Here is the exception generated when the harvester went down:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/retry/api.py", line 73, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/mnt/chia_logs/chiadog/ox/src/chia_log/log_consumer.py", line 75, in _consume_loop
    for log_line in Pygtail(self._expanded_log_path, read_from_end=True, offset_file=self._offset_path):
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/pygtail/core.py", line 89, in __init__
    if self._offset_file_inode != stat(self.filename).st_ino or \
OSError: [Errno 112] Host is down: '/mnt/chia_logs/ox/debug.log'
Exception ignored in: <function Pygtail.__del__ at 0x7f8f87633c10>
Traceback (most recent call last):
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/pygtail/core.py", line 97, in __del__
    if self._filehandle():
  File "/mnt/chia_logs/chiadog/ox/venv/lib/python3.9/site-packages/pygtail/core.py", line 179, in _filehandle
    self._fh = open(filename, "r", 1)
OSError: [Errno 112] Host is down: '/mnt/chia_logs/ox/debug.log'

sorenfriis commented 3 years ago

I see the same issue when using the network_log_consumer with SSH. When the connection is lost (e.g. lost WiFi), it is never restored, and I have to restart the chiadog instance to reestablish the connection and consume the log files.

Jacek-ghub commented 3 years ago

@sorenfriis Is there any reason why you would prefer to expose the whole box (using SSH) rather than just locally mapping the log folder with read-only privileges? You can map it with Samba or NFS, so any box/OS combination will work.

sorenfriis commented 3 years ago

@Jacek-ghub I am only letting chiadog in over SSH with a dedicated user who only has read access to the log file.

Jacek-ghub commented 3 years ago

I would also suggest sending just one notification when a harvester goes down. I guess we all know what to do once we get notified, so the extra notifications are both redundant and (to me only?) annoying.

That said, I would also like to see a notification when a bunch of plots is added (which would indicate that a new drive with plots was connected, e.g. when moving HDs around). That notification would most often complement the one sent when plots disappear from the harvester (HD unplugged from the plotter). It would confirm that the added drive was recognized by the harvester, so we would not need to rely on the rather hopeless full node UI.
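
To make the two notification ideas above concrete, here is a minimal sketch of edge-triggered notifications: a message is produced only when the harvester state changes or when the plot count jumps. Everything in it (the `HarvesterWatcher` class, `observe`, `plot_threshold`, the message texts) is hypothetical and not part of chiadog; it only illustrates the idea.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HarvesterWatcher:
    """Hypothetical helper: reports state changes instead of repeating alerts."""

    plot_threshold: int = 1  # smallest plot-count change worth a message
    _is_up: bool = field(default=True, init=False)  # assume "up" until told otherwise
    _plots: Optional[int] = field(default=None, init=False)

    def observe(self, is_up: bool, plot_count: Optional[int] = None) -> List[str]:
        """Record one observation and return the messages it triggers."""
        messages: List[str] = []

        # Notify once per transition, not once per failed check.
        if is_up != self._is_up:
            messages.append("Harvester recovered" if is_up else "Harvester is down")
            self._is_up = is_up

        # Compare plot counts only while the harvester is reachable.
        if is_up and plot_count is not None:
            if self._plots is not None:
                delta = plot_count - self._plots
                if delta >= self.plot_threshold:
                    messages.append(f"{delta} plots added ({self._plots} -> {plot_count})")
                elif delta <= -self.plot_threshold:
                    messages.append(f"{-delta} plots removed ({self._plots} -> {plot_count})")
            self._plots = plot_count  # the latest count becomes the new baseline

        return messages
```

Calling `observe(False)` repeatedly while the harvester is unreachable yields "Harvester is down" only the first time. Because the baseline is updated on every observation, a `plot_threshold` above 1 reports sudden jumps (a drive being attached or unplugged) but not plots trickling in one at a time.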

martomi commented 3 years ago

Like the suggestions & ideas! Happy to provide guidance if you or anyone else is interested in tackling them in code :-)

Jacek-ghub commented 3 years ago

Sorry, I don't know anything about Python, so potentially my questions will be rather dumb. I did test changes to daily status messages, but it was a pain sitting in a root folder, and trying to grep stuff.

Which files are involved in opening those log files?

martomi commented 3 years ago

You can see the high-level architecture diagram here; it should make the file structure more intuitive. The log consumers are defined in log_consumer.py.

Since you have mapped the remote log file to the local filesystem, the most relevant part of the code is in the FileLogConsumer here (we use pygtail): https://github.com/martomi/chiadog/blob/dd9f46d4f74c1cc4105c9095ef79af69d6b95b79/src/chia_log/log_consumer.py#L75
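
For anyone who wants to experiment: the traceback above shows the OSError escaping the retry wrapper around `_consume_loop` and ending `Thread-1`, after which nothing reads the log any more. What the retry decorator is configured to catch is not shown in this thread. Note that `[Errno 112] Host is down` is raised as a plain `OSError`, not as `FileNotFoundError` or `PermissionError`. Below is a rough sketch of a loop that treats any `OSError` as temporary and keeps polling. It is not chiadog's code: it uses only the standard library, swaps Pygtail for manual byte-offset tracking, and all names (`consume_with_recovery`, `on_line`, `should_stop`) are made up for illustration.

```python
import logging
import os
import time
from typing import Callable, Optional


def consume_with_recovery(
    log_path: str,
    on_line: Callable[[str], None],
    should_stop: Callable[[], bool],
    poll_seconds: float = 1.0,
    retry_seconds: float = 5.0,
) -> None:
    """Tail log_path and keep going while the file is temporarily unreachable."""
    offset: Optional[int] = None  # None until the file has been opened once
    while not should_stop():
        try:
            with open(log_path, "rb") as fh:
                size = os.fstat(fh.fileno()).st_size
                if offset is None:
                    offset = size  # first open: skip history, read new lines only
                elif offset > size:
                    offset = 0  # file shrank, so assume it was rotated or recreated
                fh.seek(offset)
                while True:
                    raw = fh.readline()
                    if not raw.endswith(b"\n"):
                        break  # end of file or a partially written line
                    offset += len(raw)
                    on_line(raw.decode("utf-8", errors="replace").rstrip("\r\n"))
        except OSError as exc:
            # Covers FileNotFoundError, PermissionError and plain OSErrors
            # such as "[Errno 112] Host is down" from a dropped network mount.
            logging.warning("Cannot read %s (%s); retrying", log_path, exc)
            time.sleep(retry_seconds)
            continue
        time.sleep(poll_seconds)
```

The file is reopened on every poll, so a mount that comes back after the harvester reboots is picked up without restarting the process. In a real integration the `except` branch would also be the natural place to raise a single "harvester unreachable" notification.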

ZwaZo22 commented 2 years ago

Got the same behaviour here. The only way I found right now is to kill the chiadog process and restart it.