jinnatar opened 1 year ago
Looking at the code, it seems the keep_alive_monitor is hardcoded to assume a harvester: https://github.com/martomi/chiadog/blob/main/src/notifier/keep_alive_monitor.py#L30
After a deeper peek, the root cause sits a bit deeper: there's only one handler that emits KEEPALIVE events, `harvester_activity_handler.py`, which in turn is keyed off seeing harvesters participate in challenges. Ergo, there is no keepalive for any service except the harvester.
So the "proper" solution to this would be to add KEEPALIVE events to the other handlers when they see activity proving health for that service. However in scope of not spamming harvester being down when one doesn't exist on purpose, I'd say the keep_alive_monitor needs to be initialized empty instead of always assuming a harvester is present. That way the first keepalive seen from a service starts monitoring for it. I'll have a poke at making a PR.
You’re on the right path :) I added the config option for disabling handlers (relatively) recently in version 0.7.1 (here). And indeed, that config only applies to the actual log handlers but doesn’t affect the behavior of the keep_alive_monitor.
Note that there’s also a long-standing thread on this here: https://github.com/martomi/chiadog/issues/93
Happy to look into a PR if you come up with some proposal to solve this. I think KEEPALIVE events can only come from either the signage point handler or the harvester handler? And we should keep defaulting to the harvester since in many setups where chiadog is installed there’s only a harvester and no full node.
The way I'm looking at the keepalive setup, any service posting KEEPALIVE events would prime the system for that service. Relying fully on that, however, means that a service that was already down when chiadog started would never be reported as down.

I'll have a go at just reading the enabled handlers from the config and pre-populating based on that, which would mean that with a default config, all keepalives are expected. This of course means I'll need to ensure all services do have a keepalive event sender :D
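Roughly like this, building on the monitor sketch above and a hypothetical config shape (the real chiadog config keys may differ):

```python
import time


def prepopulate_from_config(monitor, config: dict) -> None:
    """Seed the monitor with every service whose handler is enabled."""
    for service, service_config in config.get("handlers", {}).items():
        if service_config.get("enable", True):
            # Count startup as the last keepalive: a service that is
            # already down and never logs will go stale and alert.
            monitor.last_keep_alive[service] = time.time()
```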
Thanks for the thread context! I think it's plausible to manufacture keepalive signals for any service, as long as there's a log message we know of that proves it's functioning. For the wallet, for example, we could parse a message like:

```
2023-01-18T12:45:07.268 wallet chia.wallet.wallet_blockchain: INFO Peak set to: 3123251 timestamp: 1674038673
```

and check that the difference between the `timestamp` value and the time we read the message is less than, say, 30 seconds, which gives a reasonable estimate that the wallet is healthy and up to speed.
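A sketch of that check, assuming the log format in the example above (regex and threshold are illustrative):

```python
import re
import time

# Assumed format of the wallet peak message shown above.
PEAK_RE = re.compile(r"Peak set to: (?P<height>\d+) timestamp: (?P<timestamp>\d+)")
MAX_LAG_SECONDS = 30


def wallet_keepalive(log_line: str) -> bool:
    """True if this line proves the wallet is healthy and up to speed."""
    match = PEAK_RE.search(log_line)
    if match is None:
        return False  # not a peak message; no signal either way
    lag = time.time() - int(match.group("timestamp"))
    return lag < MAX_LAG_SECONDS
```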
Agreed!
@martomi: Working on this I've come across a few architectural design choices that may need to be questioned and/or changed, so a handful of questions:

1. For wallet keepalives the obvious route is a new handler, e.g. `wallet_peak_handler`, but it gets unwieldy in both the config and in the code to keep expanding this pattern. Alternatively, would it make sense in config to treat handlers per service instead of as individual handlers? What I mean is, have config items and the `enable` bit per service instead of per individual handler.
2. The current config structure of `handlers -> maps of individual handlers` seems to become unwieldy. At a minimum I'd propose changing it to `handlers -> {fullnode,harvester,farmer,wallet} -> maps of individual handlers`.
If you can share any insight into which direction of evolution matches the style of the project, I can then make more informed proposals. :-)
Thank you for the detailed write-up - really appreciate the thoughtfulness about the direction and future implications!
The mapping `service -> [handler]` makes sense to me. I imagine a config in the future could look like this (you might have better ideas):
```yaml
monitored_services:
  - fullnode
  - harvester
  - farmer
  - wallet

handler_configs:
  wallet_added_coin_handler:
    min_mojos_amount: 5
```
Where `monitored_services` will implicitly enable all handlers linked to the specified service, without the ability for finer-grained control on handlers (I don’t see a use-case). Also, I think all services in that list should be monitored in the `keep_alive_monitor` without any additional config in that section, what do you think?
And `handler_configs` would be just for some extra optional config overrides. But that section would be completely optional as we’ll want to have sane defaults here.
We could also go for a more nested approach in the config, as in `services > handlers`. My opinion isn’t strong here. It just feels cleaner to keep it flatter for some reason.
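For what it’s worth, a sketch of how the flat list could drive both sides (the service-to-handler grouping below is illustrative, and `wallet_peak_handler` is the proposed new one):

```python
# Illustrative grouping of handler names by service.
SERVICE_HANDLERS = {
    "harvester": ["harvester_activity_handler"],
    "fullnode": ["finished_signage_point_handler", "block_handler"],
    "farmer": ["partial_handler"],
    "wallet": ["wallet_added_coin_handler", "wallet_peak_handler"],
}


def enabled_handlers(config: dict) -> list[str]:
    """Expand monitored_services into the full set of handlers to run."""
    handlers: list[str] = []
    for service in config.get("monitored_services", []):
        handlers.extend(SERVICE_HANDLERS.get(service, []))
    return handlers


def services_to_keep_alive(config: dict) -> list[str]:
    # The same list seeds the keep_alive_monitor, no extra config needed.
    return config.get("monitored_services", [])
```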
I’m 100% onboard with handling the tech debt in the YAML configuration as a potential first step. In the past, we’ve been careful to maintain backwards compatibility for folks by adding optional configs with sane defaults. The problem is that those defaults are configured at the point of usage of the respective config values. It should really happen in the `Config` class, which is quite useless in its current form.
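Something in this direction, as a sketch (the defaults and merge strategy here are just placeholders):

```python
import yaml

# Placeholder defaults; the real set would mirror config-example.yaml.
DEFAULTS = {
    "notification_title_prefix": "Chia",
    "log_level": "INFO",
}


class Config:
    """Applies sane defaults in one place instead of at each usage site."""

    def __init__(self, path: str):
        with open(path) as f:
            user_config = yaml.safe_load(f) or {}
        # Shallow merge; nested sections would need a recursive merge.
        self._config = {**DEFAULTS, **user_config}

    def get(self, key: str, default=None):
        return self._config.get(key, default)
```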
Note that, already today, the `config-example.yaml` contains more than necessary for documentation purposes. A functional config out in the wild can be 80% smaller and still have everything enabled.
E.g. this should work:

```yaml
notification_title_prefix: 'Chia'
log_level: INFO

chia_logs:
  file_log_consumer:
    enable: true
    file_path: '~/.chia/mainnet/log/debug.log'

notifier:
  pushover:
    enable: true
    credentials:
      api_token: 'dummy_token'
      user_key: 'dummy_key'
```
You may have noticed that the project is mostly in a feature-freeze for the past 1+ year. I’ve been lurking around and making sure we’re merging and releasing bugfixes to not degrade the existing experience.
In general I’m happy to discuss and support community-driven evolution as long as the changes are thoughtful and (most importantly) don’t increase the maintenance burden. 😄
It should also be noted that Chia Network itself is investing resources into official chia farming monitoring tools (video from JM), which means that the long-term usefulness of chiadog is potentially limited.
Trying to be maximally transparent so you can calibrate how much you would like to invest in the architecture & extensibility here. Otherwise, I’m happy to move in the proposed direction!
Thanks for the details and disclosure. I think there's enough here for me to continue. It sounds like we're pretty aligned so I'll start working on the config migration first with an emphasis on as much backwards compatibility as sane.
Note to self, learned the hard way during testing: we also need a keepalive check verifying that the log file itself is still flowing. Hooking up to a log file that isn't updating at all creates a very eerie scenario: it does trigger keepalives, but those can be misleading, because it's not really a service that has stopped working, it's logging itself that has "failed".
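Something like this could cover it, as a sketch (mtime-based, threshold illustrative):

```python
import os
import time


def log_file_is_flowing(path: str, max_age_seconds: int = 300) -> bool:
    """True if the log file was written to recently enough."""
    try:
        mtime = os.path.getmtime(os.path.expanduser(path))
    except OSError:
        return False  # a missing file is also a logging failure
    return time.time() - mtime < max_age_seconds


if not log_file_is_flowing("~/.chia/mainnet/log/debug.log"):
    print("debug.log has gone quiet: logging itself may have failed")
```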
Even if the handler `harvester_activity_handler` is set to `enable: false`, a harvester is still expected:

> Your harvester appears to be offline! No events for the past 600 seconds.

- The monitored system runs `services="node farmer-only wallet"`, i.e. it's not running a harvester on purpose since there are no local plots.
- Happens with log level `DEBUG` or `INFO`, have not tested other levels.

Relevant config sections:
Environment:

- chia: `ghcr.io/chia-network/chia:latest`, i.e. `1.6.2`
- chiadog: `v0.7.5` as present in `sha256:4c19849657d2fc78c91257f1c18c8a18ac6a99d094e4d6c3e530300d3905a291`