jinnatar opened 1 year ago
Looking at the code, it seems the keep_alive_monitor is hardcoded to assume a harvester: https://github.com/martomi/chiadog/blob/main/src/notifier/keep_alive_monitor.py#L30
After a deeper peek, the root cause sits a bit deeper: there's only one handler that emits KEEPALIVE events, `harvester_activity_handler.py`, which in turn is keyed off seeing harvesters participate in challenges. Ergo, there is no keepalive for any service except the harvester.
So the "proper" solution to this would be to add KEEPALIVE events to the other handlers when they see activity proving health for that service. However in scope of not spamming harvester being down when one doesn't exist on purpose, I'd say the keep_alive_monitor needs to be initialized empty instead of always assuming a harvester is present. That way the first keepalive seen from a service starts monitoring for it. I'll have a poke at making a PR.
You’re on the right path :) I added the config option for disabling handlers (relatively) recently in version 0.7.1 (here). And indeed, that config only applies to the actual log handlers but doesn’t affect the behavior of the keep_alive_monitor.
Note that there’s also a long-standing thread on this here: https://github.com/martomi/chiadog/issues/93
Happy to look into a PR if you come up with some proposal to solve this. I think KEEPALIVE events can only come from either the signage point handler or the harvester handler? And we should keep defaulting to the harvester since in many setups where chiadog is installed there’s only a harvester and no full node.
The way I'm looking at the keepalive setup, any service posting KEEPALIVE events would prime the system for that service. Relying fully on that, however, means that a service that was already down when chiadog started would never be reported as down.

I'll have a go at just reading the enabled handlers from the config and pre-populating based on that, which would mean that with a default config, all keepalives are expected. This of course means I'll need to ensure all services do have a keepalive event sender :D
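Roughly like this, building on the monitor sketch above and a hypothetical config shape (the real chiadog config keys may differ):

```python
import time


def prepopulate_from_config(monitor, config: dict) -> None:
    """Seed the monitor with every service whose handler is enabled."""
    for service, service_config in config.get("handlers", {}).items():
        if service_config.get("enable", True):
            # Count startup as the last keepalive: a service that is
            # already down and never logs will go stale and alert.
            monitor.last_keep_alive[service] = time.time()
```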
Thanks for the thread context! I think it's plausible to manufacture keepalive signals for any service, as long as there's a log message we know of that proves it's functioning. For the wallet, for example, we could parse a message like:

```
2023-01-18T12:45:07.268 wallet chia.wallet.wallet_blockchain: INFO Peak set to: 3123251 timestamp: 1674038673
```

and check that the difference between the `timestamp` value and the time we read the message is less than, say, 30 seconds, which gives a reasonable estimate that the wallet is healthy and up to speed.
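A sketch of that check, assuming the log format in the example above (regex and threshold are illustrative):

```python
import re
import time

# Assumed format of the wallet peak message shown above.
PEAK_RE = re.compile(r"Peak set to: (?P<height>\d+) timestamp: (?P<timestamp>\d+)")
MAX_LAG_SECONDS = 30


def wallet_keepalive(log_line: str) -> bool:
    """True if this line proves the wallet is healthy and up to speed."""
    match = PEAK_RE.search(log_line)
    if match is None:
        return False  # not a peak message; no signal either way
    lag = time.time() - int(match.group("timestamp"))
    return lag < MAX_LAG_SECONDS
```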
Agreed!
@martomi: Working on this I've come across a few architectural design choices that may need to be questioned and/or changed, so a handful of questions:

1. For wallet keepalives the obvious route is a new handler, e.g. `wallet_peak_handler`, but it gets unwieldy in both the config and in the code to keep expanding this pattern. Alternatively, would it make sense in config to treat handlers per service instead of as individual handlers? What I mean is, have config items and the `enable` bit per service instead of per individual handler.
2. The current config structure of `handlers -> maps of individual handlers` seems to become unwieldy. At a minimum I'd propose changing it to `handlers -> {fullnode,harvester,farmer,wallet} -> maps of individual handlers`.
If you can share any insight into which direction of evolution matches the style of the project, I can then make more informed proposals. :-)
Thank you for the detailed write-up - really appreciate the thoughtfulness about the direction and future implications!
The mapping `service -> [handler]` makes sense to me. I imagine a config in the future could look like this (you might have better ideas):
```yaml
monitored_services:
  - fullnode
  - harvester
  - farmer
  - wallet

handler_configs:
  wallet_added_coin_handler:
    min_mojos_amount: 5
```
Where `monitored_services` will implicitly enable all handlers linked to the specified service, without the ability for finer-grained control on handlers (I don’t see a use-case). Also, I think all services in that list should be monitored in the `keep_alive_monitor` without any additional config in that section, what do you think?
And `handler_configs` would be just for some extra optional config overrides. But that section would be completely optional as we’ll want to have sane defaults here.
We could also go for a more nested approach in the config, as in `services > handlers`. My opinion isn’t strong here. It just feels cleaner to keep it flatter for some reason.
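For what it’s worth, a sketch of how the flat list could drive both sides (the service-to-handler grouping below is illustrative, and `wallet_peak_handler` is the proposed new one):

```python
# Illustrative grouping of handler names by service.
SERVICE_HANDLERS = {
    "harvester": ["harvester_activity_handler"],
    "fullnode": ["finished_signage_point_handler", "block_handler"],
    "farmer": ["partial_handler"],
    "wallet": ["wallet_added_coin_handler", "wallet_peak_handler"],
}


def enabled_handlers(config: dict) -> list[str]:
    """Expand monitored_services into the full set of handlers to run."""
    handlers: list[str] = []
    for service in config.get("monitored_services", []):
        handlers.extend(SERVICE_HANDLERS.get(service, []))
    return handlers


def services_to_keep_alive(config: dict) -> list[str]:
    # The same list seeds the keep_alive_monitor, no extra config needed.
    return config.get("monitored_services", [])
```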
I’m 100% onboard with handling the tech debt in the YAML configuration as a potential first step. In the past, we’ve been careful to maintain backwards compatibility for folks by adding optional configs with sane defaults. The problem is that those defaults are configured at the point of usage of the respective config values. It should really happen in the `Config` class, which is quite useless in its current form.
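Something in this direction, as a sketch (the defaults and merge strategy here are just placeholders):

```python
import yaml

# Placeholder defaults; the real set would mirror config-example.yaml.
DEFAULTS = {
    "notification_title_prefix": "Chia",
    "log_level": "INFO",
}


class Config:
    """Applies sane defaults in one place instead of at each usage site."""

    def __init__(self, path: str):
        with open(path) as f:
            user_config = yaml.safe_load(f) or {}
        # Shallow merge; nested sections would need a recursive merge.
        self._config = {**DEFAULTS, **user_config}

    def get(self, key: str, default=None):
        return self._config.get(key, default)
```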
Note that, already today, the `config-example.yaml` contains more than necessary for documentation purposes. A functional config out in the wild can be 80% smaller and still have everything enabled.
E.g. this should work:

```yaml
notification_title_prefix: 'Chia'
log_level: INFO

chia_logs:
  file_log_consumer:
    enable: true
    file_path: '~/.chia/mainnet/log/debug.log'

notifier:
  pushover:
    enable: true
    credentials:
      api_token: 'dummy_token'
      user_key: 'dummy_key'
```
You may have noticed that the project is mostly in a feature-freeze for the past 1+ year. I’ve been lurking around and making sure we’re merging and releasing bugfixes to not degrade the existing experience.
In general I’m happy to discuss and support community-driven evolution as long as the changes are thoughtful and (most importantly) don’t increase the maintenance burden. 😄
It should also be noted that Chia Network itself is investing resources into official chia farming monitoring tools (video from JM), which means that the long-term usefulness of chiadog is potentially limited.
Trying to be maximally transparent so you can calibrate how much you would like to invest in the architecture & extensibility here. Otherwise, I’m happy to move in the proposed direction!
Thanks for the details and disclosure. I think there's enough here for me to continue. It sounds like we're pretty aligned so I'll start working on the config migration first with an emphasis on as much backwards compatibility as sane.
Note to self, learned the hard way during testing: we also need a keepalive check verifying that the log file itself is still flowing. Hooking up to a log file that isn't updating at all creates a very eerie scenario: it does trigger keepalives, but those can be misleading, because it's not really a service that has stopped working, it's logging itself that has "failed".
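Something like this could cover it, as a sketch (mtime-based, threshold illustrative):

```python
import os
import time


def log_file_is_flowing(path: str, max_age_seconds: int = 300) -> bool:
    """True if the log file was written to recently enough."""
    try:
        mtime = os.path.getmtime(os.path.expanduser(path))
    except OSError:
        return False  # a missing file is also a logging failure
    return time.time() - mtime < max_age_seconds


if not log_file_is_flowing("~/.chia/mainnet/log/debug.log"):
    print("debug.log has gone quiet: logging itself may have failed")
```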
Even if the handler `harvester_activity_handler` is set to `enable: false`, a harvester is still expected:

> Your harvester appears to be offline! No events for the past 600 seconds.

- The monitored system runs `services="node farmer-only wallet"`, i.e. it's not running a harvester on purpose since there are no local plots.
- Happens with log level `DEBUG` or `INFO`, have not tested other levels.

Relevant config sections:
Environment:

- chia: `ghcr.io/chia-network/chia:latest`, i.e. `1.6.2`
- chiadog: `v0.7.5` as present in `sha256:4c19849657d2fc78c91257f1c18c8a18ac6a99d094e4d6c3e530300d3905a291`