Open darkk opened 6 years ago
BTW, it's relatively easy to monitor a Tor hidden service within the current setup. The plan is:
Actually, I'm very unsure whether all of the above makes sense right now. I've observed that capacity and rare-error alerts in #ooni-bots are often neither acted upon nor silenced, so they turn the channel into a stream of noise. I assume that adding more of them would be counterproductive with the current alerting setup.
So, given recent incidents, I'm limiting it to monitoring the public *.onion endpoints of OONI at the moment, as that is both a non-trivial and a needed check.
I agree that it's more important to conclude the other pipeline-related tasks, and that the current state of monitoring doesn't warrant more noise.
Speaking of which, @darkk, is it possible to make the onion service alerts less noisy? I have seen them get pretty chatty recently.
Well, they were added recently and were consistently failing :-) Let's try to bump timeouts: 36f0d85aeaa5b45b5d3db146be21839fb2ba5edd, but maybe that's some issue with publication of onion service descriptors or something else that may be worth investigation.
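For illustration only: besides longer timeouts, another common way to make a flaky check less chatty is to alert only after several consecutive failures. The sketch below is not what the commit above does, and `FlapFilter` and its `threshold` parameter are hypothetical names made up for this example. If the alerts come from Prometheus alerting rules, the `for:` clause of a rule plays a similar role.

```python
from dataclasses import dataclass


@dataclass
class FlapFilter:
    """Hypothetical helper: report an alert only after `threshold`
    consecutive failed probes, so a single slow onion fetch stays quiet."""

    threshold: int = 3
    failures: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True while the alert should fire."""
        if probe_ok:
            # Any success resets the streak.
            self.failures = 0
        else:
            self.failures += 1
        return self.failures >= self.threshold


if __name__ == "__main__":
    flt = FlapFilter(threshold=3)
    # Two isolated failures, a recovery, then three failures in a row.
    probes = [True, False, False, True, False, False, False]
    for ok in probes:
        print(ok, flt.observe(ok))
```

The trade-off is detection latency: with a probe interval of one minute and a threshold of 3, a real outage is reported roughly three minutes after it starts.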
I see this has been moved into the "Ready for Review" column. I guess there is still some stuff that we may want to do related to this and it's probably wise to keep the issue open.
Perhaps the best course of action is to unassign it from you and remove it from the Leonid plan board?
Yes, that was exactly the point.
Implementation of the items mentioned in the ticket may be valuable in the future. The current value of those items is unclear (I'm unsure who will react to those alerts and what the reaction should be), and that's why they were postponed in the first place. I think the important thing to do before going further with moar-alerts is to build a better process around alert handling.
So, as I've said, I've limited my activity to monitoring the *.onion endpoints, as 1) it is not so trivial to set up, and 2) it is important, since those public endpoints going down leads to data loss according to https://github.com/ooni/sysadmin/pull/277#issuecomment-469057510
cc @FedericoCeratto
Oops and BUG: in kernel log: ooni/sysadmin#210
engine_daemon_container_actions_seconds_count{action="start"}: ooni/sysadmin#220