ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers
BSD 3-Clause "New" or "Revised" License

Monitoring epic, v3 #243

Open darkk opened 6 years ago

darkk commented 6 years ago
darkk commented 5 years ago

BTW, it's relatively easy to monitor a Tor hidden service within the current setup. The plan is:

darkk commented 5 years ago

Actually, I'm very unsure if all of the above makes sense right now. I've observed that capacity and rare-errors alerts at #ooni-bots are often neither acted upon nor silenced, so they turn the channel into a stream of noise. I assume that adding more of that would be counterproductive with the current alerting setup.

So, given recent incidents, I'm limiting it to monitoring OONI's public *.onion endpoints for now, as it's both a non-trivial and a needed check.

hellais commented 5 years ago

I agree that it's more important to conclude the other pipeline-related tasks, and the current state of monitoring doesn't warrant more noise.

Speaking of which @darkk is it possible to reduce the level of noisiness of the onion service alerts? I have seen them get pretty chatty recently.

darkk commented 5 years ago

Well, they were added recently and were consistently failing :-) Let's try to bump the timeouts: 36f0d85aeaa5b45b5d3db146be21839fb2ba5edd, but maybe it's some issue with the publication of onion service descriptors or something else that may be worth investigating.

hellais commented 5 years ago

I see this has been moved into the "Ready for Review" column. I guess there is still some stuff that we may want to do related to this and it's probably wise to keep the issue open.

Perhaps the best course of action is to unassign it from you and remove it from the Leonid plan board?

darkk commented 5 years ago

Yes, that was exactly the point.

Implementation of the items mentioned in the ticket may be valuable in the future. The current value of those items is unclear (I'm unsure who will react to those alerts and what the reaction should be), and that's why they were postponed in the first place. I think the important thing to do before going further with moar-alerts is to build a better process around alert handling.

So, as I've said, I've limited my activity to monitoring *.onion endpoints as 1) it is not so trivial to set up, 2) it is important, as those public endpoints going down leads to data loss according to https://github.com/ooni/sysadmin/pull/277#issuecomment-469057510
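For illustration only, here is a minimal sketch of what a reachability probe for an *.onion endpoint could look like. It is not the check that was deployed. It assumes a local Tor client exposing a SOCKS5 proxy on 127.0.0.1:9050; the function names and the 120-second timeout are hypothetical. The probe only verifies that Tor can open a circuit to the service, which is the step that tends to be slow and is why a generous timeout matters.

```python
import socket
import struct


def socks5_connect_request(host: str, port: int) -> bytes:
    """Build a SOCKS5 CONNECT request that passes the hostname to the proxy.

    The name must be resolved by Tor itself, so address type 0x03
    (domain name) is used instead of a resolved IP address.
    """
    name = host.encode("ascii")
    if not 0 < len(name) <= 255:
        raise ValueError("hostname must be 1..255 bytes")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain), length, name, port
    return b"\x05\x01\x00\x03" + bytes([len(name)]) + name + struct.pack("!H", port)


def socks5_reply_ok(reply: bytes) -> bool:
    """Return True if the proxy reply reports success (VER=5, REP=0)."""
    return len(reply) >= 2 and reply[0] == 5 and reply[1] == 0


def probe_onion(host, port=80, proxy=("127.0.0.1", 9050), timeout=120.0) -> bool:
    """Return True if the proxy could connect to host:port.

    Assumption: `proxy` is a local Tor SOCKS5 listener. The timeout is
    deliberately generous because building a circuit can take a while.
    """
    try:
        with socket.create_connection(proxy, timeout=timeout) as s:
            s.sendall(b"\x05\x01\x00")  # greeting: offer "no authentication"
            if s.recv(2) != b"\x05\x00":
                return False
            s.sendall(socks5_connect_request(host, port))
            return socks5_reply_ok(s.recv(10))
    except OSError:
        # Covers refused connections and timeouts alike.
        return False
```

A monitoring job could call `probe_onion("<service>.onion")` on a schedule and alert only after several consecutive failures, which would also help with the noise concern raised above.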

hellais commented 4 years ago

cc @FedericoCeratto