mozilla / ssh_scan_api

An API for ssh_scan (https://github.com/mozilla/ssh_scan) and the backend API service for the Mozilla SSH Observatory (https://observatory.mozilla.org/)

Add Service Monitoring #103

Open claudijd opened 6 years ago

claudijd commented 6 years ago

Usually, April is the first person to hear about Mozilla SSH Observatory issues because she works on Observatory much more than I do. However, these issues generally boil down to one of two areas, so I should add monitoring for both and be the first person to know.

1.) Alert me when the site is not responding (this is usually nginx restarting and failing, or a failed Let's Encrypt renewal)
2.) Alert me when the queues are non-zero and not changing (this is usually an indication that something is broken or that the site is being abused)
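
A rough sketch of the two checks listed above, in Python. This is illustrative only and is not the monitoring that MOC runs: the function names `site_is_up` and `queue_is_stuck` are hypothetical, and how the queue count is obtained is left to the caller. Note that the second check needs the value from the previous run, so the monitor has to keep state between checks.

```python
import urllib.error
import urllib.request
from typing import Optional


def site_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status.

    Connection failures, TLS errors (for example an expired
    certificate), timeouts, and HTTP error statuses all count as down.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def queue_is_stuck(previous: Optional[int], current: int) -> bool:
    """Return True if the queue is non-zero and unchanged since the last check.

    `previous` is the count seen on the prior check, or None on the
    first run, when there is nothing to compare against yet.
    """
    return previous is not None and current > 0 and current == previous
```

A monitor would call `site_is_up` on each interval and pass the last observed queue count into `queue_is_stuck` along with the new one, alerting when either check fails.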

claudijd commented 6 years ago

Requested via MOC in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1390296

floatingatoll commented 6 years ago

You may find more value in “max queue age” than raw count, since max age should be fixed at some small value thanks to autoscale.


claudijd commented 6 years ago

@floatingatoll good point, I'll need to add a reporting attribute to the stats so that this is visible. I like it a lot because it doesn't require a monitoring endpoint to maintain state between checks. The check would simply be: if "max queue age" gets past X, then alert.

claudijd commented 6 years ago

The QUEUED_MAX_AGE attribute has been deployed to production and can be seen here:

https://sshscan.rubidus.com/api/v1/stats

The acceptable tolerance requested of MOC is 0-30 seconds. Anything outside that range is either an infrastructure issue or an abuse scenario, either of which fundamentally affects the user experience.
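
For illustration, the stateless threshold check described above might look something like the following Python sketch. It assumes the stats response is a JSON object with a top-level numeric `QUEUED_MAX_AGE` field measured in seconds; the exact shape of the payload is an assumption, and the function name and sample payload are hypothetical.

```python
import json

# Upper bound of the 0-30 second tolerance requested of MOC.
MAX_AGE_LIMIT_SECONDS = 30.0


def queue_age_out_of_tolerance(stats: dict,
                               limit: float = MAX_AGE_LIMIT_SECONDS) -> bool:
    """Return True if QUEUED_MAX_AGE falls outside the 0..limit range.

    Assumes `stats` is the parsed stats payload with a top-level
    QUEUED_MAX_AGE value in seconds (an assumption about the format).
    """
    age = float(stats["QUEUED_MAX_AGE"])
    return not (0 <= age <= limit)


# Hypothetical payload, not real output from the stats endpoint.
sample = json.loads('{"QUEUED_MAX_AGE": 42}')
should_alert = queue_age_out_of_tolerance(sample)
```

Because the check depends only on the current value, each run is independent and the monitor does not need to remember anything from the previous check.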