Improve logging - Githubissues

tsibley commented 2 years ago

Logging in nextstrain.org is currently ad-hoc and sometimes useful, sometimes not enough. The goal is to make what's logged more systematic and comprehensively useful while at the same time improving our ability to inspect and act on the logs.

Below are some proposed changes to make, spanning both the codebase itself as well as our deployment infra. I expect to break each of these items out into separate issues in due course.

[ ] Standardize on debug or bunyan for emitting logs.

debug is already used by Express and several related deps (set DEBUG='*' to see these logs), which is a point in its favor, but it's relatively limited (e.g. no log levels, no structured logs, no child loggers). bunyan provides a more traditional and modern set of logging features, including a focus on structured logs which greatly helps with analysis.

Retire our existing logging functions (e.g. utils.verbose()) in favor of the above.
[ ] Switch from ad-hoc messages that embed limited context to structured logs with increased context across the board (e.g. all logs include req context, incl. critical request ids).

The ad-hoc messages don't all have to go away—they're still useful for skimming by eye, particularly if well-written—but they shouldn't be the only thing produced by a log line.
[ ] Archive raw logs to S3 for retrospective debugging/analysis.

Archives should be raw, straight from Heroku's Logplex service.

Two quick examples of things archives are useful for:
- Investigating reports like "this weird thing happened last week"
- Analyzing the impact of a bug over time, e.g. "how many requests were impacted by this over the last month?"
Keep the retention period fairly short so we're not storing (potentially sensitive) data we don't need. Maybe retain for 30 days? (which is also the minimum duration you pay for with S3)

Papertrail supports S3 archiving, but we don't currently ingest raw logs into Papertrail due to account limitations (see our incoming log filters).
[ ] Ingest logs into analysis system for real-time inspection.

Currently this is Papertrail, a Heroku add-on. It's got a nice straightforward interface. However, we're using a pretty stringent free tier (10MB/day, 2d retention for search) so have to omit lots of useful logs (e.g. all GET requests!). Retention period is also pretty short. Paid tiers that would accommodate our usage largely unfiltered get pricey.

Logtail is another (glossier) app that also has a Heroku add-on, with a free tier that's a bit more generous than Papertrail (5GB/month, 8d retention for search). Even it, however, would require a bit of filtering on ingest and prices increase quickly from free. Interestingly, the Heroku add-on's free tier is more generous than the non-Heroku free tier of the same product.

Grafana Logs is part of Grafana's hosted offering, Grafana Cloud, with a generous free tier of 50GB logs and 14d retention. No Heroku add-on, so we'd have to manually configure the Heroku log drain → Grafana connection. This seems straightforward to do with Vector (a log-processing multi-tool), though means we'll have to run Vector somewhere persistent. The smallest EC2 instance or Heroku dyno would almost certainly be fine, both at around $7/month. Fly.io uses Vector exactly like this for their fly-log-shipper sidecar service.

A big benefit of using Grafana is that we'd also get an industry-leading suite of metrics, monitoring, dashboarding, and alerting tools. For the Seattle Flu infra, Grafana was a very nice place to ship our metrics and logs to. We don't have non-Heroku-generated metrics for nextstrain.org right now, but there are several I think would be good to start tracking. Downside is it does include an additional moving part we'd have to manage (the connection between Heroku and it). For this reason primarily, it could make sense to stick with Papertrail (or switch to Logtail) and pay a bit more (or ask for a gratis bump up as a non-profit).

jameshadfield commented 6 months ago

Standardize on debug or bunyan for emitting logs.

We only have 40-50 usages of our utils.{verbose,log,warn,error} functions in the entire server and I'd imagine switching them all to {debug,console.*} calls would be well worth it in the short term. This doesn't rule out a move to some other logging tool in the future, but debug calls are already widely used throughout the server (mostly implicitly).

Is there an agreed standard for general want-to-always-see-this logging when using the 'debug' module? Its docs seem to suggest we'd do something like debug('nextstrain') and then habitually run with at least DEBUG=nextstrain.

tsibley commented 6 months ago

Is there an agreed standard for general want-to-always-see-this logging when using the 'debug' module? Its docs seem to suggest we'd do something like debug('nextstrain') and then habitually run with at least DEBUG=nextstrain.

Not really? I mean, you always have to specify DEBUG at invocation, or modify it programmatically very early at process start (but there be pitfalls). It really is meant for debug logging which defaults to off.

This is why I do think we want a proper logging library beyond debug.

But agreed that ditching utils.{verbose,log,warn,error} for {debug,console.*} in the short-term would probably be worthwhile.

nextstrain / nextstrain.org

Improve logging #511