audit mozlog usage - Githubissues

jaredhirsch commented 9 years ago

Some mozlog nits:

The first argument should be a String with no spaces. It will become part of the op that Heka filters by. You probably won't be logging the verbose lines, but this warn (and the others) will trip things up. You can set debug to true in mozlog config to add an assertion you're playing by the rules.

If you pass an Error as an argument, instead of concatenating, you'll get a stack trace. shrug

Let's use the Heka-style reporter and make sure the output works properly across the whole server.

jaredhirsch commented 9 years ago

more from that same PR: "the 2 strings will be concatenated together. So log.info('checkAuth') would create the op of chronicle.server.bell.profile.checkAuth."

jaredhirsch commented 9 years ago

need to double-check that mozlog is statsd-compatible
also need to make sure we never log PII at any level above verbose. This will make production debugging tricky, might as well learn the ropes now.

jaredhirsch commented 9 years ago

ehh, bumping to sprint 5

jaredhirsch commented 9 years ago

Here's a sketch of a more coherent approach to logging. I think these notes are detailed enough that it's pretty clear which level to use when developing, and by using all the available levels, we can record that errors have occurred, but avoid paging ops for non-critical errors. We can just wire up the critical errors to the pagerduty queue and ship the rest of the errors to graphite for offline analysis ("oh look, that's a lot of weird errors from that part of the app since our last release...")

@mostlygeek @nchapman @pdehaan input welcome

log level conventions

Our log library offers 7 levels of log verbosity. Here's how we use them:

log.critical: used to mark unrecoverable errors that prevent the server from starting or running
- example: database is totally gone, work queue is totally gone
- best to just call process.exit and shut down. (TODO make our daemonizing tools aware, rather than naively restarting on massive failure)
log.error: used to mark errors that cause the current user request to fail; user will receive a 4xx or 5xx error, but can try again
- example: a database query needed for the current user request times out
log.warn: used to mark non-critical errors or warnings in background tasks (that is, tasks where there's no user waiting for the response)
- we'll sometimes reuse methods for foreground and background tasks; use the more severe log.error level in those cases
- warn should be reserved for situations where the retry logic is wired into the app, so that recovery is automated; if retry isn't handled gracefully, go with log.error instead
- example: a worker fails because a remote API timed out, but the work queue will retry automatically
log.info: used to mark significant events in the ordinary life of the application; the default production log level
- example: log a request to a given endpoint from a given user agent (although ELB gives this to us in production)
log.debug: used to log extra information useful in debugging, like marking entrance and exit from a function, but which must not include user data
- this is the most detailed log level that we would ever want to use in a production environment
- examples:
- ordinary (non-error) events in the lifecycle of a job, worker, connection pool, process, or http request
- log the time needed to complete a database query (just subtract two timestamps)
- mark entrance and exit from a function
log.verbose: same as log.debug, except we do include user data in the log lines
log.trace: useful for debugging extraordinarily difficult problem spots: the logger will include a traceback along with the log line

nchapman commented 9 years ago

Related: #275

jaredhirsch commented 9 years ago

if we opt for structured logs with object hashes, we can ensure that production levels wipe out password and userId and other PII to prevent user data being displayed . or we can stick with flat log messages and use log levels to protect user data in prod.

mostlygeek commented 9 years ago

Did you just copy/pasta the syslog levels? :)

So we have this: https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Logging+Standard which is our currently sort of agreed upon logging standard. If you log output in this format we already have the heka transports available w/out any custom code.

Additionally, I like the idea of monitoring based on the number of messages generated at a specific log level! If you could stick these into statsd output we can set thresholds. It'd be a new experiment for us and it'll decouple metrics/logging from monitoring/alerting.

jaredhirsch commented 9 years ago

@mostlygeek Those are the log levels exposed by mozlog, which has heka integration built-in (that was my reason for choosing it). In these docs, I'm just trying to take advantage of the pre-defined granularity to partially solve our story around logging and partially come up with a story around error management (eg, bend vs break should imply there's a nearby log.critical vs log.warn).

I don't know if mozlog supports statsd format, I'll look into the simplest (1 hour or less) workaround.

jaredhirsch commented 9 years ago

One important consequence of not ever leaking user data is that we can't log the errors themselves at any level higher than verbose, since the error object might contain user data in it.

This means those log lines need to be quite liberally sprinkled through the code, since we will rarely see a stack trace in production.

jaredhirsch commented 9 years ago

whew, I really don't like the code I wound up with here. Too complicated for our current needs. The idea of never logging user info is cool in theory, but it's way better to anonymize users instead (as part of the logging / metrics pipeline), so that we can draw useful inferences. Example: how many users visit the site daily vs weekly.

I've officially fallen down the rabbit hole on this task, so, going to pause for now, meet with the datapipeline team, rethink what ops / metrics needs vs what debugging needs, and get back to this later in the sprint, with a tight timebox.

mostlygeek commented 9 years ago

If you want to uniquely identify people but also keep it anonymous, you could sha1/md5 whatever key you're using. I would suggest something like email address instead of a monotonically increasing UID. If the UID is a uuid, then sha1(uuid) may be good enough.

mozilla / chronicle

audit mozlog usage #87

log level conventions