Open jaredhirsch opened 9 years ago
more from that same PR: "the 2 strings will be concatenated together. So log.info('checkAuth')
would create the op of chronicle.server.bell.profile.checkAuth
."
verbose
. This will make production debugging tricky, might as well learn the ropes now.ehh, bumping to sprint 5
Here's a sketch of a more coherent approach to logging. I think these notes are detailed enough that it's pretty clear which level to use when developing, and by using all the available levels, we can record that errors have occurred, but avoid paging ops for non-critical errors. We can just wire up the critical errors to the pagerduty queue and ship the rest of the errors to graphite for offline analysis ("oh look, that's a lot of weird errors from that part of the app since our last release...")
@mostlygeek @nchapman @pdehaan input welcome
Our log library offers 7 levels of log verbosity. Here's how we use them:
process.exit
and shut down. (TODO make our daemonizing tools aware, rather than naively restarting on massive failure)Related: #275
if we opt for structured logs with object hashes, we can ensure that production levels wipe out password and userId and other PII to prevent user data being displayed . or we can stick with flat log messages and use log levels to protect user data in prod.
Did you just copy/pasta the syslog levels? :)
So we have this: https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Logging+Standard which is our currently sort of agreed upon logging standard. If you log output in this format we already have the heka transports available w/out any custom code.
Additionally, I like the idea of monitoring based on the number of messages generated at a specific log level! If you could stick these into statsd output we can set thresholds. It'd be a new experiment for us and it'll decouple metrics/logging from monitoring/alerting.
@mostlygeek Those are the log levels exposed by mozlog, which has heka integration built-in (that was my reason for choosing it). In these docs, I'm just trying to take advantage of the pre-defined granularity to partially solve our story around logging and partially come up with a story around error management (eg, bend vs break should imply there's a nearby log.critical vs log.warn).
I don't know if mozlog supports statsd format, I'll look into the simplest (1 hour or less) workaround.
One important consequence of not ever leaking user data is that we can't log the errors themselves at any level higher than verbose, since the error object might contain user data in it.
This means those log lines need to be quite liberally sprinkled through the code, since we will rarely see a stack trace in production.
whew, I really don't like the code I wound up with here. Too complicated for our current needs. The idea of never logging user info is cool in theory, but it's way better to anonymize users instead (as part of the logging / metrics pipeline), so that we can draw useful inferences. Example: how many users visit the site daily vs weekly.
I've officially fallen down the rabbit hole on this task, so, going to pause for now, meet with the datapipeline team, rethink what ops / metrics needs vs what debugging needs, and get back to this later in the sprint, with a tight timebox.
If you want to uniquely identify people but also keep it anonymous, you could sha1/md5 whatever key you're using. I would suggest something like email address instead of a monotonically increasing UID. If the UID is a uuid, then sha1(uuid) may be good enough.
from PR 86 review:
Let's use the Heka-style reporter and make sure the output works properly across the whole server.