mozilla / identity-ops

Tools and Chef cookbooks used by Mozilla Services Operations to provision and manage Persona

Update the cron job processing the metrics file to send metrics data to a kpiggybank collector #102

Open · kparlante opened 11 years ago

kparlante commented 11 years ago

Moving https://github.com/mozilla/browserid/issues/3529 out of the browserid repo.

High level goal: make the data from router-metrics.json and verifier-metrics.json available to the kpi-dashboard.

Right now this data is consumed by the metrics team's Pentaho servers. It gets processed on the browserid end. As part of that processing, it would be great if we could also send it to a kpiggybank collector (which is a dumb JSON blob store). The kpi dashboard could then pull from that collector, just as it does for the interaction_data.

So the proposed work items are:

The new kpiggybank collector needs to be accessible by the kpi-dashboard. We can schedule this before or after #92.

cc: @lloyd @shane-tomlinson @6a68 @callahad @jrgm

jaredhirsch commented 11 years ago

@kparlante Is deploying this going to be blocked by privacy review? If so, is it possible to get the ball rolling with that in parallel?

kparlante commented 11 years ago

@6a68 Yes, I'll get that ball rolling in parallel

jaredhirsch commented 11 years ago

@kparlante I'd like to make sure I understand what needs to be redacted for privacy.

The kettle scripts are unreadable XML nightmares. I'm not sure what processing they do. I can dig further, but I'd rather not try to import the inevitable Kettle-related Eclipse plugins, etc.

To kick things off, I'll paste in some logs from my local install, and you can tell me what needs to be changed. When it's time to actually implement, I'll fish around in the router and verifier source code, to be sure we identify all the fields that might be logged. (If this has already been done in some metrics bug, please let me know.)

router-metrics

Here are some log lines from my local router-metrics.json. In this case, it seems pretty clear that 'ip' is the field we'd want to remove. Is that enough? Do we need to anonymize the user-agent string, say, by truncating it?

{"level":"info","message":"{\"browser\":\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:22.0) Gecko/20100101 Firefox/22.0\",\"rp\":\"http://localhost:10002\",\"ip\":\"127.0.0.1\",\"type\":\"signin\",\"at\":\"Tue, 30 Jul 2013 16:14:53 GMT\"}","timestamp":"2013-07-30T16:14:53.834Z"}
{"level":"info","message":"{\"browser\":\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36\",\"rp\":\"http://localhost:10001\",\"ip\":\"127.0.0.1\",\"type\":\"signin\",\"at\":\"Tue, 30 Jul 2013 16:11:18 GMT\"}","timestamp":"2013-07-30T16:11:18.454Z"}
{"level":"info","message":"{\"rp\":\"undefined://\",\"ip\":\"127.0.0.1\",\"type\":\"signin\",\"at\":\"Tue, 30 Jul 2013 23:34:26 GMT\"}","timestamp":"2013-07-30T23:34:26.780Z"}

verifier-metrics

Here are some sample log lines from localhost for verifier-metrics.json. I don't see PII in here. Is there anything we'd need to modify?

{"level":"info","message":"{\"result\":\"success\",\"rp\":\"http://testdomain.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:16 GMT\"}","timestamp":"2013-07-30T23:38:16.569Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"need assertion and audience\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:36:52 GMT\"}","timestamp":"2013-07-30T23:36:52.284Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"need assertion and audience\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:35 GMT\"}","timestamp":"2013-07-30T23:38:35.656Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"bad signature in chain\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.513Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"can't get public key for no.such.domain: no.such.domain is not a browserid primary - non-200 response code to /.well-known/browserid\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.617Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"certificate chaining is not yet allowed\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.758Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"no certificates provided\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.570Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"malformed signature\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.571Z"}
jaredhirsch commented 11 years ago

Actually, from poking at the logs on login.dev.anosrep.org, it looks like the IP is always 127.0.0.1, presumably because the router process lives behind the proxy.

I guess we'd just want to extract RP and timestamp, maybe user-agent, and call it a day? Do we need to worry about deduping at all? Based on the lower volume in router-metrics.json vs router.json, I'm guessing the deduping has already been done.

kparlante commented 11 years ago

@6a68

jaredhirsch commented 11 years ago

what the metrics kettle scripts currently do

based on the word doc I got via email

verifier processing:

router processing:

jaredhirsch commented 11 years ago

it occurs to me that it might be cleaner to update the existing server-side kpi code in browserid, either to ship this extra info to the existing piggybank, or to allow different data to be sent to two piggybanks.

for instance, the interaction_data wsapi already fuzzes timestamps, has the coarse_user_agent_parser, and has the kpiggybank transport configured.
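
Roughly what I mean by those two transformations (my own sketch, not the actual interaction_data code; the ten-minute bucket size and the parsing rule are assumptions):

```js
// Fuzz a millisecond timestamp down to a coarse bucket (bucket size assumed here).
function fuzzTimestamp(ms) {
  var BUCKET = 10 * 60 * 1000; // ten minutes, in ms
  return Math.floor(ms / BUCKET) * BUCKET;
}

// Coarsen a user-agent string to browser family + major version only.
// Hypothetical stand-in for whatever coarse_user_agent_parser actually emits.
function coarseUserAgent(ua) {
  var m = /(Firefox|Chrome|Safari|MSIE)[\/ ](\d+)/.exec(ua || '');
  return m ? m[1] + ' ' + m[2] : 'Other';
}
```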

if you think about it, what's been suggested so far is:

  1. verifier does something
  2. verifier logs data to disk
  3. some other thing processes the data
  4. that other thing sends the data to a kpiggybank

but really, there's no reason we need to involve that log file:

  1. verifier sends data to the interaction_data wsapi directly
  2. interaction_data either sends verifier/router metrics to a second kpiggybank, or just sends it to the existing endpoint

I'm still reading the interaction_data wsapi code, so maybe there are limitations I'm not aware of?
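
To sketch what that shorter path could look like (host, path, and payload shape are all hypothetical; the point is just that the verifier could post directly, fire-and-forget, without touching a log file):

```js
var https = require('https');

// Fire-and-forget: post one verify event to a collector and never block on it.
function reportVerify(event) {
  var body = JSON.stringify(event);
  var req = https.request({
    hostname: 'kpiggybank.example.com',  // hypothetical collector host
    path: '/wsapi/interaction_data',     // or a second, metrics-only endpoint
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body)
    }
  }, function (res) {
    res.resume(); // drain and ignore the response body
  });
  req.on('error', function (err) {
    // metrics must never slow down or break the verify path; log and move on
    console.error('metrics post failed (ignored):', err.message);
  });
  req.end(body);
}
```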

jaredhirsch commented 11 years ago

Hmm. Reading the code makes me wonder: do we really need the *-metrics.json files at all? Are they used for anything other than feeding data to the metrics team? If we don't need those files, we can delete lib/metrics.js from the browserid repo. @shane-tomlinson do you have context here?

jaredhirsch commented 11 years ago

I think I know why the IP is always 127.0.0.1: we should be using the x-forwarded-for header, but aren't
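
Something along these lines (a minimal sketch, not the repo's actual code) is probably what we want:

```js
// Behind a proxy, the real client address arrives in the X-Forwarded-For header;
// the socket's remote address is just the proxy (hence the constant 127.0.0.1).
function clientIp(req) {
  var xff = req.headers['x-forwarded-for'];
  // X-Forwarded-For can be a comma-separated chain; the left-most entry is the
  // original client. Only trust it when the request came through our own proxy.
  return xff ? xff.split(',')[0].trim() : req.connection.remoteAddress;
}
```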

jaredhirsch commented 11 years ago

@kparlante is it necessary to use a separate kpiggybank instance for this data? if we were to just add some additional logging to the existing installation, would that work? it seems most expedient to me

jrgm commented 11 years ago

A couple of considerations:

  1. I think keeping things decoupled has value, i.e., don't couple the primary thing (verify) with something secondary.
  2. Some of the log lines that are in metrics are the best information we have for debugging a problem. (That information shouldn't be in the metrics log, but it is - there's an issue somewhere to improve that, but I can't find it at the moment).
kparlante commented 11 years ago

@6a68 Ok, after thinking about this a bit more...

jaredhirsch commented 11 years ago

@kparlante You raise some good points! TL;DR - There's tension between what we could do, how much time I've got, and how soon we want RP data. I'm aiming low, but trying to go fast.

The Right Way to handle all of this, IMO, is to keep every individual node stateless (since nodes can vanish without warning in AWS), forward every log message to a central collector (maybe logstash+rabbitmq/S3), and filter messages to appropriate transports (whether flat files, the metrics team endpoint, our third-party CEF security service, relational DBs, or RRDBs like statsd). Doing it The Right Way would remove the weird scattering of log libraries that we currently have, which would be great. But it would take a lot of time to do properly and would require a lot of ops work too, which is not so great if we want to see these RP stats in Q3.
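
As a toy illustration of the "filter messages to appropriate transports" part (the sinks here are stand-ins, nothing more):

```js
// Route each log record to a sink based on a single field, instead of
// scattering logger-specific calls through the codebase. Illustration only.
var sinks = {
  metrics:  function (rec) { /* e.g. forward to the metrics collector */ },
  security: function (rec) { /* e.g. emit a CEF event */ },
  debug:    function (rec) { /* e.g. append to a local file or stdout */ }
};

function dispatch(record) {
  (sinks[record.kind] || sinks.debug)(record);
}
```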

I've been trying to add as little code as possible, in the interest of actually shipping a fix right away, and deferring the larger problem. This isn't the best way, but it's the quickest way I can think of. I'm really happy to discuss alternatives, though.

To address your questions directly:

jaredhirsch commented 11 years ago

@jrgm you got a gravatar?! Awesome!

Re: 1, I'm not quite sure what you mean, but I don't intend to block the verifier on kpi; it'd be more of a fire-and-forget configuration. Re: 2, we're not removing the existing metrics files, just sending additional data to kpiggybank. The logging situation isn't awesome, I totally agree.

Does this make sense?

kparlante commented 11 years ago

@6a68 Makes total sense, no objections to your logic. I like your longer term Right Way vision. Good points about AWS filesystem -- very interesting!

jaredhirsch commented 11 years ago

@kparlante Cool. I'm trying to be verbose in this ticket for the sake of future readers (like me after I lose context :-P). Please don't hesitate to challenge what I'm doing and/or offer alternatives; it makes for much better net results.

Related question: the privacy review doc claims that timestamps must be in unix format, although the example it gives is actually a JS timestamp, and it looks like we're logging JS timestamps (millisecond, not second, resolution). For instance, on localhost:10001, if I open, then close, the dialog, I see this in stdout:

browserid (75972): debug: Simulate write to KPI Backend DB - [
<snipped>
browserid (75972):     "timestamp": 1376523000000,
<other stuff...>

It seems like the timestamp should be 1376523000 (three fewer zeros). Have there been discussions around the timestamp format? Is this actually converted inside kpiggybank, maybe?
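
For reference, the two formats only differ by a factor of 1000:

```js
var jsTimestamp = 1376523000000;                    // JS-style, milliseconds (what we log)
var unixTimestamp = Math.floor(jsTimestamp / 1000); // 1376523000, classic unix seconds
```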

jaredhirsch commented 11 years ago

I've pushed some additional rough cuts to 6a68/issue_3528_metrics_to_piggybank. I'll push on this a bit more tomorrow. @shane-tomlinson it'd be great if you could take a look :-)

jaredhirsch commented 11 years ago

@kparlante would it hurt anything if we didn't aggregate the verifier/router metrics before sending to the piggybank?

kparlante commented 11 years ago

@6a68 No timestamp conversion happens in kpiggybank; the timestamp field in the blobs, as they land in the dashboard's db, is in milliseconds, not seconds. For expediency's sake, go with millisecond resolution.

Not aggregating shouldn't be a problem from an implementation perspective. Privacy and extra traffic are the only concerns I could think of, and it doesn't seem like either should be an actual problem.

jaredhirsch commented 11 years ago

@kparlante fantastic, in that case the PR against my fork is pretty much ready to go. I'll get some eyes on it Monday.

jaredhirsch commented 11 years ago

Opened a pull request against the upstream browserid repo. Do we need to hold back until the privacy review is completed?

jaredhirsch commented 11 years ago

Answering my own question from last week, the privacy bug (https://bugzilla.mozilla.org/show_bug.cgi?id=903581) has not moved. I bumped it to try to get some attention.

kparlante commented 11 years ago

New and improved privacy megabug: https://bugzilla.mozilla.org/show_bug.cgi?id=909980

kparlante commented 11 years ago

@shane-tomlinson, @6a68 : We shouldn't hold back on this code waiting for privacy review -- we're on the calendar. We can add a flag to turn it on/off (as we did with KPIs) to be safe on the timing. We have an ok from the security folks (given that we're using the same infrastructure as KPIs), pending any other questions from us.