Update the cron job processing the metrics file to send metrics data to a kpiggybank collector

kparlante commented 11 years ago

Moving https://github.com/mozilla/browserid/issues/3529 out of the browser id repo.

High level goal: make the data from router-metrics.json and verifier-metrics.json available to the kpi-dashboard.

Right now this data is consumed by the metrics team's pentaho servers. It gets processed on the browserid end. As part of that processing, it would be great if we could also send it to a kpiggybank collector (which is a dumb json blob store). The kpi dashboard could then pull from that collector, just as it does for the interaction_data.

So the proposed work items are:

(edit: moved this task into #117) setup a separate kpiggybank collector for these metrics
modify the chron job to send the json data from router-metrics.json and verifier-metrics.json to the new collector (having stripped out user's ip)

The new kpiggybank collector needs to be accessible by the kpi-dashboard. We can schedule this before or after #92.

cc: @lloyd @shane-tomlinson @6a68 @callahad @jrgm

jaredhirsch commented 11 years ago

@kparlante Is deploying this going to be blocked by privacy review? If so, is it possible to get the ball rolling with that in parallel?

kparlante commented 11 years ago

@6a68 Yes, I'll get that ball rolling in parallel

jaredhirsch commented 11 years ago

@kparlante I'd like to make sure I understand what needs to be redacted for privacy.

The kettle scripts are unreadable XML nightmares. I'm not sure what processing they do. I can dig further, but I'd rather not try to import the inevitable Kettle-related Eclipse plugins, etc.

To kick things off, I'll paste in some logs from my local install, and you can tell me what needs to be changed. When it's time to actually implement, I'll fish around in the router and verifier source code, to be sure we identify all the fields that might be logged. (If this has already been done in some metrics bug, please let me know.)

router-metrics

Here are some log lines from my local router-metrics.json. In this case, it seems pretty clear that 'ip' is the field we'd want to remove. Is that enough? Do we need to anonymize the user-agent string, say, by truncating it?

{"level":"info","message":"{\"browser\":\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:22.0) Gecko/20100101 Firefox/22.0\",\"rp\":\"http://localhost:10002\",\"ip\":\"127.0.0.1\",\"type\":\"signin\",\"at\":\"Tue, 30 Jul 2013 16:14:53 GMT\"}","timestamp":"2013-07-30T16:14:53.834Z"}
{"level":"info","message":"{\"browser\":\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36\",\"rp\":\"http://localhost:10001\",\"ip\":\"127.0.0.1\",\"type\":\"signin\",\"at\":\"Tue, 30 Jul 2013 16:11:18 GMT\"}","timestamp":"2013-07-30T16:11:18.454Z"}
{"level":"info","message":"{\"rp\":\"undefined://\",\"ip\":\"127.0.0.1\",\"type\":\"signin\",\"at\":\"Tue, 30 Jul 2013 23:34:26 GMT\"}","timestamp":"2013-07-30T23:34:26.780Z"}

verifier-metrics

Here are some sample log lines from localhost for verifier-metrics.json. I don't see PII in here. Is there anything we'd need to modify?

{"level":"info","message":"{\"result\":\"success\",\"rp\":\"http://testdomain.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:16 GMT\"}","timestamp":"2013-07-30T23:38:16.569Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"need assertion and audience\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:36:52 GMT\"}","timestamp":"2013-07-30T23:36:52.284Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"need assertion and audience\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:35 GMT\"}","timestamp":"2013-07-30T23:38:35.656Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"bad signature in chain\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.513Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"can't get public key for no.such.domain: no.such.domain is not a browserid primary - non-200 response code to /.well-known/browserid\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.617Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"certificate chaining is not yet allowed\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.758Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"no certificates provided\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.570Z"}
{"level":"info","message":"{\"result\":\"failure\",\"reason\":\"malformed signature\",\"rp\":\"http://fakesite.com:8080\",\"type\":\"verify\",\"at\":\"Tue, 30 Jul 2013 23:38:37 GMT\"}","timestamp":"2013-07-30T23:38:37.571Z"}

jaredhirsch commented 11 years ago

Actually, from poking at the logs onboard login.dev.anosrep.org, it looks like the IP is always 127.0.0.1, presumably because the router process lives behind the proxy.

I guess we'd just want to extract RP and timestamp, maybe user-agent, and call it a day? Do we need to worry about deduping at all? Based on the lower volume in router-metrics.json vs router.json, I'm guessing the deduping has already been done.

kparlante commented 11 years ago

@6a68

We should probably capture success/failure from the verifier-metrics
We should probably fuzz the timestamp to the nearest 10 minute interval
I do think the kettle scripts match user-agent string to a browser, there may be a privacy reason for doing so. For KPI data, this is handled by this code: https://github.com/mozilla/browserid/blob/dev/lib/coarse_user_agent_parser.js (As a first pass, the user-agent string is not critical btw)
Not sure about deduping
It surprises me that the IP is always 127.0.0.1, but I suppose that is neither here nor there as we don't need it. I think the Kettle scripts count unique IPs and even do some work to identify region, but that data is never used. (I will send you the document I have about the ETL scripts).

jaredhirsch commented 11 years ago

what the metrics kettle scripts currently do

based on the word doc I got via email

verifier processing:

ignore lines where result != success or type != verify
aggregate information by date and RP
report the counts

router processing:

aggregate information on date, rp, and browser name, browser version, and browser category (??) by parsing the user-agent string

jaredhirsch commented 11 years ago

it occurs to me that it might be cleaner to update the existing server-side kpi code in browserid to either ship this extra info to the existing piggybank, or modify that code to allow different data to be sent to two piggybanks.

for instance, the interaction_data wsapi already fuzzes timestamps, has the coarse_user_agent_parser, and has the kpiggybank transport configured.

if you think about it, what's been suggested so far is:

verifier does something
verifier logs data to disk
some other thing processes the data
that other thing sends the data to a kpiggybank

but really, there's no reason we need to involve that log file:

verifier sends data to the interaction_data wsapi directly
interaction_data either sends verifier/router metrics to a second kpiggybank, or just sends it to the existing endpoint

I'm still reading the interaction_data wsapi code, so maybe there are limitations I'm not aware of?

jaredhirsch commented 11 years ago

Hmm. Reading the code makes me wonder: do we really need the *-metrics.json files at all? Are they used for anything other than feeding data to the metrics team? if we don't need those files, we can delete lib/metrics.js from the browserid repo. @shane-tomlinson do you have context here?

jaredhirsch commented 11 years ago

I think I know why the IP is always 127.0.0.1: we should be using the x-forwarded-for header, but aren't

jaredhirsch commented 11 years ago

@kparlante is it necessary to use a separate kpiggybank instance for this data? if we were to just add some additional logging to the existing installation, would that work? it seems most expedient to me

jrgm commented 11 years ago

A couple of considerations:

I think keeping things decoupled has value. i.e., don't couple the primary thing (verify) with something secondary.
Some of the log lines that are in metrics are the best information we have in debugging a problem. (That information shouldn't be in the metrics log, but they are - there's an issue somewhere to improve that, but I can't find it at the moment).

kparlante commented 11 years ago

@6a68 Ok, after thinking about this a bit more...

Do we want to make an https connection for every metrics write? Or write to a file and then make a connection once a day? I assume eventually the kpiggybank will reside in the same vpc as the browserid servers, and we'll have some scaling strategy, so perhaps in the long run a connection for each write isn't a problem.
The advantage of the filesystem is that writes are quick and reliable. Arguably we should have had a similar setup for kpi data, instead of connecting to the kpiggybank server from the same servers running the main services. FWIW, kpiggybank does go down! We have lost data. That said, we're not actually proposing stopping the existing metrics infrastructure just yet, as the existing metrics dashboards will continue to rely on those files. So that data will be there for jrgm's debugging for now.
It does make sense to reuse coarse_user_agent_parser and the timestamp fuzzing code if possible, but repeating this logic and the server connection code in some script isn't the end of the world?
I see jrgm's point about not making verify have any knowledge of/dependencies on kpiggybank. Isn't there a plan to move the verifier out on its own?
Does it make sense to encapsulate the kpiggybank post within the metrics module instead of outside of it?

jaredhirsch commented 11 years ago

@kparlante You raise some good points! TL;DR - There's tension between what we could do, how much time I've got, and how soon we want RP data. I'm aiming low, but trying to go fast.

The Right Way to handle all of this, IMO, is to keep every individual node stateless (since nodes can vanish without warning in AWS), forward every log message to a central collector (maybe logstash+rabbitmq/S3), and filter messages to appropriate transports (whether flat files, the metrics team endpoint, our third-party CEF security service, relational DBs, or RRDBs like statsd). Doing it The Right Way would remove the weird scattering of log libraries that we currently have, which would be great. It would take a lot of time to do properly, and would require a lot of ops work, too, not so great if we want to see these RP stats in Q3.

I've been trying to add as little code as possible, in the interest of actually shipping a fix right away, and deferring the larger problem. This isn't the best way, but it's the quickest way I can think of. I'm really happy to discuss alternatives, though.

To address your questions directly:

An https connection per metrics write? I'm not sure how costly this will be in practice, it's a great question.
Filesystem writes aren't terribly quick on AWS, and there is a hidden assumption that nodes/filesystems won't just disappear. If we wait till once-a-day rolls around to scp over some data, then we should expect to lose significant chunks of data due to nodes dying before they can send their data over. Planning for VM death is the fun part of moving to AWS :-)
If we copy code instead of reusing it, we're inevitably going to wind up with the copied code not catching bug fixes in the main repo. meh.
I don't think the verifier will be extracted near-term; there are a lot of consequences for how we build/deploy. We can probably ignore that for now.
I'd rather not encapsulate kpiggybank calls inside lib/metrics, or vice versa, because they're roughly equivalent: they are transports that put data in different places. It's nice to see, for instance, the cef_logger calls next to the logger calls throughout the wsapis (example). I think I'd like to stick with that approach, rather than have to read inside lib/metrics to realize that that's where we filter a subset of messages off to the kpi endpoint.

jaredhirsch commented 11 years ago

@jrgm you got a gravatar?! Awesome!

re: 1, I'm not quite sure what you mean, but I don't intend to block verifier on kpi, more of a fire-and-forget configuration. re: 2, We're not removing the existing metrics files, just sending additional data to kpiggybank. The logging situation isn't awesome, I totally agree.

Does this make sense?

kparlante commented 11 years ago

@6a68 Makes total sense, no objections to your logic. I like your longer term Right Way vision. Good points about AWS filesystem -- very interesting!

jaredhirsch commented 11 years ago

@kparlante Cool, trying to be verbose in this ticket for the sake of future readers (like me after I lose context :-P). Please don't hesitate to challenge what I'm doing and/or offer alternatives, it makes for much better net results.

Related question: the privacy review doc claims that timestamps must be in unix format, although the example is actually a JS timestamp, and it looks like we're actually logging JS timestamps (millisecond, not second, resolution). For instance, on localhost:10001, if I open, then close, the dialog, I see this in stdout:

browserid (75972): debug: Simulate write to KPI Backend DB - [
<snipped>
browserid (75972):     "timestamp": 1376523000000,
<other stuff...>

It seems like the timestamp should be 1376523000 (three fewer zeros). Have there been discussions around the timestamp format? Is this actually converted inside kpiggybank, maybe?

jaredhirsch commented 11 years ago

I've pushed some additional rough cuts to 6a68/issue_3528_metrics_to_piggybank. I'll push on this a bit more tomorrow. @shane-tomlinson it'd be great if you could take a look :-)

jaredhirsch commented 11 years ago

@kparlante would it hurt anything if we didn't aggregate the verifier/router metrics before sending to the piggybank?

kparlante commented 11 years ago

@6a68 no timestamp conversion happens in kpiggybank; the timestamp field in the blobs as they land in the dashboard's db are in milliseconds, not seconds. For expediency's sake, go with millisecond resolution.

Not aggregating shouldn't be a problem from an implementation perspective. Privacy/extra traffic are the only concerns I could think of, and doesn't seem like either should be an actual problem.

jaredhirsch commented 11 years ago

@kparlante fantastic, in that case the PR against my fork is pretty much ready to go. I'll get some eyes on it Monday.

jaredhirsch commented 11 years ago

Opened a pull request against the upstream browserid repo. Do we need to hold back until the privacy review is completed?

jaredhirsch commented 11 years ago

Answering my own question from last week, the privacy bug (https://bugzilla.mozilla.org/show_bug.cgi?id=903581) has not moved. I bumped it to try to get some attention.

kparlante commented 11 years ago

New and improved privacy megabug: https://bugzilla.mozilla.org/show_bug.cgi?id=909980

kparlante commented 11 years ago

@shane-tomlinson, @6a68 : We shouldn't hold back on this code waiting for privacy review -- we're on the calendar. We can add a flag to turn it on/off (as we did with KPIs) to be safe on the timing. We have an ok from the security folks (given that we're using the same infrastructure as KPIs), pending any other questions from us.

mozilla / identity-ops