mozilla / fxa-amplitude-send

Data pipeline scripts for importing Firefox Accounts event data to Amplitude.
Mozilla Public License 2.0

feat(scripts): implement a pubsub import script #90

Closed · philbooth closed 5 years ago

philbooth commented 5 years ago

Fixes #89. Fixes mozilla/fxa#866.

Ports the existing main import script to node and slaps a pubsub-friendly interface on it. Still in progress/untested. Doesn't yet include @shane-tomlinson's logic for ignoring configured client ids.
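
For context, a Pub/Sub-triggered background function in Node has roughly this shape. This is a minimal sketch of the interface being described, with the export name taken from the deployed function's logs below and `sendToAmplitude` as a hypothetical stand-in for the real forwarding logic:

```js
// Minimal sketch of a Pub/Sub-triggered background function (Node).
// `sendToAmplitude` is a hypothetical stand-in, not the PR's actual code.
exports.fxaAmplitudeImport = (message, context) => {
  // Pub/Sub delivers the payload base64-encoded in message.data.
  const event = JSON.parse(Buffer.from(message.data, 'base64').toString());
  return sendToAmplitude(event);
};
```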

philbooth commented 5 years ago

So I've deployed the current version to GCP, but based on some console.logs I added, it looks like the function gets invoked with individual events, which I wasn't expecting. I was hoping they'd come in batches, which we could then pass on to Amplitude in batches too.

I've deleted the cloud function for now so it's not spinning overnight, and will come back to this again in the morning.

jbuck commented 5 years ago

Oh man, that's annoying. I wonder why it doesn't support any batching...

philbooth commented 5 years ago

> it looks like the function gets invoked with individual events

> I wonder why it doesn't support any batching...

So, looking at the docs I think maybe it does, but we have to configure that explicitly ourselves.

It's not completely clear, but see for example:
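
For instance, here's a hedged sketch of the explicit batching configuration the Node Pub/Sub client exposes. The topic name and payload are assumptions, and note this is publisher-side batching (how messages are grouped on their way into the topic), which wouldn't by itself change how a push subscription invokes a function:

```js
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

// Publisher-side batching: the client buffers messages and flushes after
// 10 messages or 1 second, whichever comes first.
const topic = pubsub.topic('fxa-amplitude-events', {
  batching: {
    maxMessages: 10,
    maxMilliseconds: 1000,
  },
});

topic.publish(Buffer.from(JSON.stringify({ event_type: 'example' })))
  .then(messageId => console.log('published', messageId));
```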

philbooth commented 5 years ago

Okay, this script is in a working state now, inasmuch as it:

The events also show up in Amplitude itself; here's a chart showing the new OAuth activity events landing:

However, the logs also show there are a number of invocations that time out:

D      fxaAmplitudeImport  536316489295563  2019-05-01 08:43:16.186  Function execution took 60002 ms, finished with status: 'timeout'
E      fxaAmplitudeImport                   2019-05-01 08:43:16.187  Error: function execution attempt timed out. Function invocation was interrupted.

These occur without tripping any of the error/debug logging that I've set up, so I don't know where or why they're failing yet. It could be throttling at the Amplitude end, although I'd expect to see error responses for that.

Anyway, I've deleted the cloud function again for now to prevent it spinning pointlessly. The next thing we need to do is look into whether we can sort out batching on the pub/sub end (and maybe that will have an impact on the timeouts too, who knows).

philbooth commented 5 years ago

@jbuck I'm on PTO for a few days now, but if you have a chance to look into batching on the pub/sub end (see comment above) at some point, I'd be interested to know what you find out. And if we can't batch at that end, I guess we need to think about what the best approach is instead.

jbuck commented 5 years ago

I looked into setting up batching somehow, but it’s not possible to change batch settings for PubSub Push (1 message only) or GCS export (all messages over 1 hour, partitioned by $something).

I talked to :jason and he recommended switching to PubSub Pull, where you can manually control PubSub batch settings by message count or byte size.

:pb and I chatted in the Operations meeting and he'll change this PR to use PubSub Pull instead. I will set up something to run this on (probably k8s).
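
For reference, a minimal sketch of what pull-based consumption with explicit limits looks like in the Node client; the subscription name, limits, and handler here are assumptions:

```js
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

// With pull, flow control caps how many messages (or bytes) the client
// holds unacked at once, which is the batching knob push doesn't offer.
const subscription = pubsub.subscription('fxa-amplitude-send', {
  flowControl: {
    maxMessages: 100,
    maxBytes: 10 * 1024 * 1024,
  },
});

subscription.on('message', message => handleMessage(message)); // hypothetical handler
subscription.on('error', error => console.error(error));
```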

philbooth commented 5 years ago

> :pb and I chatted in the Operations meeting and he'll change this PR to use PubSub Pull instead.

Done!

Note that we still receive individual events when pulling from pubsub, so I've retained the dependency on async.cargo to group them into batches of (up to) ten, as sketched below. There is a WORKER_COUNT environment variable to control how much concurrency we do that with; if you want to do all the concurrency at the container level, set it to 1.
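
Roughly, the grouping works like this (a sketch of the idea, not the script's exact code; `sendBatchToAmplitude` and `subscription` are hypothetical stand-ins):

```js
const async = require('async');

// async.cargo buffers individual pubsub messages and hands the worker
// batches of up to 10; WORKER_COUNT presumably controls how many such
// workers run concurrently.
const cargo = async.cargo((messages, done) => {
  sendBatchToAmplitude(messages)
    .then(() => {
      messages.forEach(message => message.ack());
      done();
    })
    .catch(error => {
      // Redeliver the whole batch rather than lose events.
      messages.forEach(message => message.nack());
      done(error);
    });
}, 10);

subscription.on('message', message => cargo.push(message));
```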

The other environment variable that might not be obvious is IGNORED_EVENTS. Based on info from @shane-tomlinson, I believe the value should be:

{"fxa_activity - access_token_checked":[{"event_properties":{"oauth_client_id":"1b1a3e44c54fbb58"}},{"event_properties":{"oauth_client_id":"5882386c6d801776"}},{"event_properties":{"oauth_client_id":"3332a18d142636cb"}}],"fxa_activity - access_token_created":[{"event_properties":{"oauth_client_id":"1b1a3e44c54fbb58"}},{"event_properties":{"oauth_client_id":"5882386c6d801776"}},{"event_properties":{"oauth_client_id":"3332a18d142636cb"}}]}

When you run the script successfully the output should look something like this:

~/c/fxa-amplitude-send (pb/pubsub) $ node pubsub
2019-05-07T19:00:50.278Z success, count = 10
2019-05-07T19:00:51.241Z success, count = 2
2019-05-07T19:00:51.426Z success, count = 10
2019-05-07T19:00:52.106Z success, count = 10
2019-05-07T19:00:52.283Z success, count = 10
2019-05-07T19:00:52.519Z success, count = 10
2019-05-07T19:00:52.668Z success, count = 6

I was running against the stage project in Amplitude, and you can see those events arriving here:

https://analytics.amplitude.com/mozilla-corp/chart/ixwqt13

I'm out at the cricket tomorrow, but hopefully that's enough info to deploy this in my absence. If there are any problems, let me know and I can dig into them on Thursday.

philbooth commented 5 years ago

(Note: I just force-pushed that last commit again; the message.ack() / message.nack() logic had a stupid bug that I hadn't noticed.)

philbooth commented 5 years ago

@jbuck, it occurred to me that this would be much easier to deploy if I kept the pubsub stuff in a separate directory with its own package.json, so you don't get bogged down with building that nightmare node-parquet dependency inside the container. So I just pushed a commit containing that change, fyi.

irrationalagent commented 5 years ago

@philbooth @jbuck is this the last blocker to getting the OAuth events into Amplitude? It would be good to have this ASAP because it may affect some trailhead-related metrics.

philbooth commented 5 years ago

> is this the last blocker to getting the OAuth events into Amplitude?

Afaik, yep.

> it would be good to have this ASAP because it may affect some trailhead-related metrics

:+1: We spoke about it in the Ops meeting just now; it's still being worked on, so 🤞

jbuck commented 5 years ago

This is deployed in stage now, and it's getting the following errors:

2019-05-24T22:14:36.748Z https://api.amplitude.com/identify StatusCodeError: 429 - "too_many_requests"
    at new StatusCodeError (/home/node/fxa-amplitude-send/node_modules/request-promise-core/lib/errors.js:32:15)
    at Request.plumbing.callback (/home/node/fxa-amplitude-send/node_modules/request-promise-core/lib/plumbing.js:104:33)
    at Request.RP$callback [as _callback] (/home/node/fxa-amplitude-send/node_modules/request-promise-core/lib/plumbing.js:46:31)
    at Request.self.callback (/home/node/fxa-amplitude-send/node_modules/request/request.js:185:22)
    at Request.emit (events.js:196:13)
    at Request.<anonymous> (/home/node/fxa-amplitude-send/node_modules/request/request.js:1161:10)
    at Request.emit (events.js:196:13)
    at IncomingMessage.<anonymous> (/home/node/fxa-amplitude-send/node_modules/request/request.js:1083:12)
    at Object.onceWrapper (events.js:284:20)
    at IncomingMessage.emit (events.js:201:15)
    at endReadableNT (_stream_readable.js:1130:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:9)

followed by:

2019-05-24T22:14:51.542Z https://api.amplitude.com/httpapi TypeError: Cannot destructure property `message` of 'undefined' or 'null'.
    at /home/node/fxa-amplitude-send/index.js:268:11
    at Array.forEach (<anonymous>)
    at clearMessages (/home/node/fxa-amplitude-send/index.js:265:11)
    at /home/node/fxa-amplitude-send/index.js:116:7

but it does appear to be sending!

jbuck commented 5 years ago

Hm, after that last error it appeared to stop sending, though. I decided to change WORKER_COUNT to 1; now there are no rate-limiting errors and it's chewing through everything quite nicely.

philbooth commented 5 years ago

@jbuck I've set the default worker count to 1 in the script and added a condition so that it handles those 429 errors gracefully if they do occur. Is that what you wanted, or were you hoping for a smarter division of work based on device id so that we could use a greater number of workers?
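
For anyone following along, graceful handling of the 429s presumably looks something like this (a sketch under assumptions, not the actual change; `sendBatchToAmplitude` is the same hypothetical stand-in as above):

```js
// Inside the (now async) batch worker:
async function processBatch(messages) {
  try {
    await sendBatchToAmplitude(messages);
    messages.forEach(message => message.ack());
  } catch (error) {
    if (error.statusCode === 429) {
      // Rate-limited: nack so pubsub redelivers the batch later,
      // instead of dropping it or crashing the worker.
      messages.forEach(message => message.nack());
      return;
    }
    throw error;
  }
}
```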