mozilla / crashathon


Automated Spark job #1

Open robhudson opened 8 years ago

robhudson commented 8 years ago

We need to figure out how to automate Spark jobs that write to the shared private S3 bucket.

Telemetry stated that they have a shared private S3 bucket where we can write data that may contain sensitive information (e.g. clientIds). This Spark job will need to grab the daily set of crash pings and write them to S3 (from which our system will pull them down and load them into Elasticsearch).
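For reference, a minimal sketch of what that daily job could look like, assuming the `get_pings` helper from `moztelemetry` that the scheduled clusters ship with (exact arguments may differ) and a hypothetical bucket/prefix:

```python
# Rough sketch only; bucket name and get_pings arguments are assumptions.
import json

from moztelemetry.spark import get_pings  # assumed available on a.t.m.o clusters


def export_daily_crash_pings(sc, submission_date, bucket="telemetry-private-shared"):
    # Grab one day's worth of crash pings...
    pings = get_pings(sc, app="Firefox", doc_type="crash",
                      submission_date=submission_date)
    # ...and write them to S3 as JSON lines for our importer to pull down.
    path = "s3n://{}/crashathon/crash-pings/{}".format(bucket, submission_date)
    pings.map(json.dumps).saveAsTextFile(path)
```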

jezdez commented 8 years ago

Example iPython notebook from @robhudson (screenshot attached: 2016-04-06 11:32:46)

jezdez commented 8 years ago

@robhudson Can you elaborate on whether the reason to use the existing private bucket is purely convenience (re-using an existing one) or a security requirement? Could we create our own private bucket to be less dependent on the metrics team?

robhudson commented 8 years ago

Good question. From the little information I gathered, the shared private bucket was only for convenience. It sounded like there would be no issue using boto to create our own bucket. To do so I think we'd need to get the AWS keys into the Spark job somehow?
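Something along these lines should work with boto, assuming we can get credentials into the job via environment variables (the bucket and key names below are made up):

```python
# Sketch only: credentials via environment variables, hypothetical bucket/key names.
import os

import boto

conn = boto.connect_s3(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
bucket = conn.create_bucket("crashathon-crash-pings")

# Writing a day's data is then just setting a key on the bucket.
key = bucket.new_key("crash-pings/20160406.json")
key.set_contents_from_string('{"example": "payload"}')
```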

jezdez commented 8 years ago

I looked more into the current telemetry infrastructure, how it's provisioned, and where we come in to run our Spark job that fills the datastore with data for our API.

An initial thought was to submit Spark jobs via Heroku's pseudo-cron so the backfill job for our datastore could run on a window smaller than one day.

The Spark clusters that are provisioned via https://analysis.telemetry.mozilla.org/cluster run on AWS Elastic MapReduce with a lot of custom configuration management, and are killed automatically after 24 hours.

We can't easily run our own Spark cluster without a non-trivial amount of resources (let alone on Heroku); e.g. the default EC2 node type for the a.t.m.o clusters is c3.4xlarge.

So if we can live with backfilling data from Spark jobs into our datastore only once a day, we should simply use https://analysis.telemetry.mozilla.org/cluster/schedule to schedule a daily run of an iPython notebook that writes the data we want to S3.

Then we write another importer that fetches that data from S3 and puts it into our datastore. That way we can blow away the datastore and reindex from S3 at any time.
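Roughly, assuming JSON-lines files in S3 and an Elasticsearch datastore (bucket, prefix and index names are placeholders):

```python
# Importer sketch: S3 (JSON lines) -> Elasticsearch. Names are placeholders.
import json

import boto
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def reindex_from_s3(bucket_name, prefix, index="crash-pings"):
    es = Elasticsearch()
    bucket = boto.connect_s3().get_bucket(bucket_name)
    actions = (
        {"_index": index, "_type": "ping", "_source": json.loads(line)}
        for key in bucket.list(prefix=prefix)
        for line in key.get_contents_as_string().splitlines()
        if line.strip()
    )
    # Bulk-index everything; re-running this after blowing away the
    # datastore gives us the reindex-from-S3 behaviour described above.
    bulk(es, actions)
```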

jezdez commented 8 years ago

Further updates:

I can't create a new bucket from a 24h Spark cluster and will probably have to ask the Telemetry team to do it for us.

We were also advised by Mark Reid that we should not store our own S3 access credentials in a Python notebook. That means – assuming the credentials inherited by the 24h Spark cluster only have read-only permissions for a Telemetry-created S3 bucket – that we can't easily write to S3 from a Spark job on the current infrastructure. I'll try to get confirmation from someone on the Telemetry team about this theory.

robhudson commented 8 years ago

There is an option to store data in a public S3 bucket. To do so we would need to make sure our output doesn't contain any sensitive data. I think sensitive data includes clientId, however, which is what we intended to aggregate on.
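For illustration only, one hypothetical way to keep per-client aggregation without exposing raw clientIds in a public bucket would be to replace them with a salted hash before exporting:

```python
# Hypothetical scrubbing step; the salt would have to live outside the notebook.
import hashlib

SALT = "replace-with-a-secret-salt"


def scrub(ping):
    scrubbed = dict(ping)
    client_id = scrubbed.pop("clientId", None)
    if client_id is not None:
        # Stable hash so per-client aggregation still works.
        scrubbed["clientIdHash"] = hashlib.sha256(
            (SALT + client_id).encode("utf-8")).hexdigest()
    return scrubbed
```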

jezdez commented 8 years ago

I met with Georg Fritzsche today to ask him about storing data; he referred me to Mark Reid and pointed me to a few more examples. One thing he noted is that we should be able to re-use existing buckets to write data publicly, as you say.

Examples:

- https://github.com/mfinkle/user-data-analytics/blob/master/android-clients.ipynb
- https://github.com/mfinkle/user-data-analytics/blob/master/android-events.ipynb
- https://github.com/mozilla-services/data-pipeline/blob/master/reports/fennec_dashboard/summarize_csv.ipynb

The latter has a "CSV and S3 utility functions" section.

There is now also a list of software projects that could be interesting for further digestion: https://wiki.mozilla.org/CloudServices/DataPipeline#Code