Closed RobertTStrong closed 7 years ago
I suspect this will have to be handled separately from this...
The current notebook https://github.com/mozilla-services/data-pipeline/blob/master/reports/update-orphaning/Update%20orphaning%20analysis%20using%20longitudinal%20dataset.ipynb writes the JSON files and they appear to be copied to the following location https://s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/data/spohl@mozilla.com/Update+Orphaning+View/
This end result destination for this new notebook should be along the lines of https://s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/data/rstrong@mozilla.com/Update+Orphaning+View/
Please have a look at our contribution guidelines. For example, in this case, the commits should be squashed into a single one. The title of the commit should use the imperative mood and ideally also have a description.
@RobertTStrong a few general questions before I start reviewing this:
And a couple of requests:
Could you please provide some context for these change either here, or in the bug, or as a link to the docs?
What kind of context would you like? This is to get data from telemetry and save it as JSON so it can be used by the dashboard. Could you provide me with a link of an existing production notebook doc so I can use it as a guideline?
Is this notebook or a previous version of it already used anywhere in a production environment? (I want to understand how stable this code should be considered)
This notebook is practically a complete rewrite of the previous one along with a lot of additions. The data is present in the same location as this notebook saves it to which is: https://analysis-output.telemetry.mozilla.org/app-update/data/out-of-date/
You can see the web page (already reviewed but it won't be published until after this review) here: http://exchangecode.com/robert/work/temp.html
The notebook is scheduled to run every Sunday and has successfully run twice as scheduled and twice with the date manipulated to get data from previous weeks. I also used it to manually to generate earlier data going back to mid October 2016. All of this data is in the location mentioned above.
Did you test it on a real dataset already?
Yes, see above
What are your expectations for this review? (A full review of every [LOC] vs only review specific parts)
I've tried to optimize this notebook and in its early stages I was able to get the same dataset as the original notebook in around 1/5th of the time. So, that is of interest to me especially with future work I will likely need to perform.
Best practices of course.
I think the code is fairly straightforward besides the use of variables to store common sql strings so the use of sql and rdd's should be reviewed somewhat easily. The strings are also printed so it is simple to see the end result.
The understanding of what the data represents in telemetry is as I see it out of scope for this review since it would require understanding the what and how for each histogram. I will run the notebook by someone that does have an understanding of those aspects.
And a couple of requests:
What's the difference between 8b400e6 and 7d5445e ? The commit messages are almost identical, could you please either them or add a commit description?
That was fee72a8 and was squashed. Specifically, I updated the notebook to save the file to a specific location vs. the generic location the file is saved for a scheduled spark job.
This repository doesn't have any merge commit; it would be great if you removed the merge commit and use a rebase instead.
Will do
The dashboard you linked to and the rest of the info in your comment gives me good context, thanks.
Since I don't use git that often I thought it best to check if b7d4201 properly removed the merge commit, etc.
I'm happy with reviewing a single commit, thanks for squashing the old commits. I'll go ahead and review the code and if it needs a rebase we will do it before merging it.
Wow, this is a lot of code for a notebook. Please consider moving this code to a separate library for the future, so that you can write tests and what not. That is probably your best option with a notebook containing so much logic. Here is some general feedback, the rest is inline:
Perhaps someday but this is also used to further investigate the results so it is quite handy to have it in its current form.
There is a repeating pattern in the code where a mapper function is defined, it's then applied to the pings, the partial counts are saved somewhere and the filtered results are passed to the next mapper. It may be easier to understand if this was defined as a loop over the list of mapper functions. This won't change anything in terms of performance of course, but it may make easier to review it.
Once again, having it in this format assists with investigation. Just a few weeks ago I noticed that there was an increase of no updates found in out of date versions and was able to easily figure out why by adding a couple of additional columns to the sql and examining the values for a around 10 records.
There are several code style issues, like the use of camelCase instead of snake_case and the presence of redundant parenthesis. Not a big deal, but again it makes things less readable.
I changed all of the python to snake_case How do you recommend handling long lines? As I understand it the python style guide recommends parenthesis. https://www.python.org/dev/peps/pep-0008/#multiline-if-statements
One last thing, it would be interesting to see if there is any performance benefit in converting those mapper functions in spark udf and only use Dataframes instead of RDDs. I'll let you decide if that's something worth to try or not.
I much prefer working with dataframes but my attempts to do so ended up with worse performance so I went with rdd's.
I forgot to comment out no_upload = True. Hopefully that is ok for this review.
Actually, it might be better if the notebook landed with no_upload = True in case someone else runs it. That way they won't accidentally overwrite the data.
Is there a time when the longitudinal parquet is guaranteed to be completed? I noticed that the results were different in last week's data set from when it was scheduled vs. running it a few days later. The results from a couple of other week's scheduled runs didn't change.
Perhaps someday but this is also used to further investigate the results so it is quite handy to have it in its current form.
sure, I see your point; if one day you feel like converting it to a proper python etl script, @harterrt just created a template project for that.
I changed all of the python to snake_case How do you recommend handling long lines? As I understand it the python style guide recommends parenthesis. https://www.python.org/dev/peps/pep-0008/#multiline-if-statements
Yes, parenthesis are generally preferred over terminating lines with backslashes.
I much prefer working with dataframes but my attempts to do so ended up with worse performance so I went with rdd's.
sounds good
Is there a time when the longitudinal parquet is guaranteed to be completed? I noticed that the results were different in last week's data set from when it was scheduled vs. running it a few days later. The results from a couple of other week's scheduled runs didn't change.
The longitudinal job runs once a week at midnight on Sunday morning UTC. It usually takes between 3h and 6h. There is no guarantee though that the job will be completed. In some cases there may be temporary failures and what not, so the job may take much more time to complete. Datasets that require some coordination with the longitudinal job are usually scheduled using Airflow and it's dependency graph functionality. There's no way to do that from ATMO at the moment.
This is review is not fully comprehensive. Please ask someone with better domain knowledge to review it as well (sorry if I stress this 😺 )
Not a problem and will do so.
mhowell went through the histograms and gave a thumbs up so this can land.
Notebook for the new update out of date dashboard. The web page is being update in https://github.com/mozilla/telemetry-dashboard/pull/286