m-lab / mlab-vis-pipeline

M-Lab Visualization Dataflow pipelines for transforming ndt.all into the needed aggregation tables in bigtable.

Pipeline refactoring and deployment updates #53

Closed iros closed 6 years ago

iros commented 6 years ago

- Added Prometheus tracking
- Split up BQ/BT pipelines
- Changed deployment to use K8s
- Dockerized
- Upgraded to Dataflow 1.9.1
- Changed permission management for various services to be obtained from K8s
- Updated README instructions to match



pboothe commented 6 years ago

Reviewed 87 of 330 files at r1. Review status: 61 of 220 files reviewed at latest revision, 4 unresolved discussions.


Makefile, line 12 at r2 (raw file):

# KEY_FILE=<key file path> make test_connection
test_connection:
  ./tools/bigtable/test_connection.sh ${API_MODE}

Please use the bash variable substitution trick where the whole thing fails if it is not set.

${THING?}
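For anyone reading along, the substitution pboothe is pointing at is bash's `${VAR?message}` form, which aborts expansion with an error when the variable is unset. A minimal sketch (the `API_MODE` name comes from this Makefile; the messages are illustrative):

```shell
#!/bin/bash
# ${VAR?message} makes expansion of an unset VAR a hard error:
# the shell prints the message to stderr and exits non-zero.
unset API_MODE
if bash -c ': "${API_MODE?must be set to sandbox|staging|production}"' 2>/dev/null; then
  echo "BUG: unset API_MODE was accepted"
else
  echo "unset API_MODE rejected"
fi
API_MODE=sandbox bash -c 'echo "running with API_MODE=${API_MODE?}"'
```

Note that pboothe later walks this back for Makefiles specifically, since `${...}` in a recipe is expanded by Make before the shell ever sees it.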

requirements.txt, line 4 at r2 (raw file):

sqlparse>=0.2.0
google-cloud==0.27.0
google-cloud-happybase==0.26.0

As always, pinning makes me nervous, but if it is what we have to do then it is what we have to do. For all of the pins that are not >=, please file an issue in the repo.


run_bq.sh, line 11 at r2 (raw file):


set -e
set -x

set -u too please
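As context for readers, `set -u` makes the shell treat any expansion of an unset variable as an error, complementing the `set -e`/`set -x` lines already in the script. A quick sketch (the variable name is invented for illustration):

```shell
#!/bin/bash
# Without -u, an unset variable silently expands to ""; with -u, the
# shell aborts instead of running the command with a missing argument.
unset MISSING_VAR
bash -c 'echo "without -u: [$MISSING_VAR]"'      # expands to the empty string
if bash -c 'set -u; echo "$MISSING_VAR"' 2>/dev/null; then
  echo "BUG: set -u did not catch the unset variable"
else
  echo "set -u caught the unset variable"
fi
```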


dataflow/pom.xml, line 1 at r2 (raw file):

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

I have no idea what this is. So if you need a review of it, ask someone else. But it looks fine and if it works then that's okay.


Comments from Reviewable

pboothe commented 6 years ago

Checked some things. Still have others. Incremental comments as I work my way forwards.



critzo commented 6 years ago

Makefile, line 12 at r2 (raw file):

Previously, pboothe (Peter Boothe) wrote…
Please use the bash variable substitution trick where the whole thing fails if it is not set. ``` ${THING?} ```

So to clarify, are you suggesting that any instance of ${API_MODE} should be ${API_MODE?} ? Same treatment for ${KEY_FILE} ?



critzo commented 6 years ago

dataflow/pom.xml, line 1 at r2 (raw file):

Previously, pboothe (Peter Boothe) wrote…
I have no idea what this is. So if you need a review of it, ask someone else. But it looks fine and if it works then that's okay.

Me either, but it's needed for building the jar, I believe.



critzo commented 6 years ago

requirements.txt, line 4 at r2 (raw file):

Previously, pboothe (Peter Boothe) wrote…
As always, pinning makes me nervous, but if it what we have to do then it is what we have to do. For all the ones that are not `>=`, please file an issue in the repo.

Just clarifying - this isn't a blocking comment, right? Are you saying you would like to see an issue filed to test each of these changes?

  * `ipaddr==2.1.11`
  * `sqlparse==0.2.0`



critzo commented 6 years ago

I'm preparing a commit to address my comments and those that @pboothe has identified thus far, and was going to push it to the pipeline-keepalive branch. Any concerns with this? @georgiamoon ?

pboothe commented 6 years ago

Feel free to add a commit that addresses my comments to this PR. That is why I gave partial comments early - to enable that kind of thing.


Reviewed 11 of 330 files at r1. Review status: 72 of 220 files reviewed at latest revision, 4 unresolved discussions.


Makefile, line 12 at r2 (raw file):

Previously, critzo wrote…
So to clarify, are you suggesting that any instance of `${API_MODE}` should be `${API_MODE?}` ? Same treatment for `${KEY_FILE}` ?

Ugh. I misunderstood/misremembered how Makefiles work. All the $(API_MODE) with parens should be ${API_MODE} with braces, but no ?. The question-mark thing is a bash thing; in Make language, $() and ${} are both Make variables and not bash variables. What I want to ensure is that this make rule is never called without setting the API_MODE shell variable. I had thought that the ? trick would do that, but that's a Bash trick and not a Make trick.

For now, I suspect the best thing is just to be consistent in how we refer to the variable. Eventually, I would like us to add a test that I can be sure fails if the variable is unset.


requirements.txt, line 4 at r2 (raw file):

Previously, critzo wrote…
Just clarifying- this isn't a blocking comment, right? Are you saying you would like to see an issue filed to test each of these changes?: * `ipaddr==2.1.11` * `sqlparse==0.2.0`

One issue for both would be fine too. But yes: pinning is a form of technical debt, and I would like that debt to be acknowledged in GitHub issues so that we all know that it is a lurking problem in a particular codebase.
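Concretely, in `requirements.txt` terms (the package names are from this repo's file; the comments are ours):

```text
# exact pins -- reproducible today, but they age silently; per this review,
# each should be tracked by a GitHub issue:
ipaddr==2.1.11
google-cloud==0.27.0
google-cloud-happybase==0.26.0
# lower-bounded range -- picks up compatible fixes without a code change:
sqlparse>=0.2.0
```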



critzo commented 6 years ago

Makefile, line 12 at r2 (raw file):

Previously, pboothe (Peter Boothe) wrote…
Ugh. I misunderstood/misremembered how Makefiles work. All the `$(API_MODE)` with parens should be `${API_MODE}` with braces, but no `?`. The question-mark thing is a bash thing; in Make language, `$()` and `${}` are both Make variables and not bash variables. What I want to ensure is that this make rule is never called without setting the `API_MODE` shell variable. I had thought that the `?` trick would do that, but that's a Bash trick and not a Make trick. For now, I suspect the best thing is just to be consistent in how we refer to the variable. Eventually, I would like us to add a test that I can be sure fails if the variable is unset.

Ok. I can confirm that these scripts fail when API_MODE and/or KEY_FILE are not included, either in the bash command that runs them or as env vars. So I think we're good.



critzo commented 6 years ago

@pboothe @georgiamoon 6caa833 addresses our review comments thus far.

pboothe commented 6 years ago

Lots of files mix spaces and tabs. This is to be discouraged. The Tab character (\t, rendered on reviewable as >>) should never appear in a file, unless it is Go code or a Makefile. Please search and replace all tabs with the appropriate number of spaces (and please don't just open every file and search and replace - something like find . -name \*.java | xargs sed -i -e 's/\t/ /g' can do it much more reliably and quickly).

The sheer quantity of JSON files is another code smell that worries me - they should probably instead be generated?

The Makefiles are brittle in a way that worries me.

I have no idea what a Dao is (except that it can not be told?)

I'm glad to see that Travis has been successfully called in.

The shell scripts should all have set -u added.

The python code should be run through our python linter if possible (by using the git-hooks repo as a submodule), and we should find a Java linter too and include it in that repo and use it here.

However, all of the above is a lot of stuff, and the code, as constituted, seems like it works. I'd like to step back from the process, as it is turning from a large code review done in my spare work time into a large code audit not involving the original author that will likely take me another two full-time weeks that I can ill spare; third-party code audits are arduous and painful for everyone.

Fix the comments you want to fix, add issues for the rest. I'll remove myself from the "required reviewers" list to allow the process to proceed forward in the way Georgia thinks is best.


Reviewed 67 of 330 files at r1, 1 of 1 files at r3. Review status: 138 of 220 files reviewed at latest revision, 7 unresolved discussions, some commit checks broke.


Makefile, line 9 at r4 (raw file):


# Call
# API_MODE=sandbox|staging|production \

I think that this approach is bug-prone and would prefer that the makefile explicitly check that the variables are set.
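One common Make idiom for the explicit check pboothe asks for is a guard prerequisite that fails with a clear message when the variable is empty. A sketch only (the guard target is invented here; the `API_MODE` name and recipe come from this Makefile):

```make
# Fail fast with a readable message instead of letting the script break later.
guard-API_MODE:
	@test -n "${API_MODE}" || { echo "API_MODE must be set (sandbox|staging|production)" >&2; exit 1; }

test_connection: guard-API_MODE
	./tools/bigtable/test_connection.sh ${API_MODE}
```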


requirements.txt, line 4 at r2 (raw file):

Previously, pboothe (Peter Boothe) wrote…
One issue for both would be fine too. But yes: pinning is a form of technical debt, and I would like that debt to be acknowledged in GitHub issues so that we all know that it is a lurking problem in a particular codebase.

Done.


run_bt.sh, line 8 at r4 (raw file):


set -e
set -x

set -u too


dataflow/pom.xml, line 1 at r2 (raw file):

Previously, critzo wrote…
Me either, but it's needed for building the jar, I believe.

Giant XML files checked into version control are a code smell, so the need for this worries me.


dataflow/src/main/java/mlab/dataviz/entities/BQPipelineRun.java, line 1 at r4 (raw file):

package mlab.dataviz.entities;

Java is so good at requiring work that feels like work but kind of isn't work? It's not quite boilerplate it's just ... Java? This comment is not actionable, but I thought a note of sympathy was appropriate.


dataflow/src/main/java/mlab/dataviz/entities/BQPipelineRunDao.java, line 4 at r4 (raw file):

import java.sql.SQLException;

public interface BQPipelineRunDao {

Dao? What's a Dao? I know that bao is delicious steamed filled buns you can get at dim-sum...

You can rename a bunch of stuff, or just put a comment in this file explaining the term (presumably an acronym? But BQ is capitalized earlier... so much confusion).



iros commented 6 years ago

Thank you for your comments @pboothe. I'll ignore the ones I agree with, but have a few clarifying questions:

  1. Can you clarify what worries you about the makefile and what you'd like to see done differently? I don't know how to interpret your comment in an actionable way.

  2. The JSON files are mostly auto generated by the Python scripts. Only a few are done by hand. Which files specifically?

  3. All the Dao things came from the specific version of the datastore API that I had to use due to many a conflict with other versions. I agree that it is overkill, but I was modeling this on the sample code the Google API authors provided as part of this version. Happy to rework if you have other suggestions that are less opaque.

  4. I was using the Python linter, but I will double check. Can you point me to what you mean when you say "our linter" just to make sure I'm referencing the same thing?

I think this one is for @critzo:

  1. The pom.xml is Maven's build file. It cannot be auto-generated and is very delicate. Because most of the API libraries for the various Google services have dependencies that conflict with each other, this ordering ensures they play nice together.


pboothe commented 6 years ago

Clarifying my comments:

  1. In order to call make you also have to set some shell variables, but the Makefile contains no checks that those variables were set; it just (currently) happens to fail because some of the commands fail without them.

  2. Autogenerated things should not be put in code repos. Only the generation scripts. If the json files are generated, then the generation should be part of the installation and update process, and only the generation code should be checked in.

  3. I just don't know what "Dao" means. Like, why that name? Is it an acronym? I'm asking a far simpler and stupider question than you are imagining :)

  4. Including the linter in pre-commit scripts and as a submodule is very helpful, and means that future people will use the same linter you do. https://github.com/m-lab/git-hooks is the repo I was thinking of.

  5. Brittle huge not-really-human-readable XML files make me sad. If it is what maven needs, then I guess it is what maven needs, but that's a bummer :(
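To make item 4 concrete: the idea is to vendor the shared hooks (the m-lab/git-hooks repo pboothe links) into the project and point git at them, so every clone runs the same linter before each commit. The sketch below fakes the hooks directory locally instead of adding the real submodule; everything except the repo URL is illustrative:

```shell
#!/bin/bash
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
# Real use: git submodule add https://github.com/m-lab/git-hooks git-hooks
mkdir git-hooks
printf '#!/bin/sh\ntouch .lint-ran\n' > git-hooks/pre-commit   # stand-in for the linter
chmod +x git-hooks/pre-commit
git config core.hooksPath git-hooks    # requires git >= 2.9
git config user.email dev@example.com
git config user.name dev
touch file && git add file && git commit -qm init
test -f .lint-ran && echo "shared pre-commit hook ran"
```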

critzo commented 6 years ago

dataflow/data/bigquery/pipeline_config.json, line 7 at r4 (raw file):

        "outputTable": "data_viz.base_downloads_ip_by_day",
        "startDateFromTable": "data_viz.base_downloads_ip_by_day",
        "ndtTable": "`measurement-lab.public_v3_1.ndt_all`",

References in this file to "measurement-lab.public_v3_1.ndt_all" will need to be changed to "measurement-lab.release.ndt_all", which I will add in an upcoming commit.



critzo commented 6 years ago

@iros @georgiamoon While troubleshooting and documenting the application data flow, I was tracing the BigQuery Pipeline and found these static values for staging in the merge-upload-download job. Should these be changed to use variable names from pipeline_config.json?

Also, these lines in AddISPsPipeline.java reference a couple of test tables, data_viz_testing.maxmind_asn_test and data_viz_testing.maxmind_asn_testout. Should these be different?

Also found these rows in AddLocationPipeline.java and these rows in AddLocalTimePipeline.java

iros commented 6 years ago

@critzo - up to you. The main methods in those files are not actually used. You could run them individually, but they were used for testing and that is left over as documentation. We can either remove it, or update it. Those values should just match the format in pipeline-config, otherwise they can be strings.

critzo commented 6 years ago

@iros Thanks for confirming that. We were thinking these were likely leftover from testing and not actually called by the current pipeline.