mozilla / jsonschema-transpiler

Compile JSON Schema into Avro and BigQuery schemas
Mozilla Public License 2.0
42 stars 10 forks source link

Add a flag to drop fields instead of casting #76

Closed acmiyaguchi closed 5 years ago

acmiyaguchi commented 5 years ago

This PR should (mostly) fix #75.

The approach in this pull-request is as follows:


At a high-level, we should be dropping fields during transpilation if the subschema is incompatible (tuple validation) or if the schema is underspecified (an empty object). This adds error-handling when translating between schemas (ast -> bigquery and ast -> avro in particular).

I've tried to keep the changes minimal. I've left most of the unit-tests alone aside from refactoring so it is easier to add the context variable. I've added a compatible field to the TestCase object which makes it easy to retrofit the existing tests to determine whether fields will be dropped with the new error handling methods.

See https://github.com/mozilla/jsonschema-transpiler/pull/76#issuecomment-505630943 for a sanity check on the full schema repository.

acmiyaguchi commented 5 years ago

@fbertsch The transpiler should be ready for testing as of 5dc9375. I believe it can be run via cargo install --git https://github.com/acmiyaguchi/jsonschema-transpiler.git --rev 5dc9375. Do you have any suggestions for generating a diff against mozilla-pipeline-schemas using the --resolve drop option?

fbertsch commented 5 years ago

@acmiyaguchi my recommendation for a diff is to run the entire MSG generate_schemas script, but use the --resolve option, and push to a different branch in MPS (I use test_generated_schemas, instead of generated_schemas). Then GH will show an easy diff between the two branches.

acmiyaguchi commented 5 years ago

@fbertsch Thanks! I figured out how to get the diff I wanted. I posted in-channel, but here's the commit showing which fields will be missing:

https://github.com/acmiyaguchi/mozilla-pipeline-schemas/commit/ea5d3889a2ad01612340403f02c1651e7ff49f97

  1. Add the entire telemetry namespace to the allowlist
  2. Run with the following diff applied to MSG

    diff --git a/Dockerfile b/Dockerfile
    index dd46b0a..c9ed47d 100644
    --- a/Dockerfile
    +++ b/Dockerfile
    @@ -24,7 +24,7 @@ ENV CARGO_INSTALL_ROOT=${HOME}/.cargo
    ENV PATH ${PATH}:${HOME}/.cargo/bin
    
    # Install a tagged version of jsonschema-transpiler
    -RUN cargo install jsonschema-transpiler --version 1.0.0
    +RUN cargo install --git https://github.com/mozilla/jsonschema-transpiler.git --rev 951daef
    
    # Upgrade pip
    RUN pip install --upgrade pip
    diff --git a/bin/schema_generator.sh b/bin/schema_generator.sh
    index 4589236..4161c23 100755
    --- a/bin/schema_generator.sh
    +++ b/bin/schema_generator.sh
    @@ -149,7 +149,7 @@ function main() {
     # Add transpiled BQ schemas
     find . -type f -name "*.schema.json" | while read -r fname; do
         bq_out=${fname/schema.json/bq}
    -        jsonschema-transpiler --type bigquery "$fname" > "$bq_out"
    +        jsonschema-transpiler --type bigquery --resolve drop "$fname" > "$bq_out"
     done
    
     # Keep only allowed schemas
  3. Modify --resolve drop to --resolve cast to create a diff from the previous commit

It looks like this is doing the right thing, but I'll make sure to include a more involved test.

acmiyaguchi commented 5 years ago

This is ready for another look!

acmiyaguchi commented 5 years ago

@badboy Thanks for the review! Adding in error handling after the fact was quite a change, but hopefully the changes can be a bit more incremental (I hope...).