Add `--normalize-case` option for snake_casing column names

mozilla / jsonschema-transpiler

Compile JSON Schema into Avro and BigQuery schemas

Mozilla Public License 2.0


Add `--normalize-case` option for snake_casing column names #79

Closed acmiyaguchi closed 5 years ago

acmiyaguchi commented 5 years ago

This PR fixes #77 by adding an option that snake_cases all column names in a schema. To enable it, pass the `--normalize-case` flag on the command line. The option is off by default.

I've chosen heck as the casing library, since it seems to have the largest number of active users. It uses the unicode_segmentation crate to find word boundaries and performs snake_casing consistently across mixed casing.

I've refactored the code to remove extra clones and to make the order of the functions flow better when reading top-down. I also added a few comments here and there.

acmiyaguchi commented 5 years ago

I've updated the mozilla-pipeline-schemas scripts to make it easy to diff the output of different transpiler options.

I've created a diff of the --normalize-case option: https://gist.github.com/acmiyaguchi/3f526c440b67ebe469bcb6ab2da5123f

$ scripts/mps-generate-schemas.sh bq1 --type bigquery --resolve drop
...
80/132 succeded

$ scripts/mps-generate-schemas.sh bq2 --type bigquery --resolve drop --normalize-case
...
80/132 succeded

$ diff -q bq1/ bq2/
Files bq1/coverage.coverage.1.schema.json and bq2/coverage.coverage.1.schema.json differ
Files bq1/eng-workflow.hgpush.1.schema.json and bq2/eng-workflow.hgpush.1.schema.json differ
Files bq1/firefox-launcher-process.launcher-process-failure.1.schema.json and bq2/firefox-launcher-process.launcher-process-failure.1.schema.json differ
Files bq1/mozdata.event.1.schema.json and bq2/mozdata.event.1.schema.json differ
...

$ diff -q bq1/ bq2/ | wc -l
      45

$ diff bq1/ bq2/ > normalize_case.diff

acmiyaguchi commented 5 years ago

There are a couple of interesting cases from the diff that I want to highlight:

l2cacheKB -> l2cache_kb
speedMHz -> speed_m_hz
D2DEnabled -> d2d_enabled
DWriteEnabled -> d_write_enabled
activeGMPlugins -> active_gm_plugins
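
The cases above are consistent with two word-boundary rules: split where a lowercase letter is followed by an uppercase letter, and split before the last uppercase letter of a run when that letter is followed by a lowercase one, with digits leaving the current case unchanged. The following is a minimal, ASCII-only sketch of those rules using only the standard library. It is not the code from this PR and not heck's source; `to_snake_case_sketch` is a hypothetical name, and the rules are inferred only from the five examples listed here.

```rust
/// Hypothetical helper (not part of the transpiler): ASCII-only snake_casing
/// that follows the two boundary rules inferred from the examples above.
fn to_snake_case_sketch(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len() + 4);
    // Case of the most recent letter in the current word.
    // Digits are pushed to the output but leave these flags unchanged.
    let mut prev_upper = false;
    let mut prev_lower = false;

    for (i, &c) in chars.iter().enumerate() {
        if !c.is_ascii_alphanumeric() {
            // Any non-alphanumeric character acts as a separator.
            if !out.is_empty() && !out.ends_with('_') {
                out.push('_');
            }
            prev_upper = false;
            prev_lower = false;
            continue;
        }
        if c.is_ascii_uppercase() {
            let next_is_lower = chars
                .get(i + 1)
                .map_or(false, |n| n.is_ascii_lowercase());
            // Rule 1: lowercase followed by uppercase ("cacheKB" -> "cache|KB").
            // Rule 2: end of an uppercase run ("GMPlugins" -> "GM|Plugins").
            if (prev_lower || (prev_upper && next_is_lower)) && !out.ends_with('_') {
                out.push('_');
            }
            prev_upper = true;
            prev_lower = false;
        } else if c.is_ascii_lowercase() {
            prev_upper = false;
            prev_lower = true;
        }
        out.push(c.to_ascii_lowercase());
    }

    // Drop a trailing separator left by input that ends in punctuation.
    while out.ends_with('_') {
        out.pop();
    }
    out
}
```

Calling `to_snake_case_sketch("speedMHz")` applies rule 1 before `M` and rule 2 before `H`, which is why the result splits `MHz` into two words.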

acmiyaguchi commented 5 years ago

@badboy This PR has changed a bit since the last review, so I'm retagging you for review. We want a consistent implementation of snake-casing across the transpiler and ingestion, so I reimplemented the logic using regular expressions and string manipulation instead of relying on heck. The output is still the same as heck's, but the approach is portable to Java and Python 3.

I've added 3 separate test cases to ensure that the behavior stays the same:

alphanum_3 - all strings of length 3 drawn from the alphabet "aA7"
word4 - all strings of length 4 drawn from the alphabet "aA7"
mps-diff-integration - the strings that were generated by diffing mozilla-pipeline-schemas with this PR using heck
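
For illustration, the exhaustive inputs described for the first two test sets can be enumerated with a few lines of standard-library Rust. This is only a sketch of how such inputs could be generated, not the script used in this PR; `all_strings` is a hypothetical name.

```rust
/// Hypothetical helper: every string of exactly `len` characters drawn from
/// `alphabet`, in lexicographic order of alphabet positions.
fn all_strings(alphabet: &[char], len: usize) -> Vec<String> {
    // Start from the single empty string and extend it one character at a time.
    let mut out = vec![String::new()];
    for _ in 0..len {
        let mut next = Vec::with_capacity(out.len() * alphabet.len());
        for prefix in &out {
            for &c in alphabet {
                let mut s = prefix.clone();
                s.push(c);
                next.push(s);
            }
        }
        out = next;
    }
    out
}
```

With the three-character alphabet `['a', 'A', '7']`, this yields 3^3 = 27 strings of length 3 and 3^4 = 81 strings of length 4, which is small enough to compare two implementations on every input.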

I also did the following:

dropped the regex crate in favor of onig, a wrapper around oniguruma, for lookaround support
made to_snake_case accessible via a public interface for testing

whd commented 5 years ago

We're going to need a version bump for this new option so that we can reference it in MSG.