mozilla / jsonschema-transpiler

Compile JSON Schema into Avro and BigQuery schemas
Mozilla Public License 2.0

Nested list #90

Closed acmiyaguchi closed 4 years ago

acmiyaguchi commented 4 years ago

Nested lists are not handled correctly.

$ echo '{
  "type": "array",
  "items": {
    "type": "array",
    "items": {
      "type": "array",
      "items": [
        {"type": "integer"}
      ],
      "additionalItems": false
    }
  }
}' | jsonschema-transpiler --type bigquery --tuple-struct

results in a schema where the nested array levels are collapsed into a single REPEATED record:

[
  {
    "fields": [
      {
        "mode": "REQUIRED",
        "name": "f0_",
        "type": "INT64"
      }
    ],
    "mode": "REPEATED",
    "name": "root",
    "type": "RECORD"
  }
]

This PR fixes the problem by adding an intermediate "list" layer that can be used for unnesting:

[
  {
    "fields": [
      {
        "fields": [
          {
            "mode": "REQUIRED",
            "name": "f0_",
            "type": "INT64"
          }
        ],
        "mode": "REPEATED",
        "name": "list",
        "type": "RECORD"
      }
    ],
    "mode": "REPEATED",
    "name": "root",
    "type": "RECORD"
  }
]
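The wrapping rule behind this output can be sketched in Python. This is a minimal illustration under stated assumptions, not the transpiler's actual implementation: `to_field` is a hypothetical helper that covers only integers, arrays, and tuples, and it assumes tuples are always emitted as structs, as with `--tuple-struct`.

```python
import json


def to_field(schema, name, mode="REQUIRED"):
    """Translate a tiny JSON Schema subset into a BigQuery-style field dict.

    Hypothetical helper for illustration only.
    """
    kind = schema["type"]
    if kind == "integer":
        return {"mode": mode, "name": name, "type": "INT64"}
    if kind == "array" and isinstance(schema["items"], list):
        # Tuple: each position becomes a struct member named f0_, f1_, ...
        fields = [to_field(s, f"f{i}_") for i, s in enumerate(schema["items"])]
        return {"fields": fields, "mode": mode, "name": name, "type": "RECORD"}
    if kind == "array":
        element = to_field(schema["items"], "list")
        if element["mode"] == "REPEATED":
            # An array of arrays: keep the inner array as a "list" field
            # inside a repeated record instead of collapsing the two levels.
            return {"fields": [element], "mode": "REPEATED",
                    "name": name, "type": "RECORD"}
        # An array of scalars or structs: the element itself is repeated.
        element.update(mode="REPEATED", name=name)
        return element
    raise ValueError(f"unsupported type: {kind}")


schema = {
    "type": "array",
    "items": {
        "type": "array",
        "items": {
            "type": "array",
            "items": [{"type": "integer"}],
            "additionalItems": False,
        },
    },
}
result = [to_field(schema, "root")]
print(json.dumps(result, indent=2, sort_keys=True))
```

For the schema from the report, the sketch produces the same structure as the fixed output: a repeated "root" record holding a repeated "list" record with the single field f0_.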

See this gist for the result of the verification script: https://gist.github.com/acmiyaguchi/619b113f0b536480919ecf90a4028036. This matches what we have seen with the third-party modules and untrusted modules pings, which are likely the only pings with nested arrays.

acmiyaguchi commented 4 years ago

I found and fixed an issue with the Avro code; however, it shouldn't have any bearing on the BigQuery schemas.

acmiyaguchi commented 4 years ago

I found a bug with the Avro code, but I'm fairly confident that everything works as expected now. I've run a few queries against the BigQuery tables and the raw ndjson files to verify that the values are correct.

SELECT
  SUM(list.f0_),
  SUM(list.f1_)
FROM
  test_avro.telemetry__untrustedModules_v4,
  UNNEST(root.payload.combinedStacks.stacks) AS stacks,
  UNNEST(stacks.list) AS list

Row  f0_      f1_
1    62325.0  1.3835058055987E21

cat data/telemetry.untrustedModules.4.ndjson | jq -cr '.payload.combinedStacks.stacks | .[]| .[] | join(",")' | python3 -c "import sys; x=[tuple(map(float, x.split(','))) for x in sys.stdin.readlines()]; print(list(map(sum, zip(*x))))"
[62325.0, 1.3835058055987192e+21]
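The Python one-liner above is dense, so the same column-sum check could be written out as follows. This is a minimal sketch: `column_sums` is a hypothetical helper, the sample lines are made up for illustration, and every inner list is assumed to have the same number of numeric entries.

```python
import json


def column_sums(ndjson_lines):
    """Sum each position of the inner lists under payload.combinedStacks.stacks."""
    totals = []
    for line in ndjson_lines:
        doc = json.loads(line)
        for stack in doc["payload"]["combinedStacks"]["stacks"]:
            for frame in stack:
                if not totals:
                    # Size the accumulator from the first inner list seen.
                    totals = [0.0] * len(frame)
                for i, value in enumerate(frame):
                    totals[i] += float(value)
    return totals


# Made-up sample documents, one JSON object per line.
sample = [
    '{"payload": {"combinedStacks": {"stacks": [[[0, 10], [1, 20]], [[2, 30]]]}}}',
    '{"payload": {"combinedStacks": {"stacks": [[[3, 40]]]}}}',
]
print(column_sums(sample))
```

To run it against the real data, pass an open file handle for data/telemetry.untrustedModules.4.ndjson instead of the sample list.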

and

SELECT
  parent.f1_,
  parent.f2_,
  COUNT(*)
FROM
  test_avro.telemetry__event_v4,
  UNNEST(root.payload.events.parent) parent
GROUP BY
  1,
  2
ORDER BY
  3 DESC
Row f1_ f2_ f0_  
1 addonsManager install 1647  
2 addonsManager update 1596  
3 devtools.main tool_timer 595  
4 addonsManager disable 563  
5 addonsManager enable 555  
6 devtools.main exit 380  
7 devtools.main enter 375  
8 devtools.main close 325  
9 devtools.main open 310  
10 uptake.remotecontent.result uptake 284  
11 addonsManager uninstall 223  
12 devtools.main edit_rule 147  
13 activity_stream end 41  
14 devtools.main execute_js 38  
15 devtools.main activate 19  
16 devtools.main deactivate 15  
17 extensions.data migrateResult 12  
18 activity_stream event 11  
19 devtools.main object_expanded 9  
20 devtools.main pause_on_exceptions 8  
21 devtools.main edit_html 6  
22 devtools.main sidepanel_changed 3  
23 devtools.main filters_changed 3  
24 addonsManager sideload_prompt 2  
25 devtools.main jump_to_definition 1  
26 devtools.main jump_to_source 1  
27 security.ui.identitypopup open 1  
28 security.ui.identitypopup click 1  
cat data/telemetry.event.4.ndjson | jq -cr '.payload.events.parent | select(. != null) | .[] | [.[1], .[2]] | join("|")' | sort | uniq -c | sort -r | sed 's/^[[:space:]]*//g' | awk '{printf "%s|%s\n", $2,$1}'
f1_ f2_ f0_
addonsManager install 1647
addonsManager update 1596
devtools.main tool_timer 595
addonsManager disable 563
addonsManager enable 555
devtools.main exit 380
devtools.main enter 375
devtools.main close 325
devtools.main open 310
uptake.remotecontent.result uptake 284
addonsManager uninstall 223
devtools.main edit_rule 147
activity_stream end 41
devtools.main execute_js 38
devtools.main activate 19
devtools.main deactivate 15
extensions.data migrateResult 12
activity_stream event 11
devtools.main object_expanded 9
devtools.main pause_on_exceptions 8
devtools.main edit_html 6
devtools.main sidepanel_changed 3
devtools.main filters_changed 3
addonsManager sideload_prompt 2
security.ui.identitypopup open 1
security.ui.identitypopup click 1
devtools.main jump_to_source 1
devtools.main jump_to_definition 1
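The two listings above agree on every count; only the order of the rows tied at a count of 1 differs, which ORDER BY 3 DESC leaves unspecified. The same grouping could also be checked in Python. This is a minimal sketch: `count_events` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def count_events(ndjson_lines):
    """Count (event[1], event[2]) pairs under payload.events.parent."""
    counts = Counter()
    for line in ndjson_lines:
        doc = json.loads(line)
        events = (doc.get("payload") or {}).get("events") or {}
        # Skip documents without parent events, like `select(. != null)` in jq.
        for event in events.get("parent") or []:
            counts[(event[1], event[2])] += 1
    return counts


# Made-up sample documents, one JSON object per line.
sample = [
    '{"payload": {"events": {"parent": [[1, "a", "x"], [2, "a", "x"], [3, "b", "y"]]}}}',
    '{"payload": {"events": {}}}',
    '{"payload": {"events": {"parent": [[4, "a", "x"]]}}}',
]
for (f1, f2), n in count_events(sample).most_common():
    print(f1, f2, n)
```

Comparing the resulting Counter with the query result as a dictionary, rather than line by line, avoids false mismatches caused by tie ordering.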