m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0

ndt5 Load from GCP has inconsistent test_id string #294

Open gfr10598 opened 4 years ago

gfr10598 commented 4 years ago

Some partitions end up with duplicated rows. Within each duplicate pair, the only apparent difference is that the test_id is wrapped in double quotes in one row and unquoted in the other.

No other field seems to have quoted strings. When the rows are not duplicated, the test_ids are quoted. The duplication shows up on seemingly random dates in September and October. It is most likely related to parsing, so it may not be reproducible or stable.

2405 | 2019-10-19 12:01:00.398967 UTC | 2019/10/19/ndt-jd6jr_1565905163_00000000000B17C7.json | a56f857
2406 | 2019-10-19 12:01:00.398967 UTC | "2019/10/19/ndt-jd6jr_1565905163_00000000000B17C7.json" | a56f857
2407 | 2019-10-19 12:01:00.403030 UTC | "2019/10/19/ndt-d6vhk_1566050090_00000000001EE9C1.json" | a56f857
2408 | 2019-10-19 12:01:00.403030 UTC | 2019/10/19/ndt-d6vhk_1566050090_00000000001EE9C1.json | a56f857
2409 | 2019-10-19 12:01:00.423659 UTC | 2019/10/19/ndt-94qbs_1565918894_0000000000041C06.json | a56f857
2410 | 2019-10-19 12:01:00.423659 UTC | "2019/10/19/ndt-94qbs_1565918894_0000000000041C06.json" | a56f857
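
To check how widespread this is, one option is to group rows by timestamp and by the test_id with any surrounding quotes stripped, then report the groups that contain more than one spelling of the same test_id. The following is only a rough sketch in Python, not Gardener or parser code: the `(timestamp, test_id)` row shape and the function names `normalize_test_id` and `find_quote_duplicates` are assumptions made for illustration, and the sample rows are copied from the output above.

```python
from collections import defaultdict


def normalize_test_id(test_id):
    """Strip one pair of surrounding double quotes, if present."""
    if len(test_id) >= 2 and test_id.startswith('"') and test_id.endswith('"'):
        return test_id[1:-1]
    return test_id


def find_quote_duplicates(rows):
    """Return groups whose test_ids differ only by surrounding quotes.

    rows is an iterable of (timestamp, test_id) tuples; this shape is
    a hypothetical simplification of the real table rows.
    """
    groups = defaultdict(set)
    for timestamp, test_id in rows:
        # Key on the unquoted form so both spellings land in one group.
        groups[(timestamp, normalize_test_id(test_id))].add(test_id)
    # Keep only groups where more than one spelling was seen.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Sample rows taken from the output above.
rows = [
    ("2019-10-19 12:01:00.398967 UTC",
     "2019/10/19/ndt-jd6jr_1565905163_00000000000B17C7.json"),
    ("2019-10-19 12:01:00.398967 UTC",
     '"2019/10/19/ndt-jd6jr_1565905163_00000000000B17C7.json"'),
    ("2019-10-19 12:01:00.403030 UTC",
     '"2019/10/19/ndt-d6vhk_1566050090_00000000001EE9C1.json"'),
    ("2019-10-19 12:01:00.403030 UTC",
     "2019/10/19/ndt-d6vhk_1566050090_00000000001EE9C1.json"),
]

duplicates = find_quote_duplicates(rows)
for (timestamp, unquoted_id), spellings in duplicates.items():
    print(timestamp, unquoted_id, len(spellings))
```

A row whose test_id appears in only one form (quoted or unquoted) is not reported, so this flags only the quoted/unquoted pairs described in this issue.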
