mozilla / leanplum-data-export

A docker image that runs a script to export Leanplum data into BigQuery
5 stars 4 forks source link

Allow for jagged data rows #10

Closed fbertsch closed 4 years ago

fbertsch commented 4 years ago

The Leanplum data doesn't always fill in every row with all columns of data. If we allow for jagged row this will make all subsequent columns NULL.

BenWu commented 4 years ago

Do you have an example of this? From what I can tell missing values are already handled by setting the value to null. I might not be understanding correctly.

fbertsch commented 4 years ago

Unfortunately we aren't yet handling it. The error is happening on some incoming CSV files, and we get an error like this:

[2019-11-21 22:28:41,406] {logging_mixin.py:95} INFO - [2019-11-21 22:28:41,406] {pod_launcher.py:104} INFO - google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/moz-fx-data-shared-prod/queries/41605445-a4c7-4d63-84e3-6bfca88bd4d0?maxResults=0&location=US: Error while reading table: tmp.leanplum_sessions_v1_20191120, error message: CSV table references column position 35, but line starting at position:49201408 contains only 12 columns.

However, looking over the documentation, I'm not seeing a specific way to create an external table using CSV and allow jagged rows, only when loading data from a CSV. We could use that path, but then we also need to drop those lat/lon columns.

Another option here could be to drop those rows explicitly.

BenWu commented 4 years ago

Would it be an option to preprocess the csvs to pad every row with commas? Should be feasible since the csvs aren't that big

fbertsch commented 4 years ago

That could work, essentially allowing jagged rows ourselves. Give it a shot.

BenWu commented 4 years ago

Actually if we do a load table instead of an insert query we can allow jagged rows https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#csv-options

That seems like the simplest approach

BenWu commented 4 years ago

Or just adding external_config.options.allow_jagged_rows = True might work

fbertsch commented 4 years ago

The key part for an external load is to drop lat/lon. If it works as-is with jagged rows, that would be ideal

BenWu commented 4 years ago

After a long investigation, turns out the issue isn't actually jagged rows: the issue is corrupt input.

Jagged rows like this work fine (for some reason, I would expect to not work unless the config is set):

a,b,c
1,2
1

1,2,3

The rows that were breaking things have a deviceModel value of 'Unknown łB -珠ΙάLY΀'. There's a \n in there and possibly some other encoded things my editor can't read. I confirmed that removing this line fixes things.

A quick test shows that adding the allow_quoted_newlines option to the external config fixes it but I'll run a full test through airflow tomorrow.

fbertsch commented 4 years ago

That's great, thanks for the investigation Ben!