Closed fbertsch closed 4 years ago
Do you have an example of this? From what I can tell missing values are already handled by setting the value to null. I might not be understanding correctly.
Unfortunately we aren't yet handling it. The error is happening on some incoming CSV files, and we get an error like this:
[2019-11-21 22:28:41,406] {logging_mixin.py:95} INFO - [2019-11-21 22:28:41,406] {pod_launcher.py:104} INFO - google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/moz-fx-data-shared-prod/queries/41605445-a4c7-4d63-84e3-6bfca88bd4d0?maxResults=0&location=US: Error while reading table: tmp.leanplum_sessions_v1_20191120, error message: CSV table references column position 35, but line starting at position:49201408 contains only 12 columns.
However, looking over the documentation, I'm not seeing a specific way to create an external table using CSV and allow jagged rows, only when loading data from a CSV. We could use that path, but then we also need to drop those lat/lon columns.
Another option here could be to drop those rows explicitly.
Would it be an option to preprocess the csvs to pad every row with commas? Should be feasible since the csvs aren't that big
That could work, essentially allowing jagged rows ourselves. Give it a shot.
Actually if we do a load table instead of an insert query we can allow jagged rows https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#csv-options
That seems like the simplest approach
Or just adding external_config.options.allow_jagged_rows = True
might work
The key part for an external load is to drop lat/lon. If it works as-is with jagged rows, that would be ideal
After a long investigation, turns out the issue isn't actually jagged rows: the issue is corrupt input.
Jagged rows like this work fine (for some reason, I would expect to not work unless the config is set):
a,b,c
1,2
1
1,2,3
The rows that were breaking things have a deviceModel
value of 'Unknown łB -珠ΙάLY'
. There's a \n
in there and possibly some other encoded things my editor can't read. I confirmed that removing this line fixes things.
A quick test shows that adding the allow_quoted_newlines
option to the external config fixes it but I'll run a full test through airflow tomorrow.
That's great, thanks for the investigation Ben!
The Leanplum data doesn't always fill in every row with all columns of data. If we allow for jagged row this will make all subsequent columns
NULL
.