Closed (lmmx closed this 2 weeks ago)
I've confirmed that NDJSON works where Parquet does not for nested dtype fields: use `write_ndjson` instead of `write_parquet`, and `StringIO` instead of `BytesIO`:
```python
from google.cloud import bigquery

# `stream` is a StringIO containing the DataFrame serialised with
# write_ndjson (see the sketch below for how it can be built)
job = bq_client.load_table_from_file(
    stream,
    destination=dst_table,
    project=bq_project_id,
    job_config=bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("id", "INT64"),
            bigquery.SchemaField("linked_ids", "INT64", mode="REPEATED"),
        ],
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ),
)
```
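For context, here is a minimal sketch of how the `stream` above can be built; the `df`, `bq_client`, `dst_table`, and `bq_project_id` values are placeholders of my own, not from the original report:

```python
import io

import polars as pl
from google.cloud import bigquery

# hypothetical frame with a list[int] column
df = pl.DataFrame({"id": [1, 2], "linked_ids": [[10, 11], [12]]})

# write_ndjson() with no target returns the NDJSON text as a string
# (recent Polars versions), so it can be wrapped in a StringIO
stream = io.StringIO(df.write_ndjson())

bq_client = bigquery.Client()                 # assumes default credentials
dst_table = "my_project.my_dataset.my_table"  # placeholder table id
bq_project_id = "my_project"                  # placeholder project id
```

The `load_table_from_file` call above then loads `stream` with the explicit REPEATED schema.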
Are you aware of any library that can produce a Parquet file that you can load into BigQuery and have it load the way you expect?
If the answer is no, then there's nothing for Polars to do and it's something Google needs to fix.
If yes, then please supply such an example.
I believe this bug is on the BigQuery side so I have raised it there - https://github.com/googleapis/python-bigquery/issues/2008
I've reproduced the issue with Parquet files written by Polars and PyArrow - so nothing appears to be amiss on the Polars side.
@lmmx - I appreciate that it is frustrating for the method suggested in the user guide not to work, but in this case there is nothing Polars can do about it. Please feel free to thumbs up or comment on the BQ issue I raised.
@lmmx @deanm0000 @alexander-beedie - I believe this can now be closed if one of you could please do so. There is nothing for Polars to do here. Details in previous comment.
Checks
Reproducible example
If I write a `list[int]` type field (let's say `linked_ids`) from Polars to BQ with the example code, what you get is a record named "linked_ids", within which is a record named "list", within which is a repeated integer field named "item"!

I can of course create this manually: create a BigQuery table with an integer column and an array-of-integers column, and write to it.
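For reference, a minimal sketch of the Parquet upload path being described (my own reconstruction, not the verbatim example code; table and project names are placeholders):

```python
import io

import polars as pl
from google.cloud import bigquery

df = pl.DataFrame({"id": [1, 2], "linked_ids": [[10, 11], [12]]})

buf = io.BytesIO()
df.write_parquet(buf)  # the list column is written as a nested Parquet group
buf.seek(0)

client = bigquery.Client()  # assumes default credentials
job = client.load_table_from_file(
    buf,
    "my_project.my_dataset.my_table",  # placeholder destination
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
    ),
)
job.result()
# The loaded table shows linked_ids as a record containing a record "list"
# containing a repeated "item" field, rather than ARRAY<INT64>.
```

The manually created table mentioned above would instead declare `linked_ids` as `ARRAY<INT64>`.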
Log output
No response
Issue description
I can see that this comes from `pyarrow.parquet.write_table`, but this causes a major discontinuity between Polars and BigQuery. It's well known that list field types are one of Polars' main USPs, so this is an equally major hurdle to encounter when uploading to a database that is compatible with array-type fields.
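To illustrate where the nesting comes from, here's a small sketch of my own (assuming pyarrow is installed) that prints the physical Parquet schema produced for a list column; BigQuery surfaces this group structure as the nested record described above:

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table(
    {"linked_ids": pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))}
)

buf = io.BytesIO()
pq.write_table(table, buf)
buf.seek(0)

# Parquet stores lists with the standard three-level LIST encoding:
# an outer group ("linked_ids") containing a repeated group ("list")
# containing the value field ("item"/"element", depending on pyarrow version).
print(pq.ParquetFile(buf).schema)
```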
Expected behavior
I'd like some way to upload list-type columns, and am unclear what the best approach is here.

To my understanding, the point of streaming Parquet is compression and automatic schema application; here the schema is essentially being lost in transit.

Perhaps the right approach is to go via JSON [without schema interference]? :thinking:

I appreciate this is in a dependency, but it's also the suggested way to use this library, so I think it falls within the remit of Polars development to consider how to make it work. I hope it's not seen as out of place to raise it in this issue tracker; for me it's a major usability concern.
Installed versions