pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

BigQuery integration incorrect for list type columns #16938

Closed lmmx closed 2 weeks ago

lmmx commented 3 months ago


Reproducible example

If I write a `list[int]`-typed column (say, `linked_ids`) from Polars to BigQuery with the example code

```python
import io

from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials
# df is the Polars DataFrame containing the list[int] column

# Write DataFrame to stream as parquet file; does not hit disk
with io.BytesIO() as stream:
    df.write_parquet(stream)
    stream.seek(0)
    job = client.load_table_from_file(
        stream,
        destination='tablename',
        project='projectname',
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
        ),
    )
job.result()  # Waits for the job to complete
```

What you get is a record named "linked_ids" within which is a record named "list" within which is a repeated integer field named "item"!

[Screenshot from 2024-06-13 15-49-11: BigQuery schema showing `linked_ids` as a nested RECORD containing `list.item`]

I can, of course, create the intended schema manually: a BigQuery table with an integer column and an array-of-integers column

```sql
CREATE TABLE `your-project.your_dataset.LouisExample` (
    id INT64,
    linked_ids ARRAY<INT64>
);
```

and write to it

```sql
INSERT INTO `your-project.your_dataset.LouisExample` (id, linked_ids)
VALUES
    (1, [101, 102, 103]),
    (2, [201, 202]),
    (3, [301, 302, 303, 304]);
```

Log output

No response

Issue description

I can see that this comes from `pyarrow.parquet.write_table`, but it causes a major discontinuity between Polars and BigQuery.

List dtypes are well known as one of Polars' main selling points, so this is an equally major hurdle to hit when uploading to a database that supports array-type fields.

Expected behavior

I'd like some way to upload list-type columns, and I'm unclear on the best approach here.

To my understanding, the point of streaming Parquet is compression and automatic schema application; here the schema is essentially being lost in transit.

Perhaps the right approach is to go via JSON [without schema interference]? :thinking:

I appreciate this originates in a dependency, but since it's also the suggested way to use this library, I think it falls within the remit of Polars development to consider how to make it work. I hope it's not seen as out of place to raise it in this issue tracker; for me it's a major usability concern.

Installed versions

```
--------Version info---------
Polars: 0.19.16
Index type: UInt32
Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.31
Python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
matplotlib: 3.5.2
numpy: 1.24.3
openpyxl:
pandas: 1.5.1
pyarrow: 9.0.0
pydantic: 2.5.3
pyiceberg:
pyxlsb:
sqlalchemy: 1.4.49
xlsx2csv:
xlsxwriter:
```
lmmx commented 3 months ago

I've confirmed that JSON works instead of Parquet for nested dtype fields:

```python
# `stream` holds newline-delimited JSON, e.g. from df.write_ndjson(stream)
job = bq_client.load_table_from_file(
    stream,
    destination=dst_table,
    project=bq_project_id,
    job_config=bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("id", "INT64"),
            bigquery.SchemaField("linked_ids", "INT64", mode="REPEATED"),
        ],
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ),
)
```
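For completeness, here is a dependency-free sketch (stdlib only; the row values are illustrative) of the NDJSON payload shape that schema expects: one JSON object per line, with REPEATED fields as plain JSON arrays. Polars can produce this format directly via `DataFrame.write_ndjson`.

```python
import io
import json

# Illustrative rows matching the schema above
rows = [
    {"id": 1, "linked_ids": [101, 102, 103]},
    {"id": 2, "linked_ids": [201, 202]},
]

# Serialize as NDJSON: one JSON object per line
stream = io.BytesIO()
for row in rows:
    stream.write((json.dumps(row) + "\n").encode("utf-8"))
stream.seek(0)

# `stream` can now be passed to load_table_from_file as shown above
print(stream.getvalue().decode("utf-8").splitlines()[0])
```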
deanm0000 commented 3 months ago

Are you aware of any library that can produce a Parquet file that loads into BigQuery the way you expect?

If the answer is no, then there's nothing for Polars to do and it's something Google needs to fix.

If yes, then please supply such an example.

henryharbeck commented 2 weeks ago

I believe this bug is on the BigQuery side so I have raised it there - https://github.com/googleapis/python-bigquery/issues/2008

I've reproduced the issue with Parquet files written by both Polars and PyArrow, so nothing seems to be amiss on the Polars side.

@lmmx - I appreciate that it is frustrating to not have the method suggested in the user guide work, but in this case there is nothing Polars can do about it. Please feel free to thumbs-up or comment on the BQ issue I raised.

henryharbeck commented 2 weeks ago

@lmmx @deanm0000 @alexander-beedie - I believe this can now be closed if one of you could please do so. There is nothing for Polars to do here. Details in previous comment.