pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Lists are considered nested for write_csv() but not unnest() #17966

Closed aofarrel closed 3 months ago

aofarrel commented 3 months ago

Reproducible example

import polars as pl

this_works = pl.DataFrame( {"BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"], 
    "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
    "assay_type": [["Illumina"], ["PacBio"], "Illumina"]
    }, strict=False)

this_works.write_csv('./tuberculosis.tsv', separator='\t', include_header=True, null_value='')

this_does_not_work = pl.DataFrame( {"BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"], 
    "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
    "assay_type": [["Illumina"], ["PacBio"], ["Illumina"]]
    }, strict=False)

# attempting to avoid the ComputeError in write_csv() by unnesting the "assay_type" column fails:
# this_does_not_work.unnest("assay_type") 
# throws "polars.exceptions.SchemaError: invalid series dtype: expected `Struct`, got `list[str]`"

# throws "polars.exceptions.ComputeError: CSV format does not support nested data"
this_does_not_work.write_csv('./tuberculosis.tsv', separator='\t', include_header=True, null_value='')

Log output

File "/Users/aofarrel/Documents/Open-Data Wrangling/v9/test_polars_issue.py", line 22, in <module>
    this_does_not_work.write_csv('./tuberculosis.tsv', separator='\t', include_header=True, null_value='')
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 2696, in write_csv
    self._df.write_csv(
polars.exceptions.ComputeError: CSV format does not support nested data

Issue description

Whether or not a list is considered "nested data" is inconsistent.

Expected behavior

Installed versions

```
--------Version info---------
Polars: 1.3.0
Index type: UInt32
Platform: macOS-13.6.7-x86_64-i386-64bit
Python: 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib: 3.8.2
nest_asyncio:
numpy: 1.25.2
openpyxl:
pandas: 1.5.3
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

deanm0000 commented 3 months ago

> write_csv() allows writing a column with types list(str), list(str), and str implies that list(str) is not nested data
> write_csv() does not allow writing a column with types list(str), list(str), and list(str)

I think that is a bug.

To your main point, unnest may be a bit of a misnomer, but it only unnests structs. It isn't intended to be used for lists (or arrays), so that's not a bug.

ritchie46 commented 3 months ago

I don't fully understand? You cannot write list data to csv? You claim that it is allowed and not allowed?

unnest unnests struct data. explode unnests list data. These are different types of nesting. I am not convinced there is a bug here.

deanm0000 commented 3 months ago

@ritchie46 I thought they were saying that a df with List(str), List(str), str columns would write to CSV, but one with List(str), List(str), List(str) wouldn't. I just tried it, and it isn't allowed in either case, so I'm really confused by:

> write_csv() allows writing a column with types list(str), list(str), and str implies that list(str) is not nested data
> write_csv() does not allow writing a column with types list(str), list(str), and list(str)

deanm0000 commented 3 months ago

I see where the issue is now.

pl.DataFrame( {"BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"], 
    "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
    "assay_type": [["Illumina"], ["PacBio"], "Illumina"]
    }, strict=False)

This doesn't create a df of list(str), list(str), and str. The run_accession and assay_type columns don't consistently contain lists, and because you pass strict=False, the mixed values are converted to strings, so you just end up with str, str, str.

In this case

pl.DataFrame( {"BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"], 
    "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
    "assay_type": [["Illumina"], ["PacBio"], ["Illumina"]]
    }, strict=False)

your assay_type is properly a list(str) and this is the column that write_csv can't deal with.

You can convert the list(str) to a str that write_csv can write if you do this first.

df.with_columns(
    (
        pl.lit("[")
        + pl.col(x).list.eval(pl.lit('"') + pl.element() + pl.lit('"')).list.join(",")
        + pl.lit("]")
    ).alias(x)
    for x, y in df.schema.items()
    if y == pl.List(pl.String)
)

you can put that in a function like

def make_csv_ready(df):
    """Render every List(String) column as a JSON-like string so write_csv accepts it."""
    return df.with_columns(
        (
            pl.lit("[")
            + pl.col(x).list.eval(pl.lit('"') + pl.element() + pl.lit('"')).list.join(",")
            + pl.lit("]")
        ).alias(x)
        for x, y in df.schema.items()
        if y == pl.List(pl.String)
    )

and then you can either use it directly as in make_csv_ready(this_does_not_work).write_csv(somefile) or you can use it with pipe as in this_does_not_work.pipe(make_csv_ready).write_csv(somefile). Hell, you could even make the function include the write_csv step so it's just one call.

side note: is there a way to make this work with pl.col(pl.List(pl.String)) instead of the generator? The name of this expression by default would be 'literal' since that's the first expression so it needs to be aliased.