Lists are considered nested for write_csv() but not unnest()

aofarrel commented 3 months ago

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

this_works = pl.DataFrame( {"BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"], 
    "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
    "assay_type": [["Illumina"], ["PacBio"], "Illumina"]
    }, strict=False)

this_works.write_csv('./tuberculosis.tsv', separator='\t', include_header=True, null_value='')

this_does_not_work = pl.DataFrame( {"BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"], 
    "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
    "assay_type": [["Illumina"], ["PacBio"], ["Illumina"]]
    }, strict=False)

# attempting to avoid the ComputeError in write_csv() by unnesting the "assay_type" column fails:
# this_does_not_work.unnest("assay_type") 
# throws "polars.exceptions.SchemaError: invalid series dtype: expected `Struct`, got `list[str]`"

# throws "polars.exceptions.ComputeError: CSV format does not support nested data"
this_does_not_work.write_csv('./tuberculosis.tsv', separator='\t', include_header=True, null_value='')

Log output

File "/Users/aofarrel/Documents/Open-Data Wrangling/v9/test_polars_issue.py", line 22, in <module>
    this_does_not_work.write_csv('./tuberculosis.tsv', separator='\t', include_header=True, null_value='')
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 2696, in write_csv
    self._df.write_csv(
polars.exceptions.ComputeError: CSV format does not support nested data

Issue description

Whether or not a list is considered "nested data" is inconsistent.

write_csv() allows writing a column with types list(str), list(str), and str
- implies that list(str) is not nested data
write_csv() does not allow writing a column with types list(str), list(str), and list(str)
unnest() does not work on a column with types list(str), list(str), and list(str)
- implies that list(str) is not nested data, until you try to write_csv() it, then it's nested?

Expected behavior

If a column is considered to have nested data and therefore cannot be written to a CSV, unnest() should work on that column
write_csv() will write a column with mixed types even if one of those types is a list. Therefore, write_csv() should allow writing a column that is entirely lists. For comparison, pandas allows this.
- Alternative: write_csv() only supports writing lists (in mixed-type and homogeneous columns) if provided a non-comma separator value like '\t'

Installed versions

```
--------Version info---------
Polars:               1.3.0
Index type:           UInt32
Platform:             macOS-13.6.7-x86_64-i386-64bit
Python:               3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:           3.8.2
nest_asyncio:
numpy:                1.25.2
openpyxl:
pandas:               1.5.3
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

deanm0000 commented 3 months ago

> write_csv() allows writing a column with types list(str), list(str), and str
> - implies that list(str) is not nested data
>
> write_csv() does not allow writing a column with types list(str), list(str), and list(str)

I think that is a bug.

To your main point, unnest may be a bit of a misnomer, but it only unnests structs. It isn't intended to be used for lists (or arrays), so that's not a bug.

ritchie46 commented 3 months ago

I don't fully understand? You cannot write list data to csv? You claim that it is allowed and not allowed?

unnest unnests struct data. explode unnests list data. These are different types of nesting. I am not convinced there is a bug here.

deanm0000 commented 3 months ago

@ritchie46 I thought they were saying that a df that specifically has List(str), List(str), str would write to a csv, but one with List(str), List(str), List(str) wouldn't. I just tried it, and write_csv doesn't allow it in either case, so I'm really confused by:

> write_csv() allows writing a column with types list(str), list(str), and str
> - implies that list(str) is not nested data
>
> write_csv() does not allow writing a column with types list(str), list(str), and list(str)

deanm0000 commented 3 months ago

I see where the issue is now.

pl.DataFrame( {"BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"], 
    "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
    "assay_type": [["Illumina"], ["PacBio"], "Illumina"]
    }, strict=False)

This doesn't create a df of list(str), list(str), and str. The run_accession and assay_type columns don't consistently hold lists, and because you pass strict=False, Polars converts the mixed values to strings, so you just have str, str, str.

In this case

pl.DataFrame( {"BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"], 
    "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
    "assay_type": [["Illumina"], ["PacBio"], ["Illumina"]]
    }, strict=False)

your assay_type is properly a list(str) and this is the column that write_csv can't deal with.

You can convert the list(str) to a str that write_csv can write if you do this first.

df.with_columns(
    (
        pl.lit("[")
        + pl.col(x).list.eval(pl.lit('"') + pl.element() + pl.lit('"')).list.join(",")
        + pl.lit("]")
    ).alias(x)
    for x, y in df.schema.items()
    if y == pl.List(pl.String)
)

you can put that in a function like

def make_csv_ready(df):
    return df.with_columns(
        (
            pl.lit("[")
            + pl.col(x).list.eval(pl.lit('"') + pl.element() + pl.lit('"')).list.join(",")
            + pl.lit("]")
        ).alias(x)
        for x, y in df.schema.items()
        if y == pl.List(pl.String)
    )

and then you can either use it directly as in make_csv_ready(this_does_not_work).write_csv(somefile) or you can use it with pipe as in this_does_not_work.pipe(make_csv_ready).write_csv(somefile). Hell, you could even make the function include the write_csv step so it's just one call.

side note: is there a way to make this work with pl.col(pl.List(pl.String)) instead of the generator? The name of this expression by default would be 'literal' since that's the first expression so it needs to be aliased.
