Closed aofarrel closed 3 months ago
write_csv() allows writing a column with types list(str), list(str), and str, which implies that list(str) is not nested data.
write_csv() does not allow writing a column with types list(str), list(str), and list(str).
I think that is a bug.
To your main point, unnest is maybe a bit of a misnomer, but it only unnests structs. It isn't intended to be used for lists (or arrays), so that's not a bug.
I don't fully understand? You cannot write list data to csv? You claim that it is allowed and not allowed?
unnest unnests struct data. explode unnests list data. These are different types of nesting. I am not convinced there is a bug here.
@ritchie46 I thought they were saying that if you have a df that specifically has List(str), List(str), str, it would write a csv, but if it were List(str), List(str), List(str), then it wouldn't allow it. I just tried it and in both cases it doesn't allow it, so I'm really confused by:
write_csv() allows writing a column with types list(str), list(str), and str, which implies that list(str) is not nested data.
write_csv() does not allow writing a column with types list(str), list(str), and list(str).
I see where the issue is now.
    pl.DataFrame(
        {
            "BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"],
            "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
            "assay_type": [["Illumina"], ["PacBio"], "Illumina"],
        },
        strict=False,
    )
This doesn't create a df of list(str), list(str), and str. You don't consistently have a list in the run_accession or assay_type columns, AND you tell it strict=False, so it converts them into strings and you just have str, str, str.
In this case

    pl.DataFrame(
        {
            "BioSample": ["SAMEA1706673", "SAMN19657950", "SAMEA7156910"],
            "run_accession": ["ERR257886", "SRR14782548", ["ERR5864736", "ERR5864735"]],
            "assay_type": [["Illumina"], ["PacBio"], ["Illumina"]],
        },
        strict=False,
    )

your assay_type is properly a list(str), and this is the column that write_csv can't deal with.
You can convert the list(str) to a str that write_csv can write if you do this first.
    df.with_columns(
        (
            pl.lit("[")
            + pl.col(x).list.eval(pl.lit('"') + pl.element() + pl.lit('"')).list.join(",")
            + pl.lit("]")
        ).alias(x)
        for x, y in df.schema.items()
        if y == pl.List(pl.String)
    )
You can put that in a function like

    def make_csv_ready(df):
        return df.with_columns(
            (
                pl.lit("[")
                + pl.col(x).list.eval(pl.lit('"') + pl.element() + pl.lit('"')).list.join(",")
                + pl.lit("]")
            ).alias(x)
            for x, y in df.schema.items()
            if y == pl.List(pl.String)
        )
and then you can either use it directly, as in make_csv_ready(this_does_not_work).write_csv(somefile), or you can use it with pipe, as in this_does_not_work.pipe(make_csv_ready).write_csv(somefile). Hell, you could even make the function include the write_csv step so it's just one call.
Side note: is there a way to make this work with pl.col(pl.List(pl.String)) instead of the generator? The name of this expression by default would be 'literal', since that's the first expression, so it needs to be aliased.
Issue description
Whether or not a list is considered "nested data" is inconsistent.