ENH: allow writing series to parquet file

pandas-dev / pandas

ENH: allow writing series to parquet file #54638

Open lcrmorin opened 1 year ago

lcrmorin commented 1 year ago

Feature Type

[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

Currently the .to_parquet() method only works for DataFrames. It would be nicer if the method also worked on Series. At the moment we either have to save the Series to another format or wrap it with pd.DataFrame(series), which seems a bit clunky.

Feature Description

For a given pandas Series, being able to call Series.to_parquet().

Alternative Solutions

Currently the two alternatives are:

Save to another format, which is a bit convoluted as we then have to deal with multiple formats.
Convert the Series to a DataFrame to use the DataFrame method.

Additional Context

No response

rhshadrach commented 1 year ago

If we added a Series.to_parquet, I think users would expect to be able to round trip back to Series. I'm not sure but I don't think that's possible.

I personally use ser.to_frame(name).to_parquet(...).
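A minimal sketch of that workaround, including one way to read the file back into a Series. The helper names `write_series` and `read_series` are hypothetical and not part of pandas. The sketch assumes a Parquet engine such as pyarrow is installed, and it passes the column name explicitly so that the intermediate frame has a string column label.

```python
import pandas as pd


def write_series(ser: pd.Series, path: str, name: str = "values") -> None:
    """Hypothetical helper: write a Series as a one-column Parquet file."""
    # to_frame(name) wraps the Series in a DataFrame with a single column.
    ser.to_frame(name).to_parquet(path)


def read_series(path: str) -> pd.Series:
    """Hypothetical helper: read a one-column Parquet file back as a Series."""
    frame = pd.read_parquet(path)
    if frame.shape[1] != 1:
        raise ValueError("expected exactly one column")
    # Select the only column; the result is a Series named after that column.
    return frame.iloc[:, 0]
```

Calling `write_series(ser, "ser.parquet", name="x")` followed by `read_series("ser.parquet")` should give back a Series named `x`. The extra selection step on read is needed because the file itself only describes a table.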

cc @jorisvandenbossche

jorisvandenbossche commented 1 year ago

> If we added a Series.to_parquet, I think users would expect to be able to round trip back to Series.

We have other IO methods on Series that don't necessarily give you that guarantee. For example, when reading the result of Series.to_csv with pd.read_csv, you will also get a DataFrame, I think.

So from that point of view, I would personally be fine with such a non-perfect roundtripping behaviour for Series.to_parquet as well.
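The CSV behaviour described above can be checked with a short sketch. It uses an in-memory buffer instead of a file; the Series contents and the name `values` are made up for illustration.

```python
import io

import pandas as pd

ser = pd.Series([1, 2, 3], name="values")

# Write the Series to an in-memory CSV buffer (index and header included).
buffer = io.StringIO()
ser.to_csv(buffer)
buffer.seek(0)

# read_csv has no notion of a Series, so it returns a DataFrame.
frame = pd.read_csv(buffer, index_col=0)
print(type(frame).__name__)  # DataFrame

# An explicit step is needed to get a Series back.
restored = frame["values"]
print(type(restored).__name__)  # Series
```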

The question is whether we want to add all of our IO methods to Series as well in general, or not (given that the workaround is quite easy). It seems we are now a bit inconsistent.

sammcbeth commented 1 year ago

take

sammcbeth commented 1 year ago

Assigning this to myself, as it seems like a good first issue for me given that I use pandas with parquet files regularly. There still seems to be some ongoing discussion about whether this is appropriate, so I'll keep an eye out in case people decide it is no longer needed.

sammcbeth commented 1 year ago

@jorisvandenbossche this will need much more testing, but I got it working locally and wanted to get some initial validation on the idea: https://github.com/pandas-dev/pandas/pull/54675/files

Alternatively, we could do what Series.to_markdown() does here and simply cast the Series to a frame and use the frame's methods. I figured this wasn't as clean or as easy to write unit tests for. Let me know whether I have the right idea above whenever you have a chance. Thanks!
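For reference, the cast-and-delegate alternative might look roughly like the sketch below. `series_to_parquet` is a hypothetical free function used only for illustration; it is not the code in the linked PR. It assumes the Series has, or is given, a name to use as the column label.

```python
import pandas as pd


def series_to_parquet(ser: pd.Series, path, name=None, **kwargs):
    """Hypothetical sketch: delegate to DataFrame.to_parquet via to_frame()."""
    # Fall back to the Series' own name when no column name is supplied.
    column = name if name is not None else ser.name
    if column is None:
        raise ValueError("a column name is required to write the Series")
    # All keyword arguments are forwarded unchanged to the DataFrame method.
    return ser.to_frame(column).to_parquet(path, **kwargs)
```

A wrapper of this shape keeps all Parquet-specific logic in the DataFrame method, so the Series side only has to be tested for the conversion and the argument forwarding.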

rhshadrach commented 1 year ago

> We have other IO methods on Series that don't necessarily give you that guarantee. For example, when reading the result of Series.to_csv with pd.read_csv, you will also get a DataFrame, I think.
>
> So from that point of view, I would personally be fine with such a non-perfect roundtripping behaviour for Series.to_parquet as well.

I expect a lot more out of parquet than I do CSV/JSON/Excel, in particular round tripping with dtypes. I'm not so convinced that a comparison to CSV is warranted.
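To illustrate the dtype point on the CSV side: CSV is plain text and carries no dtype metadata, so a non-default dtype is lost on the way back. The sketch below uses a made-up categorical Series and an in-memory buffer.

```python
import io

import pandas as pd

ser = pd.Series(["a", "b", "a"], dtype="category", name="label")

buffer = io.StringIO()
ser.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)["label"]

# The values survive, but the categorical dtype does not.
print(restored.tolist() == ser.tolist())  # True
print(isinstance(ser.dtype, pd.CategoricalDtype))  # True
print(isinstance(restored.dtype, pd.CategoricalDtype))  # False
```

A Parquet file stores a schema alongside the data, which is why users tend to expect more from a Parquet round trip than from a CSV one.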

Do all IO methods round-trip back as a DataFrame? If that's the case, then I don't think it's worth the maintenance burden to have these methods on Series when they are just a .to_frame() call away. But if there is good reason to keep some of them, then I can see the value that having them all on Series brings for a consistent API.