ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.31k stars 598 forks source link

feat: rename `to_parquet` to `write_parquet` #6909

Open bingbong-sempai opened 1 year ago

bingbong-sempai commented 1 year ago

Is your feature request related to a problem?

No response

Describe the solution you'd like

I know pairing read and to functions follows Pandas' style but I think read and write makes more sense (it's also what polars uses). For example, read_parquet and write_parquet for reading/writing to disk, from_pandas and to_pandas for reading/writing to memory.

What version of ibis are you running?

6.1.0

What backend(s) are you using, if any?

No response

Code of Conduct

lostmygithubaccount commented 1 year ago

my personal opinion is not strictly opposed. though I don't like having a lot of aliases for methods, perhaps this is a good case (and thus wouldn't need to be a breaking change)

bingbong-sempai commented 1 year ago

yeah i agree that too many aliases is a bad thing. it's just that pairing read_ and write_ (same with from_ and to_) feels like the "correct" thing to do.

gforsyth commented 1 year ago

I agree that there's a nice consistency in using read_ and write_ for file-based things and from_ and to_ for memory-based things. It's also going to break a lot of users. I'm not sure that's worth it. It also potentially leaves users thinking that we don't support writing to various filetypes because they're expecting it under to_ and not tab-completing out to write_.

There's also the issue that depending on the backend, we're sometimes uploading a file to, say, Snowflake, which isn't a cheap operation, vs. creating a view based only on parquet metadata, with DuckDB, which IS cheap.

Polars also has a scan_parquet method to distinguish this kind of behavior, but if we follow that convention we'd end up with a different API for a given backend, because one would only support read_ and the other would only support scan_.

Overall, I'm inclined to leave things as they are.

I am a -1 on adding an alias.

bingbong-sempai commented 1 year ago

that's unfortunate, i was hoping ibis would be able to make API changes that pandas couldn't

bingbong-sempai commented 1 year ago

an option can be to alias read_ to from_ instead?
from_ can cover both read and scan operations.
that way io could pair from_ and to_.

gforsyth commented 1 year ago

Hey @bingbong-sempai -- I think we're going to stick with our current convention. Most users are probably coming from pandas and there are other libraries that follow the convention. There's also a benefit in seeing all of the targets for commands like read_ in one block when using tab-completion. Hopefully you find other parts of Ibis to enjoy and this API choice isn't a blocker for you.

Going to close this out -- thanks!

lostmygithubaccount commented 1 year ago

just to add on a bit -- write makes sense for files (CSV, parquet) but less sense for other things Ibis can output (PyArrow tables, PyArrow batches, pandas, torch, etc.)

bingbong-sempai commented 6 months ago

Just to add, both Polars and DuckDB use write_parquet and not to_parquet.

lostmygithubaccount commented 6 months ago

not sure how we feel about it in general as the Ibis project, but I'm personally not opposed (similar to my earlier comment) to adding aliases for common spellings of existing APIs -- e.g. a write_parquet method that perhaps we don't even put on the website (or do and note it's just an alias of to_parquet)

I'll reopen this so it shows in our issue triage and we can discuss

kszucs commented 6 months ago

My preference is to use write_<parquet|csv|...> which triggers computation. I would like to have lazy counterparts eventually which could use to_<format> prefix but no strong opinions there. I think methods triggering actions should be verbs.

jcrist commented 6 months ago

I would like to have lazy counterparts eventually which could use to_<format> prefix but no strong opinions there.

Note that we already have to_pandas/to_pyarrow/to_polars which run eagerly and return an in memory result. I think those names are good, but mixing to_* methods that don't run eagerly in with them would be confusing IMO.

I'm not opposed to a convention of:

That said, I don't find the current convention of putting both of these under to_* bad. A user can always do expr.to_<TAB> to see all the things they can convert an expression into.

The only thing I'm against is having any aliases where both expr.to_parquet and expr.write_parquet exist and do the same thing. As a user I would find that confusing and would spend a bunch of time trying to figure out if/how the expr.to_* and expr.write_* methods differed. (to be clear, short lived aliases for deprecation/migration are fine).