Keep (un-deprecate) `read_table()` for API stability

st-bender commented 5 years ago

Problem description

read_table() got deprecated in favour of using read_csv().

Using read_csv() to read tab/space delimited files is counter-intuitive. According to the docs and the related issues, both share the same code and it is not clear why the one function should be preferred over the other, and that change may even break existing code.

In my point of view read_table() is the more general function and read_csv() is a special case. Why would you deprecate (and then remove) the more useful function? It is already annoying to use to_csv() to write space/tab delimited files. And as I can see it, it comes down to two lines of code.

Proposed solution

Keep both functions as they are (un-deprecate read_table()) or rename the function to have a more general name like read_txt() (as in numpy.genfromtxt()) or similar.

jreback commented 5 years ago

pls reference the issue where this was deprecated

also you might want to highlite and rebut some arguments there

st-bender commented 5 years ago

Thanks, relevant issues are: #18262, #21948 and PR #21954 Tried to add some discussion there, but so far no reaction that's why I opened a new issue.

jreback commented 5 years ago

my point is to summarize the arguments in this issue (for and against)

st-bender commented 5 years ago

Hi, Thank you for your input and setting the labels and sorry for getting back to you so late. I am not really sure that I understood correctly, my point is that #18262 only lists a number of API functions to be deprecated, i.e. to be removed in the future, without further reasoning. In my opinion any change to the official and documented API should be justified. After all, pandas calls itself stable.

Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.

My points pro keeping read_table():

First of all to keep the API stable and to not break existing code.
Keep the function with the more general name (IMHO) over the special case, i.e. read_table() to read ascii files with any delimiter/separator over read_csv() with the connotation of a fixed delimiter, the comma. Renaming that function to read_txt() may be worthwhile considering, although it contradicts the first point above.
Two lines of additional code, please correct me if I am mistaken.

Contra (as far as I can see):

Two lines of extra code.

tpall commented 5 years ago

Using read_csv() to import files with any delimiter seems confusing. Regarding naming, read_delim() (as in R readr package) is good alternative to read_table() or read_txt().

TomAugspurger commented 5 years ago

Keep the function with the more general name (IMHO) over the special case

Keep in mind that the stdlib provides this under the csv module. Shouldn't we be consistent with the stdlib? If you're coming from a Python backgound, wouldn't you look there first?

Two lines of extra code.

All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

st-bender commented 5 years ago

Keep the function with the more general name (IMHO) over the special case

Keep in mind that the stdlib provides this under the csv module. Shouldn't we be consistent with the stdlib? If you're coming from a Python backgound, wouldn't you look there first?

Good point, I was actually used to numpy's genfromtxt(), so I was looking for something like that. I am not used to csv (files and module) that much. So in my case that's probably why I missed that connection.

Two lines of extra code.

All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

Fair enough, and I would agree as a developer. However, you could still refer one to the other and state that they are kept for backwards compatibility. Keep in mind that any part of a stable API that is removed will break code. Not everybody may keep track of changes in the underlying libraries and modules. Imaging someone updates his or her setup at some point in the future for bugfixes and new or improved features. Then he or she notices that one or more of the packages in the setup uses that function (or any other that you are going to remove) which then breaks that package and in turn his own code. Those bugs are nasty to track down, and may lead to people not updating packages or rolling back updates.

TomAugspurger commented 5 years ago

This is just deprecated right now, and will be for some time, to give the community time to catch up.

Pandas is still evolving rapidly enough that I'd be pleasantly surprised if read_table being dropped was the hardest part of upgrading between two distant versions.

On Wed, Mar 6, 2019 at 3:12 PM Stefan Bender notifications@github.com wrote:

Keep the function with the more general name (IMHO) over the special case

Keep in mind that the stdlib provides this under the csv module. Shouldn't we be consistent with the stdlib? If you're coming from a Python backgound, wouldn't you look there first?

Good point, I was actually used to numpy's genfromtxt(), so I was looking for something like that. I am not used to csv (files and module) that much. So in my case that's probably why I missed that connection.

Two lines of extra code.

All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

Fair enough, and I would agree as a developer. However, you could still refer one to the other and state that they are kept for backwards compatibility. Keep in mind that any part of a stable API that is removed will break code. Not everybody may keep track of changes in the underlying libraries and modules. Imaging someone updates his or her setup at some point in the future for bugfixes and new or improved features. Then he or she notices that one or more of the packages in the setup uses that function (or any other that you are going to remove) which then breaks that package and in turn his own code. Those bugs are nasty to track down, and may lead to people not updating packages or rolling back updates.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25220#issuecomment-470278569, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHInsAqlRmYxqEv8mLZ9jr11yxUBn-ks5vUC8_gaJpZM4aqRXG .

WillAyd commented 5 years ago

I know we've had some discussions around read_table which is fine to keep ongoing, but I would be a HUGE -1 on this. I would be inclined to believe that the VAST majority of users use read_csv instead of read_table so doing this would be a nightmare for the sake of semantics.

Here's the top 5 search results for me for "pandas tutorial" excluding pandas own documentation (this is by no means an endorsement of any of these just sharing as reference):

All use read_csv, none use read_table. Again I understand the merits of cleaning up semantics here, but from a pure usage analysis and effort trade-off I would be very against doing this

st-bender commented 5 years ago

This is just deprecated right now, and will be for some time, to give the community time to catch up.

Isn't the whole point to remove it in the future? Unless I misunderstood the point of deprecating something. So anybody who missed the deprecation warning for whatever reason will be surprised.

Pandas is still evolving rapidly enough that I'd be pleasantly surprised if read_table being dropped was the hardest part of upgrading between two distant versions.

I understand that, but I am tempted to take that bet. I was actually not talking only about read_table() but also about all those other functions that got deprecated recently. Apparently we have different opinions on 'Development Status :: 5 - Production/Stable' For me it means that it's save to use and will be compatible for a long time into the future without breaking things. I am thinking about 10 to 20 years or even more.

st-bender commented 5 years ago

I know we've had some discussions around read_table which is fine to keep ongoing, but I would be a HUGE -1 on this. I would be inclined to believe that the VAST majority of users use read_csv instead of read_table so doing this would be a nightmare for the sake of semantics.

Here's the top 5 search results for me for "pandas tutorial" excluding pandas own documentation (this is by no means an endorsement of any of these just sharing as reference):

https://www.dataquest.io/blog/pandas-python-tutorial/

https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python

https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

https://data-flair.training/blogs/pandas-tutorial/

https://www.learnpython.org/en/Pandas_Basics

All use read_csv, none use read_table. Again I understand the merits of cleaning up semantics here, but from a pure usage analysis and effort trade-off I would be very against doing this

They use read_csv for reading .csv which is what it is designed for and named after. I am talking about reading space/tab delimited files, for which read_csv() is misnamed and not the first place I would look for (and didn't look). The title was meant to provoke discussion, otherwise you may not have stepped in. ;) The only issue I actually have is changing the API in a backwards-incompatible way, i.e. removing an actually used function from it, which will break other people's code. Granted it may not be the vast majority but it's not negligible either. Yet on the other hand you introduce new interfaces like read_fmf() which does essentially the same, all three cases are handled by numpy's genfromtxt().

But I realize that we are fighting over two lines of code instead of fixing the almost 3000 real bugs. And I can't help you with that unless you really keep your API stable and backwards compatible.

TomAugspurger commented 5 years ago

Thanks for the feedback. I think we have different opinions on this, which is fine.

On Thu, Mar 7, 2019 at 3:41 AM Stefan Bender notifications@github.com wrote:

I know we've had some discussions around read_table which is fine to keep ongoing, but I would be a HUGE -1 on this. I would be inclined to believe that the VAST majority of users use read_csv instead of read_table so doing this would be a nightmare for the sake of semantics.

Here's the top 5 search results for me for "pandas tutorial" excluding pandas own documentation (this is by no means an endorsement of any of these just sharing as reference):

https://www.dataquest.io/blog/pandas-python-tutorial/

https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python

https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

https://data-flair.training/blogs/pandas-tutorial/

https://www.learnpython.org/en/Pandas_Basics

All use read_csv, none use read_table. Again I understand the merits of cleaning up semantics here, but from a pure usage analysis and effort trade-off I would be very against doing this

They use read_csv for reading .csv which is what it is designed for and named after. I am talking about reading space/tab delimited files, for which read_csv() is misnamed and not the first place I would look for (and didn't look). The title was meant to provoke discussion, otherwise you may not have stepped in. ;) The only issue I actually have is changing the API in a backwards-incompatible way, i.e. removing an actually used function from it, which will break other people's code. Granted it may not be the vast majority but it's not negligible either. Yet on the other hand you introduce new interfaces like read_fmf() which does essentially the same, all three cases are handled by numpy's genfromtxt().

But I realize that we are fighting over two lines of code instead of fixing the almost 3000 real bugs. And I can't help you with that unless you really keep your API stable and backwards compatible.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25220#issuecomment-470455510, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHItOvF37K6ZXbwSjbKjUTsOBpoPQ2ks5vUN7egaJpZM4aqRXG .

jimmywan commented 5 years ago

All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

This makes sense if the aliases are all superfluous and the base function has a reasonable name. You can't reasonably make that argument here.

This is just deprecated right now, and will be for some time, to give the community time to catch up.

So deprecate read_csv and give them a healthy window to catch up. Giving general functions specific names is bad design. If you care at all about having a clean API, the decision to choose read_csv over read_table was absolutely backwards.

I would be inclined to believe that the VAST majority of users use read_csv instead of read_table so doing this would be a nightmare for the sake of semantics.

Arguments like those are akin to saying progress should never be made for the sake of stability. At the very least read_table should be un-deprecated.

TomAugspurger commented 5 years ago

You can't reasonably make that argument here.

Why do you say that? The stdlib uses the "csv" module for reading delimited files.

On Thu, Mar 7, 2019 at 3:17 PM Jimmy Wan notifications@github.com wrote:

All else equal, we prefer not to keep around aliases. They tend to make the library harder to learn and use.

This makes sense if the aliases are all superfluous and the base function has a reasonable name. You can't reasonably make that argument here.

This is just deprecated right now, and will be for some time, to give the community time to catch up.

So deprecate read_csv and give them a healthy window to catch up. Giving general functions specific names is bad design. If you care at all about having a clean API, the decision to choose read_csv over read_table was absolutely backwards.

I would be inclined to believe that the VAST majority of users use read_csv instead of read_table so doing this would be a nightmare for the sake of semantics.

Arguments like those are akin to saying progress should never be made for the sake of stability. At the very least read_table should be un-deprecated.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/25220#issuecomment-470697994, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIgLoOEgvvGDo8_M_Xkxk_WviesImks5vUYIAgaJpZM4aqRXG .

WillAyd commented 5 years ago

Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.

The title was meant to provoke discussion, otherwise you may not have stepped in. ;)

I understand there are some frustrations with the deprecation of read_table and that conversation can certainly continue. However, if the request to deprecate read_csv is frivolous than I would at least ask that you update the title to reflect what you are actually asking for (ex: "Undeprecate read_table") instead of what it currently is, as otherwise we'll continue down tangential paths that aren't helpful regardless of anyone's viewpoint on read_table.

But I realize that we are fighting over two lines of code instead of fixing the almost 3000 real bugs. And I can't help you with that unless you really keep your API stable and backwards compatible.

Pandas is an open source tool built off of volunteer hours. We would gladly welcome contributions from you to improve the tool. Otherwise I'd just like to call out that incendiary and insulting comments don't contribute anything to the tool or the community surrounding it.

jimmywan commented 5 years ago

Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.

I understand there are some frustrations with the deprecation of read_table and that conversation can certainly continue. However, if the request to deprecate read_csv is frivolous than I would at least ask that you update the title to reflect what you are actually asking for (ex: "Undeprecate read_table") instead of what it currently is, as otherwise we'll continue down tangential paths that aren't helpful regardless of anyone's viewpoint on read_table.

I'm not the OP, but I think some of us are trying to say that deprecating read_csv in favor of read_table is defensible in terms of API clarity/readability with the side effect of some end-user pain. Deprecating read_table in favor of read_csv just seems misguided and wrong.

You can't reasonably make that argument here.

Why do you say that? The stdlib uses the "csv" module for reading delimited files.

That's more of a historical artifact and/or implementation detail. It's highly unlikely that a pandas end user knows and/or cares about it. No reason to propagate that decision any further.

Sidenote: The API choices by readr from the tidyverse to make read_csv/read_tsv convenience aliases for the more general read_delim ("delimited") seems like a more consistent choice. https://readr.tidyverse.org/reference/read_delim.html

TomAugspurger commented 5 years ago

On Mar 7, 2019, at 16:54, Jimmy Wan notifications@github.com wrote:

That's more of a historical artifact and/or implementation detail. It's highly unlikely that a pandas end user knows and/or cares about it.

Do you have a source for that?

jimmywan commented 5 years ago

That's more of a historical artifact and/or implementation detail. It's highly unlikely that a pandas end user knows and/or cares about it.

Do you have a source for that?

I'm not sure that I understand your question or the motivation behind it, but...

The original PEP was written > 15 years ago and it's fairly clear that the specific problem they were trying to solve is handling all of the annoying corner cases involved with a poorly defined file format that is ill conceived. Using a delimiter that shows up all over the place in every day life and real world content is just begging to generate corner cases, and that it has, with all of the crazy rules you have to put in place to deal with commas or single/double quotes. There is a passing reference to trying to make it be something slightly more general in support of Excel and excel-generated content, because that's been a tool of choice of real world users for decades. There's really no mention of trying to specifically "name" it to reflect being a general purpose parser building block or anything of that sort. https://www.python.org/dev/peps/pep-0305/#id13

As mentioned in the PEP, it was largely based upon a library named csv specifically built to parse CSV files: http://www.object-craft.com.au/projects/csv/

Maybe we just agree to disagree ¯\(ツ)/¯, but none of that really has anything to do with whether we, the "royal" we ;), in 2019, should be following a naming convention created in 2003 for an entirely different problem.

P.S. huge fan of your work with dask/dask-ml!

TomAugspurger commented 5 years ago

Thanks for linking to that PEP. I hadn’t seen it before.

On Mar 7, 2019, at 18:44, Jimmy Wan notifications@github.com wrote:

That's more of a historical artifact and/or implementation detail. It's highly unlikely that a pandas end user knows and/or cares about it.

Do you have a source for that?

I'm not sure that I understand your question or the motivation behind it, but...

The original PEP was written > 15 years ago and it's fairly clear that the specific problem they were trying to solve is handling all of the annoying corner cases involved with a poorly defined file format that is ill conceived. Using a delimiter that shows up all over the place in every day life and real world content is just begging to generate corner cases, and that it has, with all of the crazy rules you have to put in place to deal with commas or single/double quotes. There is a passing reference to trying to make it be something slightly more general in support of Excel and excel-generated content, because that's been a tool of choice of real world users for decades. There's really no mention of trying to specifically "name" it to reflect being a general purpose parser building block or anything of that sort. https://www.python.org/dev/peps/pep-0305/#id13

As mentioned in the PEP, it was largely based upon a library named csv specifically built to parse CSV files: http://www.object-craft.com.au/projects/csv/

Maybe we just agree to disagree ¯(ツ)/¯, but none of that really has anything to do with whether we, the "royal" we ;), in 2019, should be following a naming convention created in 2003 for an entirely different problem.

P.S. huge fan of your work with dask/dask-ml!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

st-bender commented 5 years ago

Note that I am not really opposed to keeping both functions, so the title may be a little provocative and could be adjusted.

The title was meant to provoke discussion, otherwise you may not have stepped in. ;)

I understand there are some frustrations with the deprecation of read_table and that conversation can certainly continue. However, if the request to deprecate read_csv is frivolous than I would at least ask that you update the title to reflect what you are actually asking for (ex: "Undeprecate read_table") instead of what it currently is, as otherwise we'll continue down tangential paths that aren't helpful regardless of anyone's viewpoint on read_table.

Did that, is it better now?

But I realize that we are fighting over two lines of code instead of fixing the almost 3000 real bugs. And I can't help you with that unless you really keep your API stable and backwards compatible.

Pandas is an open source tool built off of volunteer hours. We would gladly welcome contributions from you to improve the tool. Otherwise I'd just like to call out that incendiary and insulting comments don't contribute anything to the tool or the community surrounding it.

Sorry, I didn't mean to insult anybody. My problem is that I write code to do my my work and want to make sure that it still works in a few years from now when maybe someone else takes it over, or wants to reproduce my results. I can't do that if the API is constantly changing, or I would have to cap the max versions on the dependencies. So I am looking for solutions that are --- hopefully --- future proof.

nspies commented 5 years ago

Add my vote to keeping read_table around!

I've been a pandas user for 7+ years and have used read_table(...,sep="\t") where necessary. I agree with OP that I far prefer the read_table form as it sounds more general.

I will note that the documentation supports this interpretation: read_table is a function to "Read general delimited file into DataFrame." while read_csv is to "Read a comma-separated values (csv) file into DataFrame."

I have to say in my corner of datasciences, no one has touched the standard csv module, while everyone uses pandas, so I think it should decide for itself what's right. (A better point of comparison is probably R, which has both a read.csv and read.table function, where read.csv is an alias for the more general-purposeread.table.)

I generally agree with minimizing aliases, but this is at the core of what pandas does and is a very widespread alias. While "pd.read_csv" is clearly more commonly used -- I wouldn't suggest getting rid of it! -- a quick unscientific github search yields 50k+ results for "pd.read_table".

tzuni commented 5 years ago

In bioinformatics I have to deal with a LOT of tab-delimited data formats and almost never encounter comma-delimited data so this change has made a very prominent impact in my daily use. I too would like to see read_table stick around.

In my view implementing a general case read_delim function would make sense with read_csv and read_table becoming convenience functions to it. This would simplify code maintenance while providing a convenient, more readable, and stable API for two common use-cases.

WillAyd commented 5 years ago

@pandas-dev/pandas-core I know this was talked about during the call yesterday. FWIW I actually like the read_delim base with read_csv and read_table simply wrapping that as described by @jimmywan so I'd be +1 for keeping this around and moving toward that paradigm in the long run

jorisvandenbossche commented 5 years ago

I just saw this in the enthought cheatsheet passing on Twitter (https://www.enthought.com/wp-content/uploads/Enthought-Python-Pandas-Cheat-Sheets-1-8-v1.0.2.pdf). They use read_table as the main example of an IO method. And from the analysis in https://github.com/Quansight-Labs/python-api-inspect/ (but don't know which libraries / notebooks they analysed), read_table is after read_csv the most popular read_ method, before things like read_excel and read_hdf.

Anyway, seeing those made me think about this issue. For me it are additional hints that it might be worth actually keeping the read_table name.

datapythonista commented 5 years ago

Sorry to re-emphasize it here, but I think we'll have an increasing demand to add I/O modules, and I don't think it should be our job to maintain the code to import and export from every possible format out there.

I made my proposal in #26804, which surely is not perfect, but I think we should find a way for third-party developers to implement their own connectors and let them integrate with pandas in the best way (for the developers of the libraries, and for the users, creating consistent APIs for both).

jorisvandenbossche commented 5 years ago

That's for sure a good discussion to have, but I think not very related to the question: do we keep read_table being available by default in the top-level namespace?

datapythonista commented 5 years ago

Sorry, you're right, I wouldn't keep read_table.

What I meant with my previous comment is that I'd let users have read_table as a third-party module, and that I'd facilitate that happening (since we'd have the read_table code in pandas, it's not a great example of proper IO module, but more an alias, but still think I think make sense).

jorisvandenbossche commented 5 years ago

I would say that is an argument to keep it now (for 0.25.0), so that we could afterwards say: "if you want to keep using it, install this package".

datapythonista commented 5 years ago

I'm ok with that too. But since it's already deprecated, I'd probably keep it deprecated (I don't think the discussion is about removing it, or is it?)

st-bender commented 5 years ago

Hi,

Sorry, you're right, I wouldn't keep read_table.

Thanks for your opinion, but I think it would be appreciated if you could elaborate on why you think that the arguments in this issue in favour of keeping it are not relevant. In my case I would rather say "I wouldn't keep read_csv()". But this has been already discussed above, and a solution was proposed to keep both people happy.

datapythonista commented 5 years ago

it would be appreciated if you could elaborate on why you think that the arguments in this issue in favour of keeping it are not relevant

Sorry if it wasn't clear for you. read_table has already been deprecated. As pandas maintainers we'll be happy to not maintain two function that do the same. And users will appreciate not having to learn about two functions that do the same. read_table or read_txt may be more appropriate names, I can agree on that. But there is no way we're deprecating read_csv, is by far the most used function in pandas (based on views of the documentation pages).

The main point of my previous comment, is that we should make it easy for people like you to implement your own I/O connectors, and use them as if they were provided by pandas. So, in this case you'll be able to maintain a third-party package, providing read_table / to_table functionality, keep your code base as it is, by just installing and importing a module. And we won't have maintain code we don't want to maintain. I think that solution the best for everybody, besides the (in my opinion) all the other benefits detailed in #26804.

st-bender commented 5 years ago

it would be appreciated if you could elaborate on why you think that the arguments in this issue in favour of keeping it are not relevant

Sorry if it wasn't clear for you. read_table has already been deprecated. As pandas maintainers we'll be happy to not maintain two function that do the same.

I know that, that's why I opened this issue because I think it is a step into the wrong direction. And if you followed the discussion, my opinion may not reflect the majority, but I am not alone either.

And users will appreciate not having to learn about two functions that do the same. read_table or read_txt may be more appropriate names, I can agree on that.

As I user I would appreciate a stable API, please stick to the discussion here, all the points have already been raised.

But there is no way we're deprecating read_csv, is by far the most used function in pandas (based on views of the documentation pages).

See above, please read the discussion in the current issue carefully.

The main point of my previous comment, is that we should make it easy for people like you to implement your own I/O connectors, and use them as if they were provided by pandas.

If I have to do that, then pandas is useless (for me). Its IO facility and quite intuitive API was the main reason for me to use it at all. If you are going to remove functionality from the main API, I believe that will only hurt the people using pandas. And deprecation as I understand it, is a warning that a function will be removed in future versions.

So, in this case you'll be able to maintain a third-party package, providing read_table / to_table functionality, keep your code base as it is, by just installing and importing a module. And we won't have maintain code we don't want to maintain. I think that solution the best for everybody, besides the (in my opinion) all the other benefits detailed in #26804.

Thank you, but I have no interest in that. As I user, I want to use a module, not program around functions that have been removed for no good reason The linked issue is an implementation issue, not API related, and therefore separate from the current issue. However you implement the details is of no concern to the user.

Edit: I realize that the above sounds like a demand without any contribution, please don't take it personally, I appreciate all the work from the maintainers. I just think that keeping an API stable will have more benefits in the long run.

jorisvandenbossche commented 5 years ago

We decided at the sprint here to keep it for now, so to remove the deprecation. PR at https://github.com/pandas-dev/pandas/pull/27102

jondo commented 4 years ago

I just want to note that CSV is also very often used in the more general meaning "character-separated values". Maybe the documentation of read_csvshould switch to that definition?

pandas-dev / pandas

Keep (un-deprecate) `read_table()` for API stability #25220

Problem description

Proposed solution