pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.09k stars 17.73k forks source link

API: Standardize usage of underscores in multi-word kwargs #22639

Open palewire opened 5 years ago

palewire commented 5 years ago

This ticket is an outgrowth of a discussion in pull request #22587

By my rough count, the read_csv method has nearly 50 keyword arguments.

Of those, 32 arguments are made up or two or more words. Twenty of those multi-word arguments use an underscore to mark the space between words, like skip_blank_lines and parse_dates. Twelve do not, like chunksize and lineterminator.

It is my opinion this is a small flaw in pandas' API, and that the library would benefit by standardizing how spaces are handled. It would make pandas more legible and consistent, and therefore easier for users of all experience levels.

I have taught pandas to dozens of newbies across the country and I can testify from experience that small variations in the naming style of commonly used methods introduces unnecessary frustration, and can even reduce user confidence in the quality of the overall product.

As a frequent user of pandas, I can also attest that the inconsistencies require me, someone who uses the library daily, to routinely consult the documentation to ensure I use the proper kwarg naming style.

I am sympathetic to the desire to maintain backwards compatibility, which I believe could be managed with deprecation warnings that, if included, could be temporary, and ultimately removed in a future version, much in the way sort_values was introduced.

Since the underscore method of handling word breaks is more common and more legible, I propose it be adopted. All existing multi-word arguments without an underscore would need to be modified. You can find an experimental patch of the skiprows kwargs, and considerable support from other users for pursuing this type of change, in #22587.

If that pull request is ultimately merged, and the maintainers agree with the larger goal I've tried to articulate here, I would be pleased to lead an effort to expand whatever design pattern is agreed upon to other keyword arguments across the library.

palewire commented 5 years ago

cc: @gfyoung @jreback xref: #13349

gfyoung commented 5 years ago

@palewire : Thanks opening this! After some thought, I think underscores would make sense for readability purposes as you said.

@jreback : Thoughts?

jreback commented 5 years ago

as i said in the issue pls show a matrix of all the proposed changes

palewire commented 5 years ago

Here's a first pass.

input two_words underscore proposed_change deprecate
filepath_or_buffer TRUE TRUE -  
sep FALSE   - TRUE
delimiter FALSE   -  
delim_whitespace TRUE TRUE delimiter_whitespace  
header FALSE   -  
names FALSE   -  
index_col TRUE TRUE -  
usecols TRUE FALSE use_cols  
squeeze FALSE   -  
prefix FALSE   -  
mangle_dupe_cols TRUE TRUE -  
dtype TRUE FALSE -  
engine FALSE   -  
converters FALSE   -  
true_values TRUE TRUE -  
false_values TRUE TRUE -  
skipinitialspace TRUE FALSE skip_initial_space  
skiprows TRUE FALSE skip_rows  
skipfooter TRUE FALSE skip_footer  
nrows TRUE FALSE n_rows  
na_values TRUE TRUE -  
keep_default_na TRUE TRUE -  
na_filter TRUE TRUE -  
verbose FALSE   -  
skip_blank_lines TRUE TRUE -  
parse_dates TRUE TRUE -  
infer_datetime_format TRUE TRUE -  
keep_date_col TRUE TRUE -  
date_parser TRUE TRUE -  
dayfirst TRUE FALSE day_first  
iterator FALSE   -  
chunksize TRUE FALSE chunk_size  
compression FALSE   -  
thousands FALSE   -  
decimal FALSE   -  
float_precision TRUE TRUE -  
lineterminator TRUE FALSE line_terminator  
quotechar TRUE FALSE quote_character  
quoting FALSE   -  
doublequote TRUE FALSE double_quote  
escapechar TRUE FALSE escape_character  
comment FALSE   -  
encoding FALSE   -  
dialect FALSE   -  
tupleize_cols TRUE TRUE -  
error_bad_lines TRUE TRUE -  
warn_bad_lines TRUE TRUE -  
low_memory TRUE TRUE -  
memory_map TRUE TRUE -  
jreback commented 5 years ago

can u add another column about disposal either deprecation - or keep duplication

palewire commented 5 years ago

I'd be happy to make that work part of this ticket, but I'm not sure I have firm enough opinions about which kwargs ought to be deprecated. I'd nominate sep due its duplication by delimiter, if pressed.

jreback commented 5 years ago

right that’s the whole point here - to decide these things

palewire commented 5 years ago

It's done

On Tue, Sep 11, 2018, 4:46 AM Jeff Reback notifications@github.com wrote:

right that’s the whole point here - to decide these things

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/22639#issuecomment-420244939, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAnCUAMVdt57-zip-3GNKWr7oX5cX-Uks5uZ6KIgaJpZM4WgBCv .

jreback commented 5 years ago

@palewire looks like he name column is hidden?

palewire commented 5 years ago

I've posted that fix. Thank you for your patience.

h-vetinari commented 5 years ago

+1 on the whole thing, -1 on deprecating sep (that's ubiquitous in string methods like python str.split, pandas Series.str.cat, etc.)

While we're at it, I would also prefer a better name for mangle_dupe_cols, and instead of two booleans for error_bad_lines/warn_bad_lines, there could be a error_bad_lines = {'raise'|'warn'|'ignore'}.

gfyoung commented 5 years ago

@h-vetinari : The error_bad_lines is certainly worth considering. Let's open that in a separate issue and see what the consensus is on that.

gfyoung commented 5 years ago

While we're at it, I would also prefer a better name for mangle_dupe_cols

@h-vetinari : Suggestions? I find it pretty descriptive IMO.

h-vetinari commented 5 years ago

@gfyoung

While we're at it, I would also prefer a better name for mangle_dupe_cols

@h-vetinari : Suggestions? I find it pretty descriptive IMO.

Can't think of something shorter off the top of my head. My thoughts are that "mangle" (and "dupe") is not an easy word for non-native speakers, and it sounds much more informal than the rest of the kwargs.

Edit: maybe something like force_unique_cols, but that would need the boolean complement to mangle_dupe_cols. Actually, I find the description of the kwarg

Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.

to be unclear - first it says "rather than ‘X’…’X’", but then it says duplicates are dropped? Maybe it would be clearer to have something like keep_duplicate_cols?

gfyoung commented 5 years ago

Maybe it would be clearer to have something like keep_duplicate_cols?

@h-vetinari : Hmm...not sure. That has the downside of not being clear as to what happens when keep_duplicate_cols is True. I wouldn't worry about this one too much. We currently don't support mangle_dupe_cols=False, so it's more important that users know what happens when it's True.

mroeschke commented 5 years ago

Small note: I presume the reason why dayfirst is one word is because the same keyword argument exists in to_datetime (and the parsers use to_datetime internally too): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html

So if dayfirst were to change to day_first, the same change should be made in to_datetime as well (and maybe other places I am not remembering of the top of my head)

jorisvandenbossche commented 5 years ago

Note we discussed this in https://github.com/pandas-dev/pandas/pull/13386 (in the PR, the issue was already linked above), and basically we then decided this was not worth it (which of course does not mean it cannot be reconsidered, but it is worth looking at that discussion)

palewire commented 5 years ago

@jorisvandenbossche, I'd encourage you to consider the response from users in #22587. There is a desire out there for "tidying" of this sort. IMHO, it is worth the work.

gfyoung commented 5 years ago

Note we discussed this in #13386 (in the PR, the issue was already linked above), and basically we then decided this was not worth it

@jorisvandenbossche : That is a fair point. One consideration is that with 1.0 on the horizon, does it make sense to keep things as they are for "historical purposes" ? That was part of the push-back when it was first floated in #13386.

jorisvandenbossche commented 5 years ago

@palewire Yes, I read that thread. But there is also the desire of users to have code that keeps working (who are maybe less vocal here)

There are many warts in pandas I would like to see cleaned up, but many of them we nevertheless keep for historical reasons, and fixing them would have a too big impact.

Now, we can also think about ways that have less impact. In the previous discussion we mentioned keeping the old ones as aliases instead of deprecating ("deprecate by documentation"). But of course, seeing them different ones used can also create confusion.

palewire commented 5 years ago

@jorisvandenbossche, I am also sensitive to the need for backwards compatibility. That's why my exploratory patch in the pull request included an effort to continue to support the old keyword namespace and interject a warning of potential future deprecation for the user.

Taking that step and standardizing the names in read_csv would follow clearly in the footsteps of what's been done for other methods, such as read_excel.

screenshot from 2018-09-17 14-32-53

My goal here is to simply extend the logic of patches like that across a wider stretch of methods. If we proceed along those lines, without deprecating any keywords entirely, would you object?

dhimmel commented 5 years ago

I agree that underscore spacing for multi-word arguments is most pythonic and readable. Hopefully, this issue will help ensure new keyword arguments use underscore spacing.

Upgrading existing arguments will be painful, both in terms of developer time and the deprecation / backwards incompatibility issues that will arise. On the other hand, if these changes are going to be made, then sooner is better. I lean on the side of continually improving the Pandas design, since data science is a quickly-developing field.

It seems like disruptive changes like this would be most natural for a pandas overhaul like Pandas 2.0. However, it's not clear whether Pandas 2 is an active proposal? If not, perhaps it makes sense to bite the bullet now on standardizing existing argument names.

TomAugspurger commented 5 years ago

@palewire would you have time to update your list to indicate whether a kwarg is mimicking a keyword from another library (e.g. dayfirst python-dateutil, quotechar char from the stdlib) and whether the keyword is maybe used in other places in pandas (e.g. chunksize)

palewire commented 5 years ago

@TomAugspurger, are you of the opinion that certain kwargs should not be given an underscore if they are compound words in other libraries or elsewhere in pandas?

TomAugspurger commented 5 years ago

Yes. I think the value of matching the stdlib in cases like quotechar outweighs the value of having all our keywords be consistent.

Things like dayfirst are less clear since the fact that we're using python-dateutil in that case is mostly a (documented) implementation detail.

palewire commented 5 years ago

I'm happy to do the research, though I'm not sure I fully agree. Some kind of scope should be set on that rule, though, shouldn't it? Keywords used in the standard Python library?

TomAugspurger commented 5 years ago

I don't think there's a clear dividing line here. It's just a values judgement.

On Tue, Oct 2, 2018 at 11:16 AM Ben Welsh notifications@github.com wrote:

I'm happy to do the research, though I'm not sure I fully agree. Some kind of scope should be set on that rule, though, shouldn't it? Keywords used in the standard Python library?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/22639#issuecomment-426335929, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIjqSte7BvR5grHfeI2Naq1nNiF-Yks5ug5FhgaJpZM4WgBCv .

minggli commented 5 years ago

Hi added PR #23158 in regards to deprecating delimiter on read_csv following previous discussion.