API: Standardize usage of underscores in multi-word kwargs

palewire commented 5 years ago

This ticket is an outgrowth of a discussion in pull request #22587

By my rough count, the read_csv method has nearly 50 keyword arguments.

Of those, 32 arguments are made up or two or more words. Twenty of those multi-word arguments use an underscore to mark the space between words, like skip_blank_lines and parse_dates. Twelve do not, like chunksize and lineterminator.

It is my opinion this is a small flaw in pandas' API, and that the library would benefit by standardizing how spaces are handled. It would make pandas more legible and consistent, and therefore easier for users of all experience levels.

I have taught pandas to dozens of newbies across the country and I can testify from experience that small variations in the naming style of commonly used methods introduces unnecessary frustration, and can even reduce user confidence in the quality of the overall product.

As a frequent user of pandas, I can also attest that the inconsistencies require me, someone who uses the library daily, to routinely consult the documentation to ensure I use the proper kwarg naming style.

I am sympathetic to the desire to maintain backwards compatibility, which I believe could be managed with deprecation warnings that, if included, could be temporary, and ultimately removed in a future version, much in the way sort_values was introduced.

Since the underscore method of handling word breaks is more common and more legible, I propose it be adopted. All existing multi-word arguments without an underscore would need to be modified. You can find an experimental patch of the skiprows kwargs, and considerable support from other users for pursuing this type of change, in #22587.

If that pull request is ultimately merged, and the maintainers agree with the larger goal I've tried to articulate here, I would be pleased to lead an effort to expand whatever design pattern is agreed upon to other keyword arguments across the library.

palewire commented 5 years ago

cc: @gfyoung @jreback xref: #13349

gfyoung commented 5 years ago

@palewire : Thanks opening this! After some thought, I think underscores would make sense for readability purposes as you said.

@jreback : Thoughts?

jreback commented 5 years ago

as i said in the issue pls show a matrix of all the proposed changes

palewire commented 5 years ago

Here's a first pass.

input	two_words	underscore	proposed_change	deprecate
filepath_or_buffer	TRUE	TRUE	-
sep	FALSE		-	TRUE
delimiter	FALSE		-
delim_whitespace	TRUE	TRUE	delimiter_whitespace
header	FALSE		-
names	FALSE		-
index_col	TRUE	TRUE	-
usecols	TRUE	FALSE	use_cols
squeeze	FALSE		-
prefix	FALSE		-
mangle_dupe_cols	TRUE	TRUE	-
dtype	TRUE	FALSE	-
engine	FALSE		-
converters	FALSE		-
true_values	TRUE	TRUE	-
false_values	TRUE	TRUE	-
skipinitialspace	TRUE	FALSE	skip_initial_space
skiprows	TRUE	FALSE	skip_rows
skipfooter	TRUE	FALSE	skip_footer
nrows	TRUE	FALSE	n_rows
na_values	TRUE	TRUE	-
keep_default_na	TRUE	TRUE	-
na_filter	TRUE	TRUE	-
verbose	FALSE		-
skip_blank_lines	TRUE	TRUE	-
parse_dates	TRUE	TRUE	-
infer_datetime_format	TRUE	TRUE	-
keep_date_col	TRUE	TRUE	-
date_parser	TRUE	TRUE	-
dayfirst	TRUE	FALSE	day_first
iterator	FALSE		-
chunksize	TRUE	FALSE	chunk_size
compression	FALSE		-
thousands	FALSE		-
decimal	FALSE		-
float_precision	TRUE	TRUE	-
lineterminator	TRUE	FALSE	line_terminator
quotechar	TRUE	FALSE	quote_character
quoting	FALSE		-
doublequote	TRUE	FALSE	double_quote
escapechar	TRUE	FALSE	escape_character
comment	FALSE		-
encoding	FALSE		-
dialect	FALSE		-
tupleize_cols	TRUE	TRUE	-
error_bad_lines	TRUE	TRUE	-
warn_bad_lines	TRUE	TRUE	-
low_memory	TRUE	TRUE	-
memory_map	TRUE	TRUE	-

jreback commented 5 years ago

can u add another column about disposal either deprecation - or keep duplication

palewire commented 5 years ago

I'd be happy to make that work part of this ticket, but I'm not sure I have firm enough opinions about which kwargs ought to be deprecated. I'd nominate sep due its duplication by delimiter, if pressed.

jreback commented 5 years ago

right that’s the whole point here - to decide these things

palewire commented 5 years ago

It's done

On Tue, Sep 11, 2018, 4:46 AM Jeff Reback notifications@github.com wrote:

right that’s the whole point here - to decide these things

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/22639#issuecomment-420244939, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAnCUAMVdt57-zip-3GNKWr7oX5cX-Uks5uZ6KIgaJpZM4WgBCv .

jreback commented 5 years ago

@palewire looks like he name column is hidden?

palewire commented 5 years ago

I've posted that fix. Thank you for your patience.

h-vetinari commented 5 years ago

+1 on the whole thing, -1 on deprecating sep (that's ubiquitous in string methods like python str.split, pandas Series.str.cat, etc.)

While we're at it, I would also prefer a better name for mangle_dupe_cols, and instead of two booleans for error_bad_lines/warn_bad_lines, there could be a error_bad_lines = {'raise'|'warn'|'ignore'}.

gfyoung commented 5 years ago

@h-vetinari : The error_bad_lines is certainly worth considering. Let's open that in a separate issue and see what the consensus is on that.

gfyoung commented 5 years ago

While we're at it, I would also prefer a better name for mangle_dupe_cols

@h-vetinari : Suggestions? I find it pretty descriptive IMO.

h-vetinari commented 5 years ago

@gfyoung

While we're at it, I would also prefer a better name for mangle_dupe_cols

@h-vetinari : Suggestions? I find it pretty descriptive IMO.

Can't think of something shorter off the top of my head. My thoughts are that "mangle" (and "dupe") is not an easy word for non-native speakers, and it sounds much more informal than the rest of the kwargs.

Edit: maybe something like force_unique_cols, but that would need the boolean complement to mangle_dupe_cols. Actually, I find the description of the kwarg

Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.

to be unclear - first it says "rather than ‘X’…’X’", but then it says duplicates are dropped? Maybe it would be clearer to have something like keep_duplicate_cols?

gfyoung commented 5 years ago

Maybe it would be clearer to have something like keep_duplicate_cols?

@h-vetinari : Hmm...not sure. That has the downside of not being clear as to what happens when keep_duplicate_cols is True. I wouldn't worry about this one too much. We currently don't support mangle_dupe_cols=False, so it's more important that users know what happens when it's True.

mroeschke commented 5 years ago

Small note: I presume the reason why dayfirst is one word is because the same keyword argument exists in to_datetime (and the parsers use to_datetime internally too): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html

So if dayfirst were to change to day_first, the same change should be made in to_datetime as well (and maybe other places I am not remembering of the top of my head)

jorisvandenbossche commented 5 years ago

Note we discussed this in https://github.com/pandas-dev/pandas/pull/13386 (in the PR, the issue was already linked above), and basically we then decided this was not worth it (which of course does not mean it cannot be reconsidered, but it is worth looking at that discussion)

palewire commented 5 years ago

@jorisvandenbossche, I'd encourage you to consider the response from users in #22587. There is a desire out there for "tidying" of this sort. IMHO, it is worth the work.

gfyoung commented 5 years ago

Note we discussed this in #13386 (in the PR, the issue was already linked above), and basically we then decided this was not worth it

@jorisvandenbossche : That is a fair point. One consideration is that with 1.0 on the horizon, does it make sense to keep things as they are for "historical purposes" ? That was part of the push-back when it was first floated in #13386.

jorisvandenbossche commented 5 years ago

@palewire Yes, I read that thread. But there is also the desire of users to have code that keeps working (who are maybe less vocal here)

There are many warts in pandas I would like to see cleaned up, but many of them we nevertheless keep for historical reasons, and fixing them would have a too big impact.

Now, we can also think about ways that have less impact. In the previous discussion we mentioned keeping the old ones as aliases instead of deprecating ("deprecate by documentation"). But of course, seeing them different ones used can also create confusion.

palewire commented 5 years ago

@jorisvandenbossche, I am also sensitive to the need for backwards compatibility. That's why my exploratory patch in the pull request included an effort to continue to support the old keyword namespace and interject a warning of potential future deprecation for the user.

Taking that step and standardizing the names in read_csv would follow clearly in the footsteps of what's been done for other methods, such as read_excel.

screenshot from 2018-09-17 14-32-53

My goal here is to simply extend the logic of patches like that across a wider stretch of methods. If we proceed along those lines, without deprecating any keywords entirely, would you object?

dhimmel commented 5 years ago

I agree that underscore spacing for multi-word arguments is most pythonic and readable. Hopefully, this issue will help ensure new keyword arguments use underscore spacing.

Upgrading existing arguments will be painful, both in terms of developer time and the deprecation / backwards incompatibility issues that will arise. On the other hand, if these changes are going to be made, then sooner is better. I lean on the side of continually improving the Pandas design, since data science is a quickly-developing field.

It seems like disruptive changes like this would be most natural for a pandas overhaul like Pandas 2.0. However, it's not clear whether Pandas 2 is an active proposal? If not, perhaps it makes sense to bite the bullet now on standardizing existing argument names.

TomAugspurger commented 5 years ago

@palewire would you have time to update your list to indicate whether a kwarg is mimicking a keyword from another library (e.g. dayfirst python-dateutil, quotechar char from the stdlib) and whether the keyword is maybe used in other places in pandas (e.g. chunksize)

palewire commented 5 years ago

@TomAugspurger, are you of the opinion that certain kwargs should not be given an underscore if they are compound words in other libraries or elsewhere in pandas?

TomAugspurger commented 5 years ago

Yes. I think the value of matching the stdlib in cases like quotechar outweighs the value of having all our keywords be consistent.

Things like dayfirst are less clear since the fact that we're using python-dateutil in that case is mostly a (documented) implementation detail.

palewire commented 5 years ago

I'm happy to do the research, though I'm not sure I fully agree. Some kind of scope should be set on that rule, though, shouldn't it? Keywords used in the standard Python library?

TomAugspurger commented 5 years ago

I don't think there's a clear dividing line here. It's just a values judgement.

On Tue, Oct 2, 2018 at 11:16 AM Ben Welsh notifications@github.com wrote:

I'm happy to do the research, though I'm not sure I fully agree. Some kind of scope should be set on that rule, though, shouldn't it? Keywords used in the standard Python library?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/22639#issuecomment-426335929, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIjqSte7BvR5grHfeI2Naq1nNiF-Yks5ug5FhgaJpZM4WgBCv .

minggli commented 5 years ago

Hi added PR #23158 in regards to deprecating delimiter on read_csv following previous discussion.

pandas-dev / pandas

API: Standardize usage of underscores in multi-word kwargs #22639