Open palewire opened 5 years ago
cc: @gfyoung @jreback xref: #13349
@palewire : Thanks opening this! After some thought, I think underscores would make sense for readability purposes as you said.
@jreback : Thoughts?
as i said in the issue pls show a matrix of all the proposed changes
Here's a first pass.
input | two_words | underscore | proposed_change | deprecate |
---|---|---|---|---|
filepath_or_buffer | TRUE | TRUE | - | |
sep | FALSE | - | TRUE | |
delimiter | FALSE | - | ||
delim_whitespace | TRUE | TRUE | delimiter_whitespace | |
header | FALSE | - | ||
names | FALSE | - | ||
index_col | TRUE | TRUE | - | |
usecols | TRUE | FALSE | use_cols | |
squeeze | FALSE | - | ||
prefix | FALSE | - | ||
mangle_dupe_cols | TRUE | TRUE | - | |
dtype | TRUE | FALSE | - | |
engine | FALSE | - | ||
converters | FALSE | - | ||
true_values | TRUE | TRUE | - | |
false_values | TRUE | TRUE | - | |
skipinitialspace | TRUE | FALSE | skip_initial_space | |
skiprows | TRUE | FALSE | skip_rows | |
skipfooter | TRUE | FALSE | skip_footer | |
nrows | TRUE | FALSE | n_rows | |
na_values | TRUE | TRUE | - | |
keep_default_na | TRUE | TRUE | - | |
na_filter | TRUE | TRUE | - | |
verbose | FALSE | - | ||
skip_blank_lines | TRUE | TRUE | - | |
parse_dates | TRUE | TRUE | - | |
infer_datetime_format | TRUE | TRUE | - | |
keep_date_col | TRUE | TRUE | - | |
date_parser | TRUE | TRUE | - | |
dayfirst | TRUE | FALSE | day_first | |
iterator | FALSE | - | ||
chunksize | TRUE | FALSE | chunk_size | |
compression | FALSE | - | ||
thousands | FALSE | - | ||
decimal | FALSE | - | ||
float_precision | TRUE | TRUE | - | |
lineterminator | TRUE | FALSE | line_terminator | |
quotechar | TRUE | FALSE | quote_character | |
quoting | FALSE | - | ||
doublequote | TRUE | FALSE | double_quote | |
escapechar | TRUE | FALSE | escape_character | |
comment | FALSE | - | ||
encoding | FALSE | - | ||
dialect | FALSE | - | ||
tupleize_cols | TRUE | TRUE | - | |
error_bad_lines | TRUE | TRUE | - | |
warn_bad_lines | TRUE | TRUE | - | |
low_memory | TRUE | TRUE | - | |
memory_map | TRUE | TRUE | - |
can u add another column about disposal either deprecation - or keep duplication
I'd be happy to make that work part of this ticket, but I'm not sure I have firm enough opinions about which kwargs ought to be deprecated. I'd nominate sep
due its duplication by delimiter
, if pressed.
right that’s the whole point here - to decide these things
It's done
On Tue, Sep 11, 2018, 4:46 AM Jeff Reback notifications@github.com wrote:
right that’s the whole point here - to decide these things
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/22639#issuecomment-420244939, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAnCUAMVdt57-zip-3GNKWr7oX5cX-Uks5uZ6KIgaJpZM4WgBCv .
@palewire looks like he name column is hidden?
I've posted that fix. Thank you for your patience.
+1 on the whole thing,
-1 on deprecating sep
(that's ubiquitous in string methods like python str.split
, pandas Series.str.cat
, etc.)
While we're at it, I would also prefer a better name for mangle_dupe_cols
, and instead of two booleans for error_bad_lines/warn_bad_lines
, there could be a error_bad_lines = {'raise'|'warn'|'ignore'}
.
@h-vetinari : The error_bad_lines
is certainly worth considering. Let's open that in a separate issue and see what the consensus is on that.
While we're at it, I would also prefer a better name for
mangle_dupe_cols
@h-vetinari : Suggestions? I find it pretty descriptive IMO.
@gfyoung
While we're at it, I would also prefer a better name for mangle_dupe_cols
@h-vetinari : Suggestions? I find it pretty descriptive IMO.
Can't think of something shorter off the top of my head. My thoughts are that "mangle" (and "dupe") is not an easy word for non-native speakers, and it sounds much more informal than the rest of the kwargs.
Edit: maybe something like force_unique_cols
, but that would need the boolean complement to mangle_dupe_cols
. Actually, I find the description of the kwarg
Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
to be unclear - first it says "rather than ‘X’…’X’", but then it says duplicates are dropped? Maybe it would be clearer to have something like keep_duplicate_cols
?
Maybe it would be clearer to have something like keep_duplicate_cols?
@h-vetinari : Hmm...not sure. That has the downside of not being clear as to what happens when keep_duplicate_cols
is True
. I wouldn't worry about this one too much. We currently don't support mangle_dupe_cols=False
, so it's more important that users know what happens when it's True
.
Small note: I presume the reason why dayfirst
is one word is because the same keyword argument exists in to_datetime
(and the parsers use to_datetime
internally too): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html
So if dayfirst
were to change to day_first
, the same change should be made in to_datetime
as well (and maybe other places I am not remembering of the top of my head)
Note we discussed this in https://github.com/pandas-dev/pandas/pull/13386 (in the PR, the issue was already linked above), and basically we then decided this was not worth it (which of course does not mean it cannot be reconsidered, but it is worth looking at that discussion)
@jorisvandenbossche, I'd encourage you to consider the response from users in #22587. There is a desire out there for "tidying" of this sort. IMHO, it is worth the work.
Note we discussed this in #13386 (in the PR, the issue was already linked above), and basically we then decided this was not worth it
@jorisvandenbossche : That is a fair point. One consideration is that with 1.0
on the horizon, does it make sense to keep things as they are for "historical purposes" ? That was part of the push-back when it was first floated in #13386.
@palewire Yes, I read that thread. But there is also the desire of users to have code that keeps working (who are maybe less vocal here)
There are many warts in pandas I would like to see cleaned up, but many of them we nevertheless keep for historical reasons, and fixing them would have a too big impact.
Now, we can also think about ways that have less impact. In the previous discussion we mentioned keeping the old ones as aliases instead of deprecating ("deprecate by documentation"). But of course, seeing them different ones used can also create confusion.
@jorisvandenbossche, I am also sensitive to the need for backwards compatibility. That's why my exploratory patch in the pull request included an effort to continue to support the old keyword namespace and interject a warning of potential future deprecation for the user.
Taking that step and standardizing the names in read_csv
would follow clearly in the footsteps of what's been done for other methods, such as read_excel
.
My goal here is to simply extend the logic of patches like that across a wider stretch of methods. If we proceed along those lines, without deprecating any keywords entirely, would you object?
I agree that underscore spacing for multi-word arguments is most pythonic and readable. Hopefully, this issue will help ensure new keyword arguments use underscore spacing.
Upgrading existing arguments will be painful, both in terms of developer time and the deprecation / backwards incompatibility issues that will arise. On the other hand, if these changes are going to be made, then sooner is better. I lean on the side of continually improving the Pandas design, since data science is a quickly-developing field.
It seems like disruptive changes like this would be most natural for a pandas overhaul like Pandas 2.0. However, it's not clear whether Pandas 2 is an active proposal? If not, perhaps it makes sense to bite the bullet now on standardizing existing argument names.
@palewire would you have time to update your list to indicate whether a kwarg is mimicking a keyword from another library (e.g. dayfirst python-dateutil, quotechar char from the stdlib) and whether the keyword is maybe used in other places in pandas (e.g. chunksize)
@TomAugspurger, are you of the opinion that certain kwargs should not be given an underscore if they are compound words in other libraries or elsewhere in pandas?
Yes. I think the value of matching the stdlib in cases like quotechar
outweighs the value of having all our keywords be consistent.
Things like dayfirst
are less clear since the fact that we're using python-dateutil in that case is mostly a (documented) implementation detail.
I'm happy to do the research, though I'm not sure I fully agree. Some kind of scope should be set on that rule, though, shouldn't it? Keywords used in the standard Python library?
I don't think there's a clear dividing line here. It's just a values judgement.
On Tue, Oct 2, 2018 at 11:16 AM Ben Welsh notifications@github.com wrote:
I'm happy to do the research, though I'm not sure I fully agree. Some kind of scope should be set on that rule, though, shouldn't it? Keywords used in the standard Python library?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/22639#issuecomment-426335929, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIjqSte7BvR5grHfeI2Naq1nNiF-Yks5ug5FhgaJpZM4WgBCv .
Hi added PR #23158 in regards to deprecating delimiter
on read_csv following previous discussion.
This ticket is an outgrowth of a discussion in pull request #22587
By my rough count, the
read_csv
method has nearly 50 keyword arguments.Of those, 32 arguments are made up or two or more words. Twenty of those multi-word arguments use an underscore to mark the space between words, like
skip_blank_lines
andparse_dates
. Twelve do not, likechunksize
andlineterminator
.It is my opinion this is a small flaw in pandas' API, and that the library would benefit by standardizing how spaces are handled. It would make pandas more legible and consistent, and therefore easier for users of all experience levels.
I have taught pandas to dozens of newbies across the country and I can testify from experience that small variations in the naming style of commonly used methods introduces unnecessary frustration, and can even reduce user confidence in the quality of the overall product.
As a frequent user of pandas, I can also attest that the inconsistencies require me, someone who uses the library daily, to routinely consult the documentation to ensure I use the proper kwarg naming style.
I am sympathetic to the desire to maintain backwards compatibility, which I believe could be managed with deprecation warnings that, if included, could be temporary, and ultimately removed in a future version, much in the way
sort_values
was introduced.Since the underscore method of handling word breaks is more common and more legible, I propose it be adopted. All existing multi-word arguments without an underscore would need to be modified. You can find an experimental patch of the
skiprows
kwargs, and considerable support from other users for pursuing this type of change, in #22587.If that pull request is ultimately merged, and the maintainers agree with the larger goal I've tried to articulate here, I would be pleased to lead an effort to expand whatever design pattern is agreed upon to other keyword arguments across the library.