pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.78k stars 17.97k forks source link

DOC clarify that pct_change() scale is actually 0-1, i.e. fractional_change() #20752

Closed sunilshah closed 1 year ago

sunilshah commented 6 years ago

Code Sample, a copy-pastable example if possible

# Your code here
>>>import pandas as pd
>>>a=pd.Series([0,1,2,3,4])
>>> a.pct_change()
0         NaN
1         inf
2    1.000000
3    0.500000
4    0.333333
dtype: float64

Problem description

This is fraction change and not percent change.

Dictionary definition of percent is "One part in every hundred."

Either change the documentation to state that it is fractional change and not percent change with the formula (for backward compatibility) or rename it (to get rid of the obvious error).

Expected Output

a.pct_change()*100 Val 2017-12-31 NaN 2018-12-31 100.000000 % 2019-12-31 50.000000 % 2020-12-31 33.333333 % 2021-12-31 25.000000 %

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Darwin OS-release: 17.5.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.22.0 pytest: 3.5.0 pip: 10.0.0 setuptools: 39.0.1 Cython: None numpy: 1.14.2 scipy: 1.0.1 pyarrow: None xarray: None IPython: 6.3.1 sphinx: None patsy: None dateutil: 2.7.2 pytz: 2018.4 blosc: None bottleneck: None tables: 3.4.2 numexpr: 2.6.4 feather: None matplotlib: 2.2.2 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 1.0.1 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: 0.6.0 [
TomAugspurger commented 6 years ago

How would you re-phrase the first sentence? Do the examples not make it clear?

sunilshah commented 6 years ago

Hi TomAugspurger,

Are you asking about re-phrasing the title or the first sentence of the Problem Description in my post?

Which examples are you referring to?

TomAugspurger commented 6 years ago

The first sentence & examples in the docstring: http://pandas-docs.github.io/pandas-docs-travis/generated/pandas.Series.pct_change.html

On Thu, Apr 19, 2018 at 9:06 PM, sunilshah notifications@github.com wrote:

Hi TomAugspurger,

Are you asking about re-phrasing the title or the first sentence of the Problem Description in my post?

Which examples are you referring to?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/20752#issuecomment-382943291, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIt3woWkn7RnlSOmmMPcZbkk89FOIks5tqUKjgaJpZM4Tcu9g .

sunilshah commented 6 years ago

I would change "Percent Change" to "Fractional Change" for every occurrence.

The examples do imply fractional changes. When I looked at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pct_change.html

there were no examples.

TomAugspurger commented 6 years ago

I haven't heard the term "Fractional Change" before, is it standard?

The docs I linked to are the developer docs, with a recently improved docstring. The ones you saw were (understandably) for the currently released version.

Are the improved docs sufficiently clear?

On Thu, Apr 19, 2018 at 9:15 PM, sunilshah notifications@github.com wrote:

I would change "Percent Change" to "Fractional Change" for every occurrence.

The examples do imply fractional changes. When I looked at https://pandas.pydata.org/pandas-docs/stable/generated/ pandas.DataFrame.pct_change.html

there were no examples.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/20752#issuecomment-382945266, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIkHqnUs8TbQ4_uBDqpF9wwpi5cCHks5tqUS-gaJpZM4Tcu9g .

sunilshah commented 6 years ago

The dictionary meaning of fractional (adjective) is precisely that: pertaining to or being a fraction.

http://www.dictionary.com/browse/fractional https://www.merriam-webster.com/dictionary/fractional

It could also be called relative change. Percent change implies 100 times the fractional or relative change. Therefore the name and the description are misleading.

The improved docs are better than the original because of the examples. The user still has to go through the numbers in the examples to compute and figure out that the author really meant fractions and not percentage, even though it mentions percent eight times. IMHO it should be corrected.

jreback commented 6 years ago

this is the first time anyone has brought this up, even though this method has existed since the beginning of pandas. this is splitting hairs about the semantic meaning. This is a won't fix.

sunilshah commented 6 years ago

The only reason to consider a change (to the docs), is to help new users learn faster. That was my motivation to spend my time filing the request.

TomAugspurger commented 6 years ago

Thanks. I think we're OK with the current docs for now. We have this issue as a record in case someone else is confused, and we can change it then.

On Fri, Apr 20, 2018 at 10:27 AM, sunilshah notifications@github.com wrote:

The only reason to consider a change (to the docs), is to help new users a faster ramp in using pandas. That was my motivation to spend my time filing the request.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/20752#issuecomment-383132730, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQHIjb3wxCRbOUbjX9YcjpyKkOYkI3Mks5tqf5pgaJpZM4Tcu9g .

mitchnegus commented 6 years ago

For what it's worth, I was calculating percent differences in a class today and had a student use this function. They neglected to multiply by 100 because they didn't check the docs (and I was surprised myself at the behavior after reading them). I used it as a good teachable moment for why you should always check the docs, but I don't think that's a particularly good reason for keeping it that way. I volunteer to make the change if you're open to it, but I have to agree with sunilshah here that the standard definition of percent change is out of 100.

tr-ips-bahubali commented 5 years ago

Yup, the description should specify that it gives fractional output instead of the expected percentage change output.

matthew-brett commented 5 years ago

We just came across this too, when a student asked why the values weren't percents. I can't say why people have not complained, but it's very confusing to see that the method is called pct_change and there is "percentage change" through the docstring, when in fact it is fractional change. In particular you have:

Percentage change between the current and a prior element.

Computes the percentage change from the immediately previous row by
default. This is useful in comparing the percentage of change in a time
series of elements.

This is just incorrect. It should read:

Fractional change between the current and a prior element.

*Be careful* - despite the name of this method, it calculates fractional
change and not percentage change.

Computes the fractional change from the immediately previous row by
default. This is useful in comparing the fraction of change in a time
series of elements.

If you need the percentage change, multiply these values by 100.

All the other instances of "percent" should be "fraction".

thusithaC commented 5 years ago

I agree that this is extremely confusing. At least the docs should be changed. The current Doc says thus, which is very much misleading. Percentage is scaled to 100.

`Percentage change between the current and a prior element.

Computes the percentage change from the immediately previous row by default. This is useful in comparing the percentage of change in a time series of elements.`

brcharron commented 4 years ago

I also find this highly misleading as this function simply does not involve any percentage. I end up not using it as it makes my code less understandable. Another name which may be more common and intuitive than fractional change would be relative change.

matthew-brett commented 4 years ago

@charron - of course I completely agree, but I'm afraid we didn't get anywhere when I tried bringing this up on the list - starting here. The argument against doing anything about this appears to be, that when some presumably large proportion of Pandas users see a 'pct_change' of 1.0, from the values [1, 2], they instinctively think of this as being a proportion (rather than a percent), and they regard the actual number - 1.0 - as representing 100%, just not displayed as such. I think. But you're probably best reading it yourself, I doubt I'm doing the argument justice.

brcharron commented 4 years ago

@matthew-brett thank you for linking the extended discussion in the mailing-list, I was not aware of it. Maybe this is a matter of background as one poster mentioned. I am only surprised that this issue did not get more traction, which unfortunately lends support to the argument against doing anything.

matthew-brett commented 4 years ago

It seems obvious to me that there's at least a significant proportion of users who find this behavior somewhere between surprising and clearly incorrect. As someone who uses and teaches Pandas a lot, has contributed some commits and the original wheel build machinery, and wants it to succeed, it's sad for me that this one doesn't have a better solution than "It's fine as it is".

buhrmann commented 4 years ago

Maybe this should go into a separate issue, but I think in addition to the name of pct_change, the formula on my interpretation is not actually correct for negative values:

In [2]: pd.Series([-2, -1]).pct_change()                                                                                                                                                                                                                                        
Out[2]: 
0    NaN
1   -0.5
dtype: float64

I would expect a change from -2 to -1 to be a 50% increase, not a decrease. But perhaps there is a scenario where the negative value is what is desired?

thorr18 commented 3 years ago

This gives a per unit change. It is not a per cent change. (cent is latin for 100) Even if one doesn't want to fix the badly named function, the doc string should not say "Percentage change" for it does not actually produce a per cent change. The output is actually a relative change expressed as a per unit change. The relative change is not expressed as a percentage change. The docs and function name would both be correct if they called it relative change. They would likewise be correct if they called it more specifically a per unit change. They would not be correct if they call it a percentage change. https://en.wikipedia.org/wiki/Relative_change_and_difference#Percentage_change

MarcoGorelli commented 3 years ago

Hi @thorr18 - I agree that the docstring could do with being clarified - are you interested in submitting a pull request? If so, here's the contributing guide

Hi @buhrmann -0.5 is correct, according to the definition ( (-1 - (-2)) / -2 = - 0.5) . If you disagree, please open a new issue

gustavovargas commented 1 year ago

Yes, it should be indicated in the doc that it is a relative rate and not a percent. It is confusing.

MarcoGorelli commented 1 year ago

sure - interested in submitting a pull request to clarify the docs?

opening as a doc change

beyondguo commented 1 year ago

This is clearly an error in the document or function naming. Why are you still trying to argue? Please fix this bug quickly and stop misleading beginners!

MarcoGorelli commented 1 year ago

@beyondguo thanks for your interest and constructive criticism - are you interested in submitting a PR to clarify the docs? https://pandas.pydata.org/docs/dev/development/contributing_docstring.html

I think this comment has some good suggestions for how to clarify the docs

gustavovargas commented 1 year ago

take

matthew-brett commented 1 year ago

It's certainly a good idea to update the docstring, but it is a little difficult to do it. It seems to me the docstring would have to start with something like:

Although the method is named `pct_change`, in fact, it returns proportional change, and not percent change.
To get percent change, multiply the returned value by 100.

I do care about this, and I'm happy to write the docstring - but would such a change really be acceptable?

MarcoGorelli commented 1 year ago

looks like @gustavovargas has already taken this

I think all that needs adding is a note, just like there's a note here https://pandas.pydata.org/docs/reference/api/pandas.Series.searchsorted.html

matthew-brett commented 1 year ago

Well - that note - for those who did not follow the link - is "The Series must be monotonically sorted, otherwise wrong locations will likely be returned.". That's a perfectly reasonable caveat in usage. Here we have the problem of explaining why the method name strongly implies it is returning something other than its actual result. But - just to ask again - is something like my suggestion acceptable?

MarcoGorelli commented 1 year ago

yeah looks fine