pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.63k stars 17.91k forks

pct_change computes incorrect values #18920

Closed bgits closed 6 years ago

bgits commented 6 years ago

btc_nvt.txt

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

def to_weekly(dataframe, field=None):
    dataframe.index = pd.to_datetime(dataframe.index)
    if field:
        dataframe = dataframe[field]
    return dataframe.resample('W').mean()

btc_frame = pd.read_csv('btc_nvt.csv')
btc_frame = btc_frame.shift(periods=1, freq=None, axis=1)
btc_frame = btc_frame.drop(['Date'], axis=1)
btc_frame['nvt'] = btc_frame['marketcap(USD)'] / btc_frame['txVolume(USD)']
btc_frame['price(%)'] = btc_frame['price(USD)'].pct_change(1)
btc_frame['% man'] = (btc_frame['price(USD)'] - btc_frame['price(USD)'].shift(1)) / btc_frame['price(USD)'].shift(1)
btc_frame['1 shift'] = btc_frame['price(USD)'].shift(1)

btc_frame = to_weekly(btc_frame)
print(btc_frame[['price(USD)', 'price(%)', '% man', '1 shift']])

Problem description

The values in price(%) and % man are incorrect. It is not clear how those values are computed, because they do not align with what pct_change should produce. In addition, the 1 shift column does not appear to be price(USD) shifted by 1.

I have also tried the latest version of pandas, installed via pip, with the same result. Perhaps this is related to the way the csv is being imported and prepared? I'm attaching the csv as well (as .txt, since GitHub does not support .csv).

              price(USD)  price(%)     % man       1 shift
2013-04-28    134.210000       NaN       NaN           NaN
2013-05-05    118.842857 -0.015728 -0.015728    121.457143
2013-05-12    113.925714 -0.000890 -0.000890    114.055714
2013-05-19    118.710000  0.008954  0.008954    117.711429
2013-05-26    127.732857  0.013101  0.013101    126.091429
2013-06-02    128.634286 -0.012134 -0.012134    130.232857
2013-06-09    114.727143 -0.027950 -0.027950    117.911429
2013-06-16    103.840000 -0.000162 -0.000162    103.910000

Expected Output

              price(USD)  price(%)     % man       1 shift
2013-04-28    134.210000       NaN       NaN           NaN
2013-05-05    118.842857   -0.1145   -0.1145    134.210000
2013-05-12    113.925714    -0.041    -0.041    118.842857

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

jreback commented 6 years ago

Can you show a minimal example?

bgits commented 6 years ago

@jreback Can you clarify what you would expect as a minimal example? You can match the top 3 lines with the 3 lines in the expected output.

For example, on 2013-05-05, price(%) is -0.015728, but it should be about -0.1145.
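
As a rough check of that figure, the week-over-week change can be recomputed from the three weekly price(USD) values printed in the report. This is only a sketch: the Series below is typed in by hand rather than read from btc_nvt.csv, and the variable names are illustrative.

```python
import pandas as pd

# Weekly mean prices copied by hand from the printed output in the report.
weekly_price = pd.Series(
    [134.210000, 118.842857, 113.925714],
    index=pd.to_datetime(["2013-04-28", "2013-05-05", "2013-05-12"]),
    name="price(USD)",
)

# Change between consecutive weekly rows: (current - previous) / previous.
weekly_change = weekly_price.pct_change()
print(weekly_change.round(4))
```

The second and third rows come out close to the -0.1145 and -0.041 listed under Expected Output.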

bgits commented 6 years ago

Upon closer inspection, this line is causing the deviation: btc_frame = to_weekly(btc_frame). If I move it up to the preprocessing stage, before any calculations are done, the weekly output is as expected. The original output is also correct in its own terms, given that the weekly resample was being applied to daily percentage changes.

This seems like intended behavior of pandas and just a mistake on my part. If this is indeed the expected behavior of pandas then we can close this issue.

jreback commented 6 years ago

pct_change knows nothing about freq, so you should resample first.
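
A minimal sketch of that ordering, using made-up daily prices (not the values from btc_nvt.csv): resample to weekly means first, then call pct_change on the weekly series. The order used in the report (daily pct_change, then weekly mean) is computed alongside for comparison. All names here are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices for three full weeks (Mon 2013-04-29 to Sun 2013-05-19),
# rising by 2.0 per day from 100.0 to 140.0.
idx = pd.date_range("2013-04-29", periods=21, freq="D")
daily = pd.Series(np.linspace(100.0, 140.0, 21), index=idx, name="price")

# Resample first: weekly means, then the change between consecutive weekly rows.
weekly = daily.resample("W").mean()
weekly_change = weekly.pct_change()

# Order from the report: daily changes first, then the weekly mean of those changes.
mean_daily_change = daily.pct_change().resample("W").mean()

print(pd.DataFrame({
    "weekly_mean": weekly,
    "pct_change_of_weekly": weekly_change,
    "weekly_mean_of_daily_pct": mean_daily_change,
}))
```

The two derived columns answer different questions: the first is the change from one weekly mean to the next, while the second is the average day-over-day change within each week, which is why the numbers in the report looked too small.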

babrik commented 6 years ago

For the data below:

Date       | Open         | High         | Low          | Close        | Adj Close
2013-01-02 | 19693.300781 | 19756.679688 | 19686.500000 | 19714.240234 | 19714.240234
2013-01-03 | 19771.029297 | 19786.300781 | 19693.289063 | 19764.779297 | 19764.779297
2013-01-04 | 19782.589844 | 19797.439453 | 19679.990234 | 19784.080078 | 19784.080078
2013-01-07 | 19820.560547 | 19856.429688 | 19654.460938 | 19691.419922 | 19691.419922
2013-01-08 | 19681.380859 | 19761.779297 | 19632.589844 | 19742.519531 | 19742.519531

When calling dframe.pct_change(), I get the error below:

TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py in na_op(x, y)
   1008         try:
-> 1009             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
   1010         except TypeError:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py in evaluate(op, op_str, a, b, use_numexpr, eval_kwargs)
    204     if use_numexpr:
--> 205         return _evaluate(op, op_str, a, b, eval_kwargs)
    206     return _evaluate_standard(op, op_str, a, b)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    119     if result is None:
--> 120         result = _evaluate_standard(op, op_str, a, b)
    121

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     64     with np.errstate(all='ignore'):
---> 65         return op(a, b)
     66

TypeError: unsupported operand type(s) for /: 'str' and 'float'
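
The message points to a division between a str and a float, which usually means at least one column holds strings rather than numbers. The thread does not confirm which column is responsible; the sketch below assumes it is Date, left as a regular string column after reading the file. The three-row frame is typed in from the table above and is illustrative only.

```python
import pandas as pd

# Illustrative frame: Date is stored as strings, alongside numeric columns.
dframe = pd.DataFrame({
    "Date": ["2013-01-02", "2013-01-03", "2013-01-04"],
    "Open": [19693.300781, 19771.029297, 19782.589844],
    "Close": [19714.240234, 19764.779297, 19784.080078],
})

# Inspect the dtypes first: any non-numeric column will break the division.
print(dframe.dtypes)

# Assumed fix: parse the dates and move them into the index, so that
# pct_change only sees numeric columns.
numeric = dframe.assign(Date=pd.to_datetime(dframe["Date"])).set_index("Date")
changes = numeric.pct_change()
print(changes)
```

If a price column is the one stored as strings, converting it with pd.to_numeric before calling pct_change would be the analogous step.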