twopirllc / pandas-ta

Technical Analysis Indicators - Pandas TA is an easy to use Python 3 Pandas Extension with 150+ Indicators
https://twopirllc.github.io/pandas-ta/
MIT License

ATR and PPO results are sensitive to dataframe length #420

Open manujchandra opened 2 years ago

manujchandra commented 2 years ago

Which version are you running? The latest version is on Github. Pip is for major releases.
0.3.14b0

Upgrade.
Using the latest.

Describe the bug
The indicator values depend on the length of the dataframe.

Suppose you are calculating ATR(20). Ideally, this should only require a dataframe of roughly 21 rows. But with anywhere from 21 to 145 rows, the last value comes out incorrect. Through experimentation I have found that a dataframe of at least 145 rows is needed for ATR(20) to give approximately correct results.

I have tried this on PPO as well, with similar results. PPO uses lengths of 12 and 26 in its calculation, yet 145 rows are needed to get an approximately correct answer.

SMA seems to be working fine: a 20-period SMA gives the correct result with a 20-row dataframe.

I have tried my custom indicators on a 50-row dataframe and got correct results.

To Reproduce
A working notebook with data is supplied that demonstrates this problem: atr.zip

Expected behavior
If an indicator needs x values of data to compute its next result, only x+1 rows should be required.
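The reported behavior can be reproduced without pandas-ta at all. Below is a minimal sketch (synthetic data and a plain reference implementation of Wilder's RMA, the smoothing ATR uses by default — not pandas-ta's actual code) showing that a recursive filter's last value depends on how much history precedes it:

```python
import numpy as np

def rma(values, length):
    """Reference Wilder smoothing (RMA): seeded with the first value,
    then updated recursively with alpha = 1/length."""
    alpha = 1.0 / length
    out = np.empty(len(values))
    out[0] = values[0]
    for i in range(1, len(values)):
        out[i] = alpha * values[i] + (1.0 - alpha) * out[i - 1]
    return out

rng = np.random.default_rng(42)
tr = rng.uniform(0.5, 2.0, 1000)  # stand-in for a true-range series

# Same final bar, different amounts of history: the last value differs.
last_full = rma(tr, 20)[-1]
last_145 = rma(tr[-145:], 20)[-1]
last_21 = rma(tr[-21:], 20)[-1]
print(last_full, last_145, last_21)
```

With only 21 rows the seed still dominates the last value, while after 145 rows its weight, (1 - 1/20)^144 ≈ 0.0006, is negligible — consistent with the ~145-row threshold observed above.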

twopirllc commented 2 years ago

Hello @manujchandra,

Thanks for sharing your notebook and data. I will take a look as soon as I can.

Kind Regards, KJ

manujchandra commented 2 years ago

Hi,

Just to clarify, it's not just ATR; PPO is also displaying the same behavior.

SMA is working fine.

Apart from these three, I have not tested other indicators.

Regards,

asjiLab commented 2 years ago

I think I am having the same issue with the SuperTrend indicator, which is based on ATR as well.

Yury-MonZon commented 2 years ago

Same here. Please help. ATR, SuperTrend and EMA all change their last values depending on the amount of data supplied.

Yury-MonZon commented 2 years ago

Here is what I've tried and my results (latest version from git): EMA:

ema_df = pd.DataFrame() # erase old stuff
ema_df['EMA_200'] = ta.ema(ticker_df['price'].tail(201), length=200)
print(f"{ema_df['EMA_200'].iloc[-1]=}")

ema_df = pd.DataFrame() # erase old stuff
ema_df['EMA_200'] = ta.ema(ticker_df['price'].tail(500), length=200)
print(f"{ema_df['EMA_200'].iloc[-1]=}")

ema_df = pd.DataFrame() # erase old stuff
ema_df['EMA_200'] = ta.ema(ticker_df['price'].tail(1000), length=200)
print(f"{ema_df['EMA_200'].iloc[-1]=}")

EMA results:

ema_df['EMA_200'].iloc[-1]=163.81475995024877
ema_df['EMA_200'].iloc[-1]=165.91070539427534
ema_df['EMA_200'].iloc[-1]=165.7972346903986

ATR:

atr_df = pd.DataFrame() # erase old stuff
atr_df['ATRr_10'] = ta.atr(high=ticker_df['high'].tail(1000), low=ticker_df['low'].tail(1000), close=ticker_df['price'].tail(1000), length=10, mamode='RMA')
print(f"{atr_df['ATRr_10'].iloc[-1]=}")

atr_df = pd.DataFrame() # erase old stuff
atr_df['ATRr_10'] = ta.atr(high=ticker_df['high'].tail(500), low=ticker_df['low'].tail(500), close=ticker_df['price'].tail(500), length=10, mamode='RMA')
print(f"{atr_df['ATRr_10'].iloc[-1]=}")

atr_df = pd.DataFrame() # erase old stuff
atr_df['ATRr_10'] = ta.atr(high=ticker_df['high'].tail(50), low=ticker_df['low'].tail(50), close=ticker_df['price'].tail(50), length=10, mamode='RMA')
print(f"{atr_df['ATRr_10'].iloc[-1]=}")

ATR results:

atr_df['ATRr_10'].iloc[-1]=1.9086574262169047
atr_df['ATRr_10'].iloc[-1]=1.9086574262169047
atr_df['ATRr_10'].iloc[-1]=1.9111651869095314

SuperTrend:

st_df = pd.DataFrame()
st_df = ta.supertrend(high=ticker_df['high'].tail(1000), low=ticker_df['low'].tail(1000), close=ticker_df['price'].tail(1000), length=10, multiplier=3)
print(f"{st_df['SUPERT_10_3.0'].iloc[-1]=}")    

st_df = pd.DataFrame()
st_df = ta.supertrend(high=ticker_df['high'].tail(500), low=ticker_df['low'].tail(500), close=ticker_df['price'].tail(500), length=10, multiplier=3)
print(f"{st_df['SUPERT_10_3.0'].iloc[-1]=}")    

st_df = pd.DataFrame()
st_df = ta.supertrend(high=ticker_df['high'].tail(50), low=ticker_df['low'].tail(50), close=ticker_df['price'].tail(50), length=10, multiplier=3)
print(f"{st_df['SUPERT_10_3.0'].iloc[-1]=}")

SuperTrend results:

st_df['SUPERT_10_3.0'].iloc[-1]=156.8781756526552
st_df['SUPERT_10_3.0'].iloc[-1]=156.8781756526552   
st_df['SUPERT_10_3.0'].iloc[-1]=156.88717150861518

rengel8 commented 2 years ago

This issue is very interesting. It also covers the question of how much pre-roll is needed for a given length to allow reproducible results with certain indicators, or with a combination/chain of them.

Only a few indicators use a simple sliding window, which, as I understand it, must be the reason the SMA is fine in this regard.

Many others are recursive to some extent. So I would say there are two important things about an indicator:

  1. it should be implemented and optimised for

    • correct functioning (quality of results, compared to a reference)
    • speed (performance, while keeping the quality)
  2. its pre-roll, which depends on the length and other parameters, needed to create reproducible results
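The sliding-window vs. recursive distinction can be checked directly. A minimal sketch with simple reference implementations (synthetic data, not pandas-ta's code): the last SMA value uses only the last n samples, so extra history is irrelevant, while in an EMA every past sample still carries weight.

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.normal(100, 5, 2000)

def sma_last(x, n):
    # Sliding window: the last value depends only on the last n samples.
    return x[-n:].mean()

def ema_last(x, n):
    # Recursive filter: a sample k bars back still has weight ~ (1 - alpha)**k.
    alpha = 2.0 / (n + 1)
    e = x[0]
    for v in x[1:]:
        e = alpha * v + (1.0 - alpha) * e
    return e

n = 20
print(sma_last(prices, n) == sma_last(prices[-50:], n))  # identical
print(ema_last(prices, n) == ema_last(prices[-50:], n))  # not identical
```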

I tried these settings (based on @Yury-MonZon's) with pandas_ta and TA-Lib, both of which behave in exactly the same way.

ema_df = pd.DataFrame() # erase old stuff
ema_df['EMA_200'] = ta.ema(df['close'].tail(201), length=200)
print(f"pre-roll: 201, {ema_df['EMA_200'].iloc[-1]=}")

ema_df = pd.DataFrame() # erase old stuff
ema_df['EMA_200'] = ta.ema(df['close'].tail(460), length=200)
print(f"pre-roll: 460, {ema_df['EMA_200'].iloc[-1]=}")

ema_df = pd.DataFrame() # erase old stuff
ema_df['EMA_200'] = ta.ema(df['close'].tail(461), length=200)
print(f"pre-roll: 461, {ema_df['EMA_200'].iloc[-1]=}")

ema_df = pd.DataFrame() # erase old stuff
ema_df['EMA_200'] = ta.ema(df['close'].tail(500), length=200)
print(f"pre-roll: 500, {ema_df['EMA_200'].iloc[-1]=}")

ema_df = pd.DataFrame() # erase old stuff
ema_df['EMA_200'] = ta.ema(df['close'].tail(1000), length=200)
print(f"pre-roll: 1000, {ema_df['EMA_200'].iloc[-1]=}")

So the pre-roll for an EMA 200 would be 461 bars/candles of data (see below).

pre-roll: 201, ema_df['EMA_200'].iloc[-1]=56838.89323333332
pre-roll: 460, ema_df['EMA_200'].iloc[-1]=54903.80816102022
pre-roll: 461, ema_df['EMA_200'].iloc[-1]=54906.84612077933
pre-roll: 500, ema_df['EMA_200'].iloc[-1]=54906.84612077933
pre-roll: 1000, ema_df['EMA_200'].iloc[-1]=54906.84612077933
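The ~461 figure has a simple back-of-the-envelope explanation: in an EMA, the combined weight of everything older than n bars is (1 - alpha)^n, so "enough pre-roll" just means that residual weight falls below some tolerance. A hedged sketch (the 1% tolerance is an arbitrary choice, not anything pandas-ta defines):

```python
import math

def preroll(length, tol=1e-2):
    """Bars after which the combined weight of all older samples in an
    EMA(length) drops below tol: the smallest n with (1 - alpha)**n < tol."""
    alpha = 2.0 / (length + 1)
    return math.ceil(math.log(tol) / math.log(1.0 - alpha))

print(preroll(200))        # -> 461 with a 1% tolerance
print(preroll(200, 1e-4))  # a tighter tolerance needs roughly twice as many bars
```

That it lands exactly on the observed 461 is a pleasant coincidence of picking 1%; the point is that the required pre-roll scales with the length and with how much residual error you tolerate.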

In conclusion, I would say the described behaviour is important to keep in mind and closely tied to the implementation itself. I would not consider it a bug or an imperfection, especially looking at some of John Ehlers's indicators, which define variables that start empty and are then read and updated continuously as the calculation rolls along. There, the overall pre-roll is related to the weight of these variables.

Regards, rengel8

Yury-MonZon commented 2 years ago

I see it from a different perspective: if an indicator length is specified, then it should work the same on any amount of pre-roll, with the minimum pre-roll documented. And if the indicator does not have enough data to give correct results, it should raise an exception to warn the coder that they are not using it correctly.
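A guard like the one @Yury-MonZon suggests could be sketched as follows. This is purely hypothetical — `checked_ema` and its `safety` heuristic are made-up names, not pandas-ta API:

```python
import numpy as np

def checked_ema(close, length, safety=3):
    """Hypothetical guard: refuse to compute an EMA when the series is so
    short that the recursive seed still dominates the last value."""
    required = safety * length  # crude heuristic, not an exact bound
    if len(close) < required:
        raise ValueError(
            f"EMA({length}) needs >= {required} rows to converge; got {len(close)}"
        )
    alpha = 2.0 / (length + 1)
    out = np.empty(len(close))
    out[0] = close[0]
    for i in range(1, len(close)):
        out[i] = alpha * close[i] + (1.0 - alpha) * out[i - 1]
    return out

# A 100-row series is rejected for EMA(200) instead of silently
# returning a value that has not converged:
try:
    checked_ema(np.linspace(100.0, 110.0, 100), 200)
except ValueError as e:
    print(e)
```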

twopirllc commented 2 years ago

Hello @manujchandra, @Yury-MonZon, @asjiLab,

I understand what you are describing, as well as the severity of why some indicators are not converging accurately within their lengths or periods as expected. However, none of you have mentioned whether you also have TA Lib installed in your environment, and whether that has an impact on the results for comparison.

I also want to thank those who have provided relevant, detailed examples. But I would appreciate a full list of indicators that behave this way as well.

Unfortunately, I do not have an expected time frame on when I, @rengel8 or anyone else can address this Issue with so many present and future outstanding Bugs, Issues, and Indicator/Feature Requests. To be frank, maintenance for this package has grown beyond the means for one individual and part-time contributors. 😞 I would love to iron out these convergence problems, so I am hoping one or more of you are willing to help contribute to expedite this matter. 😎

KJ

manujchandra commented 2 years ago

Hi @twopirllc

I do not have TA Lib installed.

I have developed a back-tester which uses pandas-ta for some of its indicators. Because I have to pass 150 rows per stock per day, back-testing is significantly slower than it would be sending only, say, 50 rows.

I have not tried all the indicators, as there are many of them, but I believe that if we can fix one, the others should follow. I cannot think of any reason why some indicators would need more data points than others.

Yury-MonZon commented 2 years ago

Hi @twopirllc

Sorry for the lack of information. I've installed TA-Lib 0.4.22 and tried EMA 200 again, with pre-rolls of 201, 500 and 1000. Without TA-Lib:

ema_df['EMA_200'].iloc[-1]=147.74008184079602
ema_df['EMA_200'].iloc[-1]=148.0988116897418
ema_df['EMA_200'].iloc[-1]=148.17018972736756

With TA-Lib:

ema_df['EMA_200'].iloc[-1]=147.7400818407959
ema_df['EMA_200'].iloc[-1]=148.09881168974118
ema_df['EMA_200'].iloc[-1]=148.17018972736673

And I didn't get anything good with 461 samples either, like @rengel8 did.

twopirllc commented 2 years ago

Hello @manujchandra,

I have developed a back-tester which is using pandas-ta for some of the indicators.

Cool! I saw the video you posted.

Because I have to pass 150 rows for each stock for each day, the back-testing is significantly slower, as opposed to sending only say 50 rows.

Are you using df.ta.strategy(your_ta) to run a set of indicators, or standalone calls in the style of TA Lib, e.g. atr = ta.atr(df.high, df.low, df.close)? Let's skip the anecdotal descriptions: how long does it take for 50 rows vs 150 rows?

Use df.ta.strategy(your_ta, cores=0) when running a small set of indicators. This disables the multiprocessing pool, since initializing one is time-consuming as well. For a large number of indicators, set cores to an appropriate value for your system.

but I believe if we can fix one, the others should also get fixed.

I agree... but I want to know all that should be fixed so it can be triaged accordingly.

I cannot think of any reason why some indicators need more data points than others.

Nor can I. Since TA is simply quantitative feature generation, I would appreciate someone with your DS expertise to look at the source to help iron out what I did wrong and how to fix it. More than one or two sets of eyes definitely helps.

KJ

rengel8 commented 2 years ago

pandas-ta = 0.3.36b0, TA-Lib = 0.4.19

EDIT: I used a script with a data frame trimmed to only 461 candles, which is of course the reason for the identical pre-roll behaviour in my earlier results. I noticed this while checking correlation and "pre-roll" for ATR.

I do not have enough time these days to look into this in more detail, but I'll try. Sorry for the misleading results.

pre-roll: 201, ema_df['EMA_200'].iloc[-1]=53157.9684
pre-roll: 460, ema_df['EMA_200'].iloc[-1]=52991.141117025014
pre-roll: 461, ema_df['EMA_200'].iloc[-1]=52990.44503030303
pre-roll: 3163, ema_df['EMA_200'].iloc[-1]=52983.53322663892
pre-roll: 3164, ema_df['EMA_200'].iloc[-1]=52983.53322663891
pre-roll: 3250, ema_df['EMA_200'].iloc[-1]=52983.53322663891
pre-roll: 3300, ema_df['EMA_200'].iloc[-1]=52983.53322663891
pre-roll: 3500, ema_df['EMA_200'].iloc[-1]=52983.53322663891
pre-roll: 4000, ema_df['EMA_200'].iloc[-1]=52983.53322663891
pre-roll: 5000, ema_df['EMA_200'].iloc[-1]=52983.53322663891
 - 10x pandas-ta EMA, took (s) 0.06

pre-roll (talib): 200, ema_df['EMA_200'].iloc[-1]=53157.968400000005
pre-roll (talib): 460, ema_df['EMA_200'].iloc[-1]=52991.14111702479
pre-roll (talib): 461, ema_df['EMA_200'].iloc[-1]=52990.44503030279
pre-roll (talib): 3163, ema_df['EMA_200'].iloc[-1]=52983.53322663868
pre-roll (talib): 3164, ema_df['EMA_200'].iloc[-1]=52983.53322663868
pre-roll (talib): 3250, ema_df['EMA_200'].iloc[-1]=52983.533226638676
pre-roll (talib): 3300, ema_df['EMA_200'].iloc[-1]=52983.53322663868
pre-roll (talib): 3500, ema_df['EMA_200'].iloc[-1]=52983.53322663868
pre-roll (talib): 4000, ema_df['EMA_200'].iloc[-1]=52983.53322663868
pre-roll (talib): 5000, ema_df['EMA_200'].iloc[-1]=52983.53322663868
 - 10x talib EMA, took (s) 0.026

pre-roll (tulipy): 200, ema_df['EMA_200'].iloc[-1]=53279.61434027325
pre-roll (tulipy): 460, ema_df['EMA_200'].iloc[-1]=52980.69225942165
pre-roll (tulipy): 461, ema_df['EMA_200'].iloc[-1]=52981.757512137774
pre-roll (tulipy): 3163, ema_df['EMA_200'].iloc[-1]=52983.533226638676
pre-roll (tulipy): 3164, ema_df['EMA_200'].iloc[-1]=52983.533226638676
pre-roll (tulipy): 3250, ema_df['EMA_200'].iloc[-1]=52983.533226638676
pre-roll (tulipy): 3300, ema_df['EMA_200'].iloc[-1]=52983.53322663868
pre-roll (tulipy): 3500, ema_df['EMA_200'].iloc[-1]=52983.53322663868
pre-roll (tulipy): 4000, ema_df['EMA_200'].iloc[-1]=52983.53322663868
pre-roll (tulipy): 5000, ema_df['EMA_200'].iloc[-1]=52983.53322663868
 - 10x tulipy EMA, took (s) 0.03

I am also interested in investigating this further, although I'm fairly convinced it is caused by recursion, which will be even more severe for adaptive-interval indicators. Typically an indicator has a warm-up phase to fill its default (zeroed) buffers, and then works in stream mode, where only one new candle is added and the last interval is recalculated with all the previously filled buffers. Working sequentially instead means the buffers have to be refilled to some extent, depending on the indicator, on each run. For an EMA of length 200, this seems to be somewhere beyond 3000 bars depending on the implementation, which is anything but ideal.

twopirllc commented 2 years ago

@manujchandra, @Yury-MonZon, @asjiLab, @rengel8,

While helping a user of a different Python application, a similar issue was noted, since that application also uses TA Lib for some indicators. Apparently this is a known issue, discussed in TA Lib's documentation on some indicators' Unstable Period. Please read it when you get a chance.

KJ

teddywaweru commented 2 years ago

Hello all. This is in response to @manujchandra, who had an issue with the ATR.

Apparently the standard ATR indicator on MT4 uses the SMA in its calculations, but the MA used in atr.py defaults to 'rma' if no mamode is specified. I've tested this, and SMA does come out more accurate than RMA (for this specific indicator). Maybe this could show up as the 'sensitivity' you're referring to?

Using df.ta.atr(length=14, mamode='sma'), I can use close to the minimum number of data points for length=14, with a margin of ~5 data points, and get values similar to the MT4 platform indicator. Note that in this case I'm taking the values from the MT4 platform to be the true values expected from the indicator. I can't speak to the issue of different values depending on the period selected.

@twopirllc, could this change to atr.py be pushed?
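This observation matches the window-vs-recursion behaviour discussed earlier in the thread: with SMA smoothing the last ATR value depends only on the last length+1 bars, while Wilder's RMA keeps a memory of the whole series. A quick check with reference implementations (synthetic OHLC data, not atr.py itself):

```python
import numpy as np

def true_range(high, low, close):
    prev_close = np.concatenate(([close[0]], close[:-1]))
    return np.maximum(high - low,
                      np.maximum(np.abs(high - prev_close),
                                 np.abs(low - prev_close)))

def atr_sma(high, low, close, n):
    return true_range(high, low, close)[-n:].mean()  # plain window

def atr_rma(high, low, close, n):
    tr = true_range(high, low, close)
    alpha, out = 1.0 / n, tr[0]
    for v in tr[1:]:
        out = alpha * v + (1.0 - alpha) * out
    return out

rng = np.random.default_rng(7)
close = 100 + np.cumsum(rng.normal(0, 1, 500))
high, low = close + 1.0, close - 1.0
n, k = 14, 30  # compare the full history against only the last k rows

# SMA-smoothed ATR: exactly the same with or without the extra history.
print(atr_sma(high, low, close, n) - atr_sma(high[-k:], low[-k:], close[-k:], n))
# RMA-smoothed ATR: differs, because the recursion remembers earlier bars.
print(atr_rma(high, low, close, n) - atr_rma(high[-k:], low[-k:], close[-k:], n))
```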

twopirllc commented 2 years ago

Hello @teddywaweru,

Thanks for the research and feedback regarding this issue. Did you also read the details covered in my last comment regarding the Unstable Period? If I understood it correctly, it is a known and documented side effect, especially when indicators are compositions of, or chained from, other indicators.

could the change on the atr.py be pushed?

In the ATR comment below, TA Lib defaults to Wilder's method, aka rma. Since some users prefer different MAs, I included the mamode argument so users can change the smoothing according to their own rationale. To keep this library as consistent as possible with TA Lib, I will not be changing the default mamode. However, since it is Open Source, you are more than welcome to edit your local copy to suit your purposes. 😎


ATR Comment

/* Average True Range is the greatest of the following: 
 *
 *  val1 = distance from today's high to today's low.
 *  val2 = distance from yesterday's close to today's high.
 *  val3 = distance from yesterday's close to today's low.   
 *
 * These value are averaged for the specified period using
 * Wilder method. This method have an unstable period comparable
 * to and Exponential Moving Average (EMA).
*/

Kind Regards, KJ

teddywaweru commented 2 years ago

I did read the article, and the author references the EMA having this particular issue, which I assume is due to how it is calculated. Chained indicators may not have this issue if their calculations don't have the "function with memory" property (which would explain why the SMA works for me). I understand about keeping the MA consistent; I'll stick to declaring mamode where it suits. Cheers 👍 and thanks for the response.