twopirllc / pandas-ta

Technical Analysis Indicators - Pandas TA is an easy to use Python 3 Pandas Extension with 150+ Indicators
https://twopirllc.github.io/pandas-ta/
MIT License

ema, stoch are affected by values outside their area of interest #535

Open xucian opened 2 years ago

xucian commented 2 years ago

Which version are you running? The latest version is on Github. Pip is for major releases. pandas-ta-0.3.14b0 (main) pandas-ta-0.3.65b0 (development)

Do you have TA Lib also installed in your environment? TA_Lib-0.4.24-cp38-cp38-win_amd64.whl

Did you upgrade? Did the upgrade resolve the issue? already at the latest version

Describe the bug EMA600's values are influenced by values that are outside the last 600 observations. Shouldn't each value in an EMAX be obtained from the 'last X observations'? Why do even the values before the last X observations (i.e. those that aren't needed for the EMA's calculation) affect it? The difference isn't much, but if you're working on a 1m timeframe with an EMA that has 256k data points (~6 months of data), the last value in that EMA would probably be very different from the last value of an EMA created from just the most recent 600 points (i.e. for simplicity, one with only 1 valid value -- the last one). Even if the difference weren't much, my initial question holds. I'd like to disable any internal optimizations so that if EMA600 only needs 600 data points, it doesn't care about, and isn't affected by, what's before those 600 data points. How should I go about this without reinventing the wheel, i.e. creating a 600-len array for each individual step and calling ta.ema() multiple times?

To Reproduce

    import numpy as np

    ema_len = 600
    n_samples = 601
    samples = np.linspace(40_000, 50_000, n_samples)

    def create_ema(_samples):
        from pandas import Series
        import pandas as pd
        import pandas_ta as ta
        ser: Series = ta.ema(pd.Series(_samples, dtype=float), length=ema_len)
        return ser.to_numpy()

    def test__pands_ta_pas_outside_ema_window__influences__ema_inside_window():

        ema_full = create_ema(samples)
        ema_without_first = create_ema(samples[1:])

        assert ema_full[-1] == ema_without_first[-1]

Gives:

Expected :45008.333333333336 Actual :45008.33333333333
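The two values in this output differ only in the last printed digit, which is about the size of double-precision rounding at this magnitude. As a minimal sketch, assuming the SMA-seeded EMA described later in this thread (the helper `sma_seeded_ema_last` is a hypothetical stand-in written for illustration, not pandas-ta code), the same comparison can be made with a tolerance instead of `==`:

```python
import math


def sma_seeded_ema_last(values, length):
    """Last value of an EMA seeded with the SMA of the first `length` inputs.

    Hypothetical helper for illustration only; not the pandas-ta implementation.
    """
    alpha = 2.0 / (length + 1)
    ema = sum(values[:length]) / length
    for x in values[length:]:
        ema = alpha * x + (1 - alpha) * ema
    return ema


# The two numbers reported by the failing assert.
expected = 45008.333333333336
actual = 45008.33333333333
print(expected == actual)                              # exact comparison
print(math.isclose(expected, actual, rel_tol=1e-12))   # tolerance comparison

# Same shape of data as the reproduction: 601 linearly spaced samples.
n = 601
samples = [40_000 + 10_000 * i / (n - 1) for i in range(n)]
full = sma_seeded_ema_last(samples, 600)
without_first = sma_seeded_ema_last(samples[1:], 600)
print(math.isclose(full, without_first, rel_tol=1e-9))
```

On this linearly spaced input, the SMA of samples 1..600 and one EMA step taken from the SMA of samples 0..599 are algebraically equal, so a mismatch of this size comes from rounding. On non-linear data the seed can change the result by more than rounding.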

Expected behavior EMAs values should not be affected by input values other than those mathematically required to calculate it

Additional context If we think about any indicator that requires 'the last x observations', it should be implemented this way: [Python-like pseudocode]

indicator_values = []
vals_so_far = []
# Example: For EMA200 INDICATOR_MIN_REQUIRED_INPUTS is 200
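# Walk every input from newest to oldest; the window only fills up after
# INDICATOR_MIN_REQUIRED_INPUTS steps, so the loop must cover all inputs.
for i in range(N_INPUTS):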
    cur_input = INPUTS[N_INPUTS-i-1]
    vals_so_far.insert(0, cur_input)
    if len(vals_so_far) == INDICATOR_MIN_REQUIRED_INPUTS:
        ind_val = calculate_indicator(vals_so_far)
        indicator_values.insert(0, ind_val)

        # Remove last
        vals_so_far = vals_so_far[:-1]

# If you want to pad with NaNs so that the output is of the same size as the input (this is how EMA already works)
if len(indicator_values) < N_INPUTS:
    indicator_values = [np.nan] * (N_INPUTS - len(indicator_values)) + indicator_values

In a nutshell, we move from end to start, and get a mathematically accurate indicator value at each data point. I assume pandas goes from start to end or uses some arithmetic approximations, and this is happening regardless of whether I pass talib=False or True
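The windowed idea in the pseudocode above can be sketched in plain Python. This is an illustration under stated assumptions, not how pandas-ta or TA Lib compute an EMA: the names `ema_of_window` and `windowed_ema` are hypothetical, and seeding each window with its first value is just one of several possible conventions.

```python
def ema_of_window(window, length):
    """EMA over one window, seeded with the window's first value (an assumption)."""
    alpha = 2.0 / (length + 1)
    value = window[0]
    for x in window[1:]:
        value = alpha * x + (1 - alpha) * value
    return value


def windowed_ema(values, length):
    """Recompute the EMA from scratch on exactly the last `length` inputs at each step.

    Positions without a full window are left as NaN so the output has the
    same size as the input.
    """
    out = [float("nan")] * len(values)
    for end in range(length, len(values) + 1):
        out[end - 1] = ema_of_window(values[end - length:end], length)
    return out


data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
full = windowed_ema(data, 5)
without_first = windowed_ema(data[1:], 5)
print(full[-1] == without_first[-1])
```

Because every output depends only on its own window, dropping earlier inputs cannot change later values. The cost is one full pass over the window per output, i.e. roughly `len(values) * length` operations instead of a single recursive pass.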

twopirllc commented 2 years ago

Hello @tfgstudios,

Pandas TA is largely a Python implementation of TA Lib (plus a few TradingView indicators), and thus TA Lib's behavior is the default mode for this Open Source implementation.


For the sake of brevity, I am only addressing ema.

This bug/feature sounds remarkably similar to Issue #420. See also TA Lib and its Unstable Period, as well as the code and documentation of TA Lib's EMA, for more details. I chose to implement the TA_MA_CLASSIC computation.

Pandas TA currently has three options for an ema; see help(ta.ema). You have tried two of them (1 & 2).

  1. With TA Lib installed, it defaults to TA Lib.
  2. If talib=False or TA Lib is not installed, it yields the same result as 1., as intended and you noted.
  3. When talib=False and presma=False are arguments. (All three in the code and charts below).
    • This attempts to address:
    • Expected behavior EMAs values should not be affected by input values other than those mathematically required to calculate it

  4. If none of those suit your purposes, you are encouraged to include your own and submit it (preferably in numpy/numba). 😎 There are many implementations of "ema" beyond TA Lib's EMA. Unfortunately, I have not had enough time nor support to address them all.


import numpy as np
import pandas as pd
import pandas_ta as ta

_df = ta.df.ta.ticker("AA", timed=True)["Close"]

def bad_ema(src=None, length=None, n_samples=601):
    if src is None:
        src = pd.Series(np.linspace(40_000, 50_000, n_samples), dtype=float)

    tal_ema = ta.ema(src, length=length, talib=True)
    pta_presma_ema = ta.ema(src, length=length, talib=False)
    pta_ema = ta.ema(src, length=length, talib=False, presma=False)

    return pd.DataFrame({
        "close": src,
        f"tal_ema{length}": tal_ema,
        f"pta_presma_ema{length}": pta_presma_ema,
        f"pta_ema{length}": pta_ema
    })

def cplot(df, last=None):
    if isinstance(last, int):
        df = df.iloc[-last:,:]
    print(df.shape)
    df.plot(figsize=(16,6), color=["black", "red", "orange", "green"] , grid=True)


ma_length = 10
n_samples = 50
closedf = _df.iloc[:n_samples].copy()

df = bad_ema(closedf, length=ma_length, n_samples=n_samples)
cplot(df, last=None)
df.tail()
[Screenshot: chart of close (black), tal_ema10 (red), pta_presma_ema10 (orange) and pta_ema10 (green)]

It is clearly evident that pta_ema10 (green line), in this case, adheres to the "EMAs values should not be affected by input values other than those mathematically required to calculate it" expectation you desire. Which by one definition of ema relies on a minimum of two values.

Kind Regards, KJ

xucian commented 2 years ago

Hi @twopirllc Thanks for the detailed answer!

I'm not sure I understand this part: 'It is clearly evident that pta_ema10 (green line), in this case, adheres to'

  1. Why is it clear?
  2. Why does the green line start from the very beginning while the orange one (and the red one that's supposedly beneath it) starts later? Any values at positions 0 through n-2 should be NaNs for EmaN (since it's not possible to calculate them). This is not important for me, as I always strip the first n-2 values, but it might be a problem performance-wise. I presume it's about your ending statement 'Which by one definition of ema relies on a minimum of two values', but it doesn't click for me.
  3. I tried the presma arg in my code using the latest release and it doesn't look like it works, i.e. the last values of 2 emas that start at different positions are different (EmaN where: the first starts at i and has length N+1, the second starts at i+1 and has length N. For reference, N is 150 here). (*)
  4. I also used unstable_period with 'ALL' and 'EMA' (just to make sure), and I don't get the desired result in any of the trials:
    unstable_period None,       talib True,     presma True:    0 out of 1 are equal
    unstable_period 0,      talib True,     presma True:    0 out of 1 are equal
    unstable_period 5,      talib True,     presma True:    0 out of 1 are equal
    unstable_period 34,         talib True,     presma True:    0 out of 1 are equal
    unstable_period 64,         talib True,     presma True:    0 out of 1 are equal
    unstable_period 65,         talib True,     presma True:    0 out of 1 are equal
    unstable_period 66,         talib True,     presma True:    0 out of 1 are equal
    unstable_period 100,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 149,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 150,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 151,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 200,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 300,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 600,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 700,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 900,        talib True,     presma True:    0 out of 1 are equal
    unstable_period 1500,       talib True,     presma True:    0 out of 1 are equal
    unstable_period 2000,       talib True,     presma True:    0 out of 1 are equal
    unstable_period None,       talib True,     presma False:   0 out of 1 are equal
    unstable_period 0,      talib True,     presma False:   0 out of 1 are equal
    unstable_period 5,      talib True,     presma False:   0 out of 1 are equal
    unstable_period 34,         talib True,     presma False:   0 out of 1 are equal
    unstable_period 64,         talib True,     presma False:   0 out of 1 are equal
    unstable_period 65,         talib True,     presma False:   0 out of 1 are equal
    unstable_period 66,         talib True,     presma False:   0 out of 1 are equal
    unstable_period 100,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 149,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 150,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 151,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 200,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 300,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 600,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 700,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 900,        talib True,     presma False:   0 out of 1 are equal
    unstable_period 1500,       talib True,     presma False:   0 out of 1 are equal
    unstable_period 2000,       talib True,     presma False:   0 out of 1 are equal
    unstable_period None,       talib False,    presma True:    0 out of 1 are equal
    unstable_period 0,      talib False,    presma True:    0 out of 1 are equal
    unstable_period 5,      talib False,    presma True:    0 out of 1 are equal
    unstable_period 34,         talib False,    presma True:    0 out of 1 are equal
    unstable_period 64,         talib False,    presma True:    0 out of 1 are equal
    unstable_period 65,         talib False,    presma True:    0 out of 1 are equal
    unstable_period 66,         talib False,    presma True:    0 out of 1 are equal
    unstable_period 100,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 149,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 150,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 151,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 200,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 300,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 600,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 700,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 900,        talib False,    presma True:    0 out of 1 are equal
    unstable_period 1500,       talib False,    presma True:    0 out of 1 are equal
    unstable_period 2000,       talib False,    presma True:    0 out of 1 are equal
    unstable_period None,       talib False,    presma False:   0 out of 1 are equal
    unstable_period 0,      talib False,    presma False:   0 out of 1 are equal
    unstable_period 5,      talib False,    presma False:   0 out of 1 are equal
    unstable_period 34,         talib False,    presma False:   0 out of 1 are equal
    unstable_period 64,         talib False,    presma False:   0 out of 1 are equal
    unstable_period 65,         talib False,    presma False:   0 out of 1 are equal
    unstable_period 66,         talib False,    presma False:   0 out of 1 are equal
    unstable_period 100,        talib False,    presma False:   0 out of 1 are equal
    unstable_period 149,        talib False,    presma False:   0 out of 1 are equal
    unstable_period 150,        talib False,    presma False:   0 out of 1 are equal
    unstable_period 151,        talib False,    presma False:   0 out of 1 are equal
    unstable_period 200,        talib False,    presma False:   0 out of 1 are equal
    unstable_period 300,        talib False,    presma False:   0 out of 1 are equal
    unstable_period 600,        talib False,    presma False:   0 out of 1 are equal

Update: I now see that presma is present in the development branch. I'm quite reluctant to use that branch. It can have more stability issues, right? I also remember trying to switch to it in the past, but it returned differently dimensioned arrays for some indicators (IIRC, the stoch returned 3 arrays instead of 2, and my code logic relies heavily on it returning 2. I can just ignore the additional array, but it makes me wonder if there won't be other subtle but important breaking changes)

And lastly, about presma: if this would fix the EMA, are there options to fix NATR, STOCH and VWAP as well? Or, at least for STOCH? (I can't reproduce the difference for NATR and VWAP anymore -- maybe it was an error on my part) If there's a solution for stoch, I'd prefer it to not alter the current values the stoch produces in a significant way -- I've already trained a few hundred parameters based on it and can't afford retraining the models.

Thanks again!

Update2: I tried the dev branch and this test doesn't fail anymore (but notice presma=True, talib=False, otherwise it fails with 45008.333333333336 != 45008.33333333333):

import numpy as np

def test_indicators_are_not_affected_by_values_outside_their_area_of_interest():
    ema_len = 600
    n_samples = 601
    samples = np.linspace(40_000, 50_000, n_samples)

    def create_ema(_samples):
        from pandas import Series
        import pandas as pd
        import pandas_ta as ta
        ser: Series = ta.ema(pd.Series(_samples, dtype=float), length=ema_len, presma=True, talib=False)
        return ser.to_numpy()

    def _test__pands_ta__values_outside_ema_window__does_not_influence__ema_inside_window():
        ema_full = create_ema(samples)
        ema_without_first = create_ema(samples[1:])

        assert ema_full[-1] == ema_without_first[-1]

    _test__pands_ta__values_outside_ema_window__does_not_influence__ema_inside_window()

But my other tests still fail in some cases (described by (*) above).

twopirllc commented 2 years ago

@tfgstudios,

I now see that presma is present in the development branch. I'm quite reluctant to use that branch. It can have more stability issues, right?

Regarding ema, it made more sense to rename the argument sma to presma from v0.3.14 to development.

The development branch is as stable as its predecessor, v0.3.14, and better. Whether you decide to use the development branch or not, I will not be supporting v0.3.14, as it will get replaced by a future version of the development branch after completing the TODO's Hilbert Transform Indicators, ht_*, under remaining Indicators.

I also remember trying to switch to it in the past, but it returned differently dimensioned arrays for some indicators (IIRC, the stoch returned 3 arrays instead of 2, and my code logic relies heavily on it returning 2. I can just ignore the additional array, but it makes me wonder if there won't be other subtle but important breaking changes)

This library is more feature rich than some other TA libraries out there, and thus some indicators, like stoch, will have more details/columns included with the result. It is up to the user to drop or exclude extra columns that have no value to them. Others fork the repo and make adjustments.

If there's a solution for stoch, I'd prefer it to not alter the current values the stoch produces in a significant way -- I've already trained a few hundred parameters based on it and can't afford retraining the models.

The next time I will be touching it is when I convert it to numpy/numba. At the current rate, it won't be anytime soon.

But my other tests still fail in some cases (described by (*) above).

There are several other TA libraries out there. Have you tried them? I am curious if they have solved "indicator(s) are affected by values outside their area of interest"? 🤔

KJ


Addendum

Expected behavior EMAs values should not be affected by input values other than those mathematically required to calculate it

This is what I hear:

[Screenshot]

As shown above, setting talib=False, presma=False only uses two consecutive values, as detailed here.

This calculation is done by Pandas' ewm.

def ema(*args, **kwargs):
    # ...
    close.ewm(span=length, adjust=adjust).mean()  # where adjust=False
    # ...
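For reference, pandas documents `ewm(span=..., adjust=False).mean()` as the recursion y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t], with alpha = 2 / (span + 1). The sketch below reproduces that recursion in plain Python; the function names `ewm_mean_unadjusted` and `weight_older_than` are hypothetical and chosen only for this example. Each step combines two numbers, the previous EMA and the newest input, while the previous EMA itself summarizes all earlier inputs with geometrically decaying weights.

```python
def ewm_mean_unadjusted(values, span):
    """Recursion y[t] = (1 - alpha) * y[t-1] + alpha * x[t], seeded with y[0] = x[0]."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for x in values[1:]:
        out.append((1 - alpha) * out[-1] + alpha * x)
    return out


def weight_older_than(span, n_recent):
    """Total weight the recursion leaves on inputs older than the last `n_recent`.

    The newest input has weight alpha, the one before it alpha * (1 - alpha),
    and so on, so the last n_recent inputs sum to 1 - (1 - alpha) ** n_recent.
    """
    alpha = 2.0 / (span + 1)
    return (1 - alpha) ** n_recent


print(ewm_mean_unadjusted([1.0, 2.0], span=3))
print(weight_older_than(span=600, n_recent=600))
```

Under this recursion, with span=600 the inputs older than the most recent 600 still carry a share of the weight close to exp(-2), which is why trimming earlier history changes later values unless each window is recomputed from scratch.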