pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.82k stars 17.65k forks source link

BUG: Inconsistent resampling behaviour #35335

Closed LazyTrader closed 4 years ago

LazyTrader commented 4 years ago

Attached below is my code for extracting processing financial tick data from the MetaTrader5 application:

def CollectTicks(Year, Month, Day, Hour, Minute, CurrencyPair, Digits, ResampleOffset = None):

    StartTime = datetime.datetime(Year, Month, Day, Hour, Minute).timestamp() #Convert the start time to epoch time

    EndTime = StartTime + 604800 #Add 1 week to the start time and set the result as the end time

    Ticks = mt5.copy_ticks_range(CurrencyPair, StartTime, EndTime, mt5.COPY_TICKS_ALL) #Retrieve the ticks from the start time until the end time

    BreakFlag = False #Set the break flag to false

    while(True):

        StartTime = EndTime #Set the prevous end time as the new start time

        EndTime = StartTime + 604800 #Add 1 week to the new start time and set the result as the new end time

        if(EndTime > time.time()): #If the end time is more than the current time

            EndTime = time.time() #Set the end time as the current time

            BreakFlag = True #Set the break flag to true

        NewTicks = mt5.copy_ticks_range(CurrencyPair, StartTime, EndTime, mt5.COPY_TICKS_ALL) #Retrive the new ticks with the new start time and new end time

        Ticks = np.hstack((Ticks, NewTicks)) #Cocatenate the newly retrieved ticks with the previously retrieved ticks

        if(BreakFlag == True): #If the break flag is set to true

            break #Exit the loop

    TickDataframe = pd.DataFrame(Ticks) #Covert the ticks array to a dataframe

    TickDataframe.drop(["time", "last", "volume", "flags", "volume_real"], axis = 1, inplace = True) #Drop the unwanted columns from the dataframe

    TickDataframe["time_msc"] = pd.to_datetime(TickDataframe["time_msc"], unit = "ms") #Convert the milisecond timestap to the epoch date time

    TickDataframe.set_index("time_msc", inplace = True) #Set the milisecond time column as the dataframe's index

    TickDataframe.round(Digits) #Round all values in the dataframe to the desired number of digits

    TickDataframe["mid"] = round(((TickDataframe["bid"] + TickDataframe["ask"]) / 2.0), Digits) #Generate the mid dataframe and round it to (Digits) figures

    if(ResampleOffset != None): #If the ResampleOffset input is not None

        CandleCloses = {} #Instantiate a dictionary

        CandleCloses["Bid Closes"] = TickDataframe["bid"].resample(ResampleOffset).ohlc().close.dropna().to_numpy() #Resample the bid ticks to create bid candles based on the resample offset and convert it into a numpy array

        CandleCloses["Ask Closes"] = TickDataframe["ask"].resample(ResampleOffset).ohlc().close.dropna().to_numpy() #Resample the ask ticks to create ask candles based on the resample offset and convert it into a numpy array

        CandleCloses["Mid Closes"] = TickDataframe["mid"].resample(ResampleOffset).ohlc().close.dropna().to_numpy() #Resample the mid ticks to create mid candles based on the resample offset and convert it into a numpy array

        return CandleCloses #Return the dictionary containing candle closes

    else: #If the ResampleOffset input is None

        return TickDataframe #Return the dataframe of ticks

Problem description

After resampling the collected tick data with the "1Min" offset, the resampled data for 2 different instruments have a different number of candles even when their time indexes cover the same date range.

Additional Observations

The other offsets do not have this inconsistent behaviour, the issue only happens with the "1Min" resample offset.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.8.3.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : en LOCALE : English_Singapore.1252 pandas : 1.0.5 numpy : 1.18.5 pytz : 2020.1 dateutil : 2.8.1 pip : 20.1.1 setuptools : 49.2.0.post20200714 Cython : None pytest : None hypothesis : None sphinx : 3.1.2 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.11.2 IPython : 7.16.1 pandas_datareader: None bs4 : 4.9.1 bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : 3.2.2 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pytables : None pytest : None pyxlsb : None s3fs : None scipy : 1.5.0 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None xlsxwriter : None numba : 0.50.1
jreback commented 4 years ago

pls show only copy passable pandas in the example - or report to the other package

LazyTrader commented 4 years ago

I have 2 dataframes containing financial tick data from 17/04/2020 16:00 - 17/07/2020 23:55. Calling print on them results in the outputs listed below:

Code Sample, a copy-pastable example

print(Asset1Ticks)

print(Asset2Ticks)

                             bid      ask      mid
time_msc                                          
2020-04-17 16:00:00.060  1.24828  1.24836  1.24832
2020-04-17 16:00:00.871  1.24828  1.24835  1.24832
2020-04-17 16:00:01.780  1.24827  1.24835  1.24831
2020-04-17 16:00:03.467  1.24825  1.24835  1.24830
2020-04-17 16:00:03.471  1.24826  1.24833  1.24830
                         ...      ...      ...
2020-07-17 23:54:58.376  1.25669  1.25694  1.25682
2020-07-17 23:54:58.484  1.25666  1.25691  1.25678
2020-07-17 23:54:58.581  1.25666  1.25690  1.25678
2020-07-17 23:54:59.017  1.25670  1.25696  1.25683
2020-07-17 23:54:59.110  1.25671  1.25696  1.25684

[6278152 rows x 3 columns]

                             bid      ask      mid
time_msc                                          
2020-04-17 16:00:00.148  1.52730  1.52755  1.52742
2020-04-17 16:00:00.334  1.52730  1.52758  1.52744
2020-04-17 16:00:00.540  1.52734  1.52760  1.52747
2020-04-17 16:00:00.845  1.52735  1.52760  1.52748
2020-04-17 16:00:00.936  1.52741  1.52760  1.52750
                         ...      ...      ...
2020-07-17 23:54:58.437  1.55202  1.55241  1.55222
2020-07-17 23:54:58.546  1.55198  1.55237  1.55218
2020-07-17 23:54:58.643  1.55201  1.55239  1.55220
2020-07-17 23:54:58.749  1.55196  1.55234  1.55215
2020-07-17 23:54:59.939  1.55189  1.55233  1.55211

[6542489 rows x 3 columns]

However after resampling with the "1Min" offset, both of them have a different number of Nan values.


print(Asset1Ticks["mid"].resample("1Min").ohlc().close.isna().sum())

print(Asset2Ticks["mid"].resample("1Min").ohlc().close.isna().sum())

37583

37576

Problem description

Both dataframes have the same datetime ranges but somehow, resampling them produces different numbers of Nan values for each.

Additional info

This inconsistency is only observed with the "1Min" resampling offset. There are no inconsistencies for the other resampling offsets.

Expected Output

Since both financial instruments start and stop trading at the same time and their dataframes both have the same datetime ranges, the number of Nan values post resampling should be the same.

Output of pd.show_versions()

commit : None python : 3.8.3.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : en LOCALE : English_Singapore.1252 pandas : 1.0.5 numpy : 1.18.5 pytz : 2020.1 dateutil : 2.8.1 pip : 20.1.1 setuptools : 49.2.0.post20200714 Cython : None pytest : None hypothesis : None sphinx : 3.1.2 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 2.11.2 IPython : 7.16.1 pandas_datareader: None bs4 : 4.9.1 bottleneck : None fastparquet : None gcsfs : None lxml.etree : None matplotlib : 3.2.2 numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pytables : None pytest : None pyxlsb : None s3fs : None scipy : 1.5.0 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None xlsxwriter : None numba : 0.50.1
bashtage commented 4 years ago

They may have different number of nan values if one stopped trading early on some days. IOW there are some 1 minute intervals in one asset that have no trades reported while the other does have a trade in that interval.

LazyTrader commented 4 years ago

True. Closing this issue.