jankrepl / deepdow

Portfolio optimization with deep learning.
https://deepdow.readthedocs.io
Apache License 2.0

raw_to_Xy doesn't handle gaps in data #71

Closed: gmgreg closed this issue 3 years ago

gmgreg commented 3 years ago

raw_to_Xy appears to handle regular gaps in data (e.g. weekend days) but cannot handle irregular gaps such as holidays.

When fed trading data similar to the example at https://deepdow.readthedocs.io/en/latest/source/data_loading.html but covering an entire trading year, it gets out of sync on every holiday, e.g. a Monday that would typically be a trading day but is not, such as Jan 20, 2020.

The result is that the assertion assert timestamps[0] == raw_df.index[lookback] fails.

This, and likely other data formatting issues, causes an error when executing history = run.launch(30): RuntimeError: mat1 and mat2 shapes cannot be multiplied.
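For illustration (a minimal sketch I put together, not output from the failing run), this is the kind of gap I mean: Jan 20, 2020 falls on a Monday but is the MLK holiday, so the trading data has no row for it even though a regular business-day calendar expects one:

import pandas as pd

# a plain business-day range counts Mon 2020-01-20 even though US markets
# were closed for the MLK holiday, so it is one day longer than the data
generated = pd.date_range('2020-01-17', '2020-01-21', freq='B')
print(generated)
# DatetimeIndex(['2020-01-17', '2020-01-20', '2020-01-21'], dtype='datetime64[ns]', freq='B')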

jankrepl commented 3 years ago

Hey there!

Could you please share a minimal reproducible raw_df that leads to errors? I am not sure what the main problem is. Note that you can remove all potentially private information (column names, valid values, etc.).

In general, I encourage you to check the implementation of raw_to_Xy and rewrite it in a way that suits your use case. https://github.com/jankrepl/deepdow/blob/ea894c590d41f1c0ce93679811c11c90e0f74549/deepdow/utils.py#L203

Additionally, check any of the end-to-end examples where raw_to_Xy was not used and the X, y were created from scratch: https://deepdow.readthedocs.io/en/latest/auto_examples/index.html#end-to-end
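For reference, a rough from-scratch windowing sketch (just an illustration, not the raw_to_Xy implementation; it assumes a returns array of shape (n_timesteps, n_channels, n_assets) and the X/y layout used elsewhere in this thread):

import numpy as np

def build_Xy(returns, lookback, gap, horizon):
    # returns: array of shape (n_timesteps, n_channels, n_assets)
    # X: (n_samples, n_channels, lookback, n_assets)
    # y: (n_samples, n_channels, horizon, n_assets)
    n_timesteps = returns.shape[0]
    X, y = [], []
    for i in range(lookback, n_timesteps - gap - horizon + 1):
        X.append(returns[i - lookback:i].transpose(1, 0, 2))            # channels first
        y.append(returns[i + gap:i + gap + horizon].transpose(1, 0, 2))
    return np.array(X), np.array(y)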

gmgreg commented 3 years ago

Thank you for your response; I realize my question may not have been very clear. I took a look at the implementation and noticed raw_to_Xy calls pandas date_range with freq='B' by default (this wasn't clear to me from the documentation).

I believe I've been able to address this particular issue by using:

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())

...and then call raw_to_Xy with that custom frequency...

X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      horizon=horizon,
                                                      freq=bday_us,
                                                      use_log=True)

I'm still having issues and will provide sample data and additional information.

gmgreg commented 3 years ago

You can use the code below and the attached csv file (sample_raw_df.txt, GitHub does not allow .csv attachments). You'll notice the data has 19 rows (timesteps), and if we use a 5 day lookback, 0 gap, and 1 horizon it should yield 14 windowed samples. When running it through raw_to_Xy we end up with 13.

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

from deepdow.utils import raw_to_Xy

bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
# MLK holiday is around Jan 19th and results in a gap if not accounted for in a custom freq for pandas

raw_df = pd.read_csv('./sample_raw_df.txt',
                     parse_dates=['Date'],
                     index_col=['Date'])
raw_df = raw_df.sort_values(by=['Date', 'Ticker'])

raw_df = raw_df.pivot_table(index=['Date'],
                            columns='Ticker',
                            aggfunc='sum',
                            fill_value=0).swaplevel(axis=1).sort_index(axis=1)

assert isinstance(raw_df.columns, pd.MultiIndex)
assert isinstance(raw_df.index, pd.DatetimeIndex)

n_timesteps = len(raw_df)  # 19
n_channels = len(raw_df.columns.levels[1])  # 5
n_assets = len(raw_df.columns.levels[0])  # 2

lookback, gap, horizon = 5, 0, 1

X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      horizon=horizon,
                                                      freq=bday_us,
                                                      use_log=True)

n_samples =  n_timesteps - lookback - horizon - gap + 1  # 14

print(f'Timesteps: {n_timesteps}, Samples: {n_samples}, X.shape {X.shape}')

assert timestamps[0] == raw_df.index[lookback]
assert X.shape == (n_samples, n_channels, lookback, n_assets) # X.shape: (13, 5, 5, 2), should be (14, 5, 5, 2)
assert asset_names == list(raw_df.columns.levels[0])
assert indicators == list(raw_df.columns.levels[1])

sample_raw_df.txt

jankrepl commented 3 years ago

Thank you for the example!

I would guess that the thing that confused you (I blame the documentation, see #72 for a fix) is that the true value of n_samples is not always equal to len(raw_df) - lookback - horizon - gap + 1. It worked out that way in the documentation example; however, with a different number of missing timestamps in the raw_df or a different freq, it could be a totally different number.

raw_to_Xy creates its own DatetimeIndex in the following way (see the code for more details):

index = pd.date_range(start=raw_data.index[0], end=raw_data.index[-1], freq=freq)

So it does not really matter what happens between the start and the end timestamps: the new index is generated from scratch based on the frequency and the endpoints. In your example, you changed the frequency to a custom one:

index_custom = pd.date_range(start=raw_df.index[0], end=raw_df.index[-1], freq=bday_us)
index_default =  pd.date_range(start=raw_df.index[0], end=raw_df.index[-1], freq='B')

print(len(index_custom), len(index_default), set(index_default) - set(index_custom))
19 20 {Timestamp('2016-01-18 00:00:00', freq='B')}

That means that just by providing your custom frequency you lose 1 sample relative to the default one.

You'll notice the data has 19 rows (timesteps), and if we use a 5 day lookback, 0 gap, and 1 horizon it should yield 14 windowed samples. When running it through raw_to_Xy we end up with 13.

I think you forgot to factor in that the raw_to_Xy function actually computes 1-step returns in the background, so the first timestep is deleted (see the code).
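Concretely, with the numbers from your example:

# 19 timestamps in the custom business-day index, minus 1 lost to the
# 1-step return computation, leaves 18 usable timesteps
n_samples = (19 - 1) - lookback - horizon - gap + 1  # = 18 - 5 - 1 - 0 + 1 = 13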

gmgreg commented 3 years ago

Thanks again for the feedback. As background, I just wanted to quickly test out deepdow with a limited dataset, so I was following the getting_started.ipynb notebook and simply replacing the generated data with a sample of my own, closer to the format noted in Data Loading.

I'm used to creating windowed training datasets as is typical for LSTMs, e.g. 3D numpy arrays of (samples, lookback, features) plus the matching target array (y). Feeding a toy dataset to raw_to_Xy caused several assertions to fail, which I mistook as critical.

I think it may be easier to take your earlier advice and create X and y from scratch. Looking at the generated data in the end-to-end examples is a start, though it uses only a single feature (channel).

At this point I still haven't been able to train a toy dataset successfully (I'm currently seeing a RuntimeError: mat1 and mat2 shapes cannot be multiplied error, no doubt due to something wrong in the dataset I'm loading).

Thanks for your patience.

gmgreg commented 3 years ago

After more experimentation, the relationship between the dataset shape and the network is now clearer. I had assumed the dataset and network were generic, but now I see that different networks expect different dataset shapes (e.g. number of channels). I had assumed the errors I was seeing when attempting to train were due to something in my dataset construction. In actuality it was a mismatch between what the network was expecting (e.g. 1 channel vs. multiple channels) and what I was feeding it.
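For anyone hitting the same error, a rough sketch of the check that sorted it out for me (assuming the network is built the way the getting started example builds BachelierNet, with n_input_channels and n_assets arguments; adjust for whatever network you actually use):

from deepdow.nn import BachelierNet

n_samples, n_channels, lookback, n_assets = X.shape

# the network has to be constructed for the same number of input channels as
# the dataset provides, otherwise the layers receive a tensor of the wrong
# width and PyTorch raises the mat1/mat2 shape error
network = BachelierNet(n_input_channels=n_channels, n_assets=n_assets)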

jankrepl commented 3 years ago

Well, I hope you managed to do what you wanted! Feel free to ask any other questions at any point!

Cheers!