Closed: gmgreg closed this issue 3 years ago
Hey there!
Could you please share a minimal reproducible `raw_df` that leads to errors? I am not sure what the main problem is. Note that you can remove all potentially private information (column names, valid values, etc.).
In general, I encourage you to check the implementation of `raw_to_Xy` and rewrite it in a way that suits your use case:

https://github.com/jankrepl/deepdow/blob/ea894c590d41f1c0ce93679811c11c90e0f74549/deepdow/utils.py#L203
Additionally, check any of the end-to-end examples where `raw_to_Xy` was not used and `X` and `y` were created from scratch: https://deepdow.readthedocs.io/en/latest/auto_examples/index.html#end-to-end
Thank you for your response; I realize my question may not have been very clear. I took a look at the implementation and noticed `raw_to_Xy` calls pandas `date_range` with `freq='B'` by default (this wasn't clear to me from the documentation).
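To illustrate what tripped me up (dates chosen around the MLK 2016 holiday for the sake of the example): `freq='B'` generates every Monday through Friday, market holidays included.

```python
import pandas as pd

# freq='B' generates every Mon-Fri, so a market holiday such as
# MLK Day (Monday 2016-01-18) still shows up in the index even
# though no trading data exists for that day.
idx = pd.date_range('2016-01-15', '2016-01-19', freq='B')
print(list(idx.strftime('%Y-%m-%d')))  # ['2016-01-15', '2016-01-18', '2016-01-19']
```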
I believe I've been able to address this particular issue by building a custom business-day frequency:

```python
from pandas.tseries.holiday import USFederalHolidayCalendar

bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
```

...and then calling `raw_to_Xy` with that custom frequency:

```python
X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      horizon=horizon,
                                                      freq=bday_us,
                                                      use_log=True)
```
I'm still having issues and will provide sample data and additional information.
You can use the code below and the attached CSV file (`sample_raw_df.txt`, GitHub does not allow .csv attachments). You'll notice the data has 19 rows (timesteps), and if we use a 5-day lookback, 0 gap, and 1 horizon, it should be 14 windowed samples. When running it through `raw_to_Xy` we end up with 13.
```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

from deepdow.utils import raw_to_Xy

# The MLK holiday falls on Jan 18th and results in a gap
# if not accountedted for in a custom freq for pandas
bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())

raw_df = pd.read_csv('./sample_raw_df.txt',
                     parse_dates=['Date'],
                     index_col=['Date'])
raw_df = raw_df.sort_values(by=['Date', 'Ticker'])
raw_df = raw_df.pivot_table(index=['Date'],
                            columns='Ticker',
                            aggfunc='sum',
                            fill_value=0).swaplevel(axis=1).sort_index(axis=1)

assert isinstance(raw_df.columns, pd.MultiIndex)
assert isinstance(raw_df.index, pd.DatetimeIndex)

n_timesteps = len(raw_df)                   # 19
n_channels = len(raw_df.columns.levels[1])  # 5
n_assets = len(raw_df.columns.levels[0])    # 2

lookback, gap, horizon = 5, 0, 1

X, timestamps, y, asset_names, indicators = raw_to_Xy(raw_df,
                                                      lookback=lookback,
                                                      gap=gap,
                                                      horizon=horizon,
                                                      freq=bday_us,
                                                      use_log=True)

n_samples = n_timesteps - lookback - horizon - gap + 1  # 14
print(f'Timesteps: {n_timesteps}, Samples: {n_samples}, X.shape: {X.shape}')

assert timestamps[0] == raw_df.index[lookback]
assert X.shape == (n_samples, n_channels, lookback, n_assets)  # X.shape is (13, 5, 5, 2), expected (14, 5, 5, 2)
assert asset_names == list(raw_df.columns.levels[0])
assert indicators == list(raw_df.columns.levels[1])
```
Thank you for the example!

I would guess that the thing that confused you (I blame the documentation, see #72 for a fix) is that the true value of `n_samples` is not always equal to `len(raw_df) - lookback - horizon - gap + 1`. It worked out that way in the documentation; however, if there were a different number of missing timestamps in `raw_df`, or a different `freq`, it could be a totally different number.
`raw_to_Xy` creates its own `DatetimeIndex` in the following way (see the code for more details):

```python
index = pd.date_range(start=raw_data.index[0], end=raw_data.index[-1], freq=freq)
```
So it does not really matter what happens between the start and the end timestamp: the new index is generated from scratch based on the frequency and the endpoints. In your example, you changed the frequency to a custom one:

```python
index_custom = pd.date_range(start=raw_df.index[0], end=raw_df.index[-1], freq=bday_us)
index_default = pd.date_range(start=raw_df.index[0], end=raw_df.index[-1], freq='B')
print(len(index_custom), len(index_default), set(index_default) - set(index_custom))
```

```
19 20 {Timestamp('2016-01-18 00:00:00', freq='B')}
```
That means that just by providing your custom index you will lose 1 sample with respect to the default one.
> You'll notice the data has 19 rows (timesteps) and if we use a 5 day lookback, 0 gap, and 1 horizon it should be 14 windowed samples. When running it through `raw_to_Xy` we end up with 13.
I think you forgot to factor in the fact that the `raw_to_Xy` function actually computes 1-step returns in the background, so the first time step is dropped (see the code).
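The arithmetic then lines up; a minimal sketch, assuming the 1-step differencing just described:

```python
# Sketch of the sample count, assuming raw_to_Xy first converts the 19
# raw timestamps to 1-step returns, which drops the first one.
n_index = 19                 # timestamps in the custom business-day index
n_returns = n_index - 1      # 18 rows of returns after differencing
lookback, gap, horizon = 5, 0, 1
n_samples = n_returns - lookback - gap - horizon + 1
print(n_samples)  # 13, matching the observed X.shape[0]
```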
Thanks again for the feedback. As background, I just wanted to quickly test out deepdow with a limited dataset, so I was following the getting_started.ipynb notebook and simply replacing the generated data with a sample of my own, closer to the format noted in Data Loading.
I'm used to creating windowed training datasets as is typical for LSTMs, e.g. 3D numpy arrays of samples, lookback, and features, plus the matching target array (`y`). Feeding a toy dataset to `raw_to_Xy` caused several assertions to fail, which I mistook as critical.
I think it may be easier to take your earlier advice and create `X` and `y` from scratch. Looking at the generated data in the end-to-end examples is a start, though it only has a single feature (channel).
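For anyone else building `X` and `y` by hand, here is a minimal sketch with a made-up single-channel returns frame; the `(n_samples, n_channels, lookback, n_assets)` shape convention is taken from the assertions earlier in this thread, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical returns DataFrame: rows = timestamps, columns = assets,
# a single channel. Window it the LSTM way, but keep deepdow's
# (n_samples, n_channels, lookback, n_assets) layout.
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(19, 2)),
                       columns=['AAA', 'BBB'])

lookback, gap, horizon = 5, 0, 1
n_timesteps, n_assets = returns.shape
n_samples = n_timesteps - lookback - gap - horizon + 1

X = np.zeros((n_samples, 1, lookback, n_assets))
y = np.zeros((n_samples, 1, horizon, n_assets))
for i in range(n_samples):
    X[i, 0] = returns.values[i:i + lookback]
    y[i, 0] = returns.values[i + lookback + gap:i + lookback + gap + horizon]

print(X.shape, y.shape)  # (14, 1, 5, 2) (14, 1, 1, 2)
```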
At this point I've still not been able to get a toy dataset successfully trained (currently seeing a `RuntimeError: mat1 and mat2 shapes cannot be multiplied` error, no doubt due to something wrong in the dataset I'm loading).
Thanks for your patience.
After more experimentation, the relationship between the dataset shape and the network is now clearer. I had assumed the dataset and network were generic, but I now see that different networks expect different dataset shapes (e.g. number of channels). I had assumed the errors I was seeing when attempting to train were due to something in my dataset construction; in actuality it was a mismatch between what the network was expecting (e.g. 1 channel vs. multiple channels) and what I was feeding it.
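A toy illustration of that kind of mismatch (plain numpy, not deepdow-specific; the weight matrix here is a hypothetical stand-in for a layer sized for one channel):

```python
import numpy as np

# A weight matrix sized for 1 channel cannot multiply a 5-channel
# input; torch surfaces the analogous failure as
# "RuntimeError: mat1 and mat2 shapes cannot be multiplied".
lookback, n_assets = 5, 2
W = np.zeros((1 * lookback * n_assets, n_assets))  # expects 1 channel

x_one_channel = np.zeros((4, 1 * lookback * n_assets))
x_five_channels = np.zeros((4, 5 * lookback * n_assets))

print((x_one_channel @ W).shape)  # (4, 2)
try:
    x_five_channels @ W
except ValueError:
    print('shape mismatch:', x_five_channels.shape, 'vs', W.shape)
```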
Well, I hope you managed to do what you wanted! Feel free to ask any other questions at any point!
Cheers!
`raw_to_Xy` appears to handle regular gaps in data (e.g. weekend days) but cannot handle irregular gaps such as holidays.
When fed trading data similar to the example at https://deepdow.readthedocs.io/en/latest/source/data_loading.html, but covering an entire trading year, it gets out of sync on every holiday, e.g. a Monday that would typically trade but does not because of a holiday such as Jan 20, 2020.
The result is that the assertion

```python
assert timestamps[0] == raw_df.index[lookback]
```

fails. This, and likely other data formatting issues, causes an error when executing

```python
history = run.launch(30)
```

which is `RuntimeError: mat1 and mat2 shapes cannot be multiplied`.