timothyyu opened this issue 5 years ago
this is directly related to #5
a dual-stage approach is probably required due to the nature of the dataset (a complex hierarchical time series with numerical features), so a simple transformation is not going to work and will make the 2nd stage of the model less accurate
Copied from #5 (dataset scaling/normalization before wavelet transform):
The author of DeepLearning_Financial decided to forgo automated scaling/normalization and instead scaled the input features/dataset manually before applying the wavelet transform:
# This is a scaling of the inputs such that they are in an appropriate range
feats["Close Price"].loc[:] = feats["Close Price"].loc[:]/1000
feats["Open Price"].loc[:] = feats["Open Price"].loc[:]/1000
feats["High Price"].loc[:] = feats["High Price"].loc[:]/1000
feats["Low Price"].loc[:] = feats["Low Price"].loc[:]/1000
feats["Volume"].loc[:] = feats["Volume"].loc[:]/1000000
feats["MACD"].loc[:] = feats["MACD"].loc[:]/10
feats["CCI"].loc[:] = feats["CCI"].loc[:]/100
feats["ATR"].loc[:] = feats["ATR"].loc[:]/100
feats["BOLL"].loc[:] = feats["BOLL"].loc[:]/1000
feats["EMA20"].loc[:] = feats["EMA20"].loc[:]/1000
feats["MA10"].loc[:] = feats["MA10"].loc[:]/1000
feats["MTM6"].loc[:] = feats["MTM6"].loc[:]/100
feats["MA5"].loc[:] = feats["MA5"].loc[:]/1000
feats["MTM12"].loc[:] = feats["MTM12"].loc[:]/100
feats["ROC"].loc[:] = feats["ROC"].loc[:]/10
feats["SMI"].loc[:] = feats["SMI"].loc[:] * 10
feats["WVAD"].loc[:] = feats["WVAD"].loc[:]/100000000
feats["US Dollar Index"].loc[:] = feats["US Dollar Index"].loc[:]/100
feats["Federal Fund Rate"].loc[:] = feats["Federal Fund Rate"].loc[:]
# REMOVED THE NORMALIZATION AND MANUALLY SCALED TO APPROPRIATE VALUES ABOVE
"""
scaler = StandardScaler().fit(feats_train)
feats_norm_train = scaler.transform(feats_train)
feats_norm_validate = scaler.transform(feats_validate)
feats_norm_test = scaler.transform(feats_test)
"""
"""
scaler = MinMaxScaler(feature_range=(0,1))
scaler.fit(feats_train)
feats_norm_train = scaler.transform(feats_train)
feats_norm_validate = scaler.transform(feats_validate)
feats_norm_test = scaler.transform(feats_test)
"""
My main issues/concerns are the following:
Thus:
More research is needed on scaling/normalization in the context of time series data for machine learning. In terms of code/practical implementation, I will most likely code multiple options for different training runs (with different scaling/normalization options), and then compare.
RobustScaler test for 'nifty 50 index data':
see commit https://github.com/timothyyu/wsae-lstm/commit/8073c426f903611f7ec22043d4b5378054b2904b
scaled data and scaled denoised data now saved in data/interim folder:
pdf output of train-validate-test split scaled + denoised in reports folder:
Excerpt from pdf output:
the train-validate-test split is showing some questionable output; I'll look into it when I get a chance
there is the possibility it's a matplotlib/pdf render output issue: https://github.com/timothyyu/wsae-lstm/blob/master/reports/djia%20index%20data%20tvt%20split%20scale%20denoise%20visual.pdf
this should not be happening:
yup, this definitely should not be happening - values shouldn't be negative with hard thresholding to the point where they throw off the rest of the features/input data:
upon closer examination, it appears that the line/feature flatlining visually is not technically wrong for the djia index dataset: the values are still there, but RobustScaler needs some adjustment
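One adjustment worth trying is RobustScaler's `quantile_range` parameter - widening it uses a larger spread estimate, which shrinks the scaled magnitude of outliers. A toy sketch (the `(5, 95)` range is illustrative, not a recommendation):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# toy series with a few large spikes, similar to Volume or WVAD outliers
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 500),
                    [50.0, -60.0, 80.0]]).reshape(-1, 1)

# default: center on the median, scale by the interquartile range (25, 75)
default_scaled = RobustScaler().fit_transform(x)
# wider quantile range -> larger spread estimate -> outliers shrink more
wide_scaled = RobustScaler(quantile_range=(5.0, 95.0)).fit_transform(x)
```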
Hi @timothyyu, I can't quite understand the structure of your repository. What do the directories subrepos and wsae_lstm stand for?
@mg64ve the subrepos directory contains deeplearning_financial for reference; wsae_lstm is the main source location for the code for my implementation. I am using the following directory structure, but with wsae_lstm instead of src:
http://drivendata.github.io/cookiecutter-data-science/#directory-structure
Generally I will test or refine my implementation in a Jupyter Notebook in the notebooks folder, and then refine code from those notebooks into python files, which go under wsae_lstm. Jupyter notebooks are not exactly reproducible - that is part of the reason why I'm not doing everything in a Jupyter environment.
@timothyyu thanks. I see there are many notebooks in the archive directory. What is the reason for that?
Archived notebooks are not "current" to the latest commit - usually anything in the archived directory has been implemented in python under wsae_lstm. My general development process uses jupyter notebooks to explore and rapidly prototype, which means things will break frequently.
I know it's not exactly ideal, but this particular type of workflow allows me to rapidly prototype and develop while leveraging the data exploration and visualization tools provided by Anaconda + Jupyter Notebook, and then refine that exploration/visualization/prototype into something that can be reproduced by anyone that clones or forks the repository (i.e. the files under wsae_lstm).
Jupyter notebooks are great for visual analysis and exploration, but terrible for reproducible results, consistency, and development. For more context, see the following articles/threads:
Ok @timothyyu, but in the active notebooks you read data from ../data/interim/cdii_tvt_split.pickle. How did you make this data? I can't understand where the starting point is.
@mg64ve The train-validate-test intervals are structured as follows:
#print(dict_dataframes_index.keys())
# [index data][period 1-24][train/validate/test]
# Train [1], Validate [2], Test [3]
Every step of the process is saved in data/interim folder: https://github.com/timothyyu/wsae-lstm/tree/master/data/interim
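Loading that pickle and indexing into it follows the nesting described in the comments above. A self-contained sketch using a toy stand-in for the real file (the real pickle holds DataFrames; plain strings stand in for them here, and the key layout is taken from the comments in this thread):

```python
import os
import pickle
import tempfile

# toy dict mirroring the nesting of cdii_tvt_split.pickle:
# [index name][period 1-24][1 = train, 2 = validate, 3 = test]
dict_dataframes_index = {
    "djia index data": {
        period: {1: f"train-{period}", 2: f"validate-{period}", 3: f"test-{period}"}
        for period in range(1, 25)
    }
}

path = os.path.join(tempfile.mkdtemp(), "cdii_tvt_split.pickle")
with open(path, "wb") as f:
    pickle.dump(dict_dataframes_index, f)

# reading it back, as the active notebooks do
with open(path, "rb") as f:
    loaded = pickle.load(f)

train_period_1 = loaded["djia index data"][1][1]  # period 1, train split
```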
Functions that are used to clean and split the dataset are in wsae-lstm/utils.py:
https://github.com/timothyyu/wsae-lstm/blob/master/wsae_lstm/utils.py
The function used to generate the report output in the reports folder for the train-validate-test split is wsae-lstm/visualize.py:
https://github.com/timothyyu/wsae-lstm/blob/master/wsae_lstm/visualize.py
https://github.com/timothyyu/wsae-lstm/tree/master/reports
Ok @timothyyu, but what you call raw data is an .xls with several indicators. Where do you take this data from? To my understanding, raw data is OHLC + volume.
The raw data in the data/raw folder is straight from the source - it is the dataset that the authors of the WSAE-LSTM model journal/paper link and use themselves. Specifically, the raw data is obtained from the following link:
https://figshare.com/articles/Raw_Data/5028110
DOI:10.6084/m9.figshare.5028110
The source journal, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory" (Bao et al., 2017) describes that their raw data is not just OHLC + volume, but an assortment of technical indicators and macroeconomic variables added to the OHLC + volume data:
Source: "Table 1. Description of the input variables." (Bao et al., 2017)
Ok @timothyyu, I did not know they made this data available. I know that document very well, and I also recommend reading https://www.researchgate.net/publication/329316403_Recurrent_Neural_Networks_for_Financial_Time-Series_Modelling which seems very interesting. What do you think of the two documents? What is the purpose of your research? Do you want to replicate their results starting from the same data, or do you want to explore whether this concept is applicable to streaming data series? In the second case you need to start from OHLC market data. Also, what is the purpose of clean_dataset.py? I see you are basically adjusting columns in the dataset, since the dataset seems to already contain no null values or gaps. Right?
@timothyyu I got similar results processing the data with R. On the left is the raw data; on the right, the same data after preprocessing with Haar wavelets and SURE shrinking, following normalization.
Interesting - scaling the indicators separately from the OHLC is something I'm going to look into once I'm further along constructing the rest of the model. Additionally, I'm almost sure values from the wavelet transform have to be saved from the train sets to apply to the validate and test sets, but there are some limitations/issues regarding that (see: https://github.com/timothyyu/wsae-lstm/issues/6#issuecomment-469413885)
Ideally, I'd like see if this kind of hybrid model is viable before applying it to a streaming series.
This is possible by saving the scaling parameters:
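A sketch of what saving the scaling parameters can look like: the fitted scaler object itself can be pickled and re-applied later (the file name and choice of RobustScaler here are illustrative):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
train = rng.normal(100.0, 5.0, size=(200, 3))
test = rng.normal(100.0, 5.0, size=(50, 3))

# parameters (center/scale) are learned on the training interval only
scaler = RobustScaler().fit(train)

# persist the fitted scaler alongside the other interim artifacts
path = os.path.join(tempfile.mkdtemp(), "scaler.pickle")
with open(path, "wb") as f:
    pickle.dump(scaler, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# the restored scaler reproduces the transform exactly, and predictions
# can be mapped back to price units via inverse_transform
same_transform = np.allclose(restored.transform(test), scaler.transform(test))
round_trip = np.allclose(restored.inverse_transform(restored.transform(test)), test)
```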
However, the same is not true for the denoise with the wavelet transform if the values for sigma are different:
# (context: this snippet is the body of a denoising function; a signature
#  along the lines of `def waveletSmooth(x, wavelet, level, declevel):` is
#  assumed here, along with `import numpy as np`, `import pywt`, and a
#  median-absolute-deviation helper `mad`)
# calculate the wavelet coefficients
coeff = pywt.wavedec(x, wavelet, mode='periodization', level=declevel, axis=0)
# estimate the noise level from the detail coefficients at the chosen level
sigma = mad(coeff[-level])
#print("sigma: ", sigma)
# universal ("VisuShrink") threshold
uthresh = sigma * np.sqrt(2 * np.log(len(x)))
# hard-threshold every detail coefficient array
coeff[1:] = (pywt.threshold(i, value=uthresh, mode="hard") for i in coeff[1:])
# reconstruct the signal using the thresholded coefficients
y = pywt.waverec(coeff, wavelet, mode='periodization', axis=0)
return y, sigma, uthresh
There is more than one way to approach this - it is a multifaceted issue that will affect the rest of the model and its results.
@timothyyu I don't think we need to be concerned about the reverse process. The following are some snapshot after the SAE process. I am using s8 wavelet and SURE thresholding. I can't really understand what I should expect after SAE. Should I expect some features fusion or some features diversification?
> @timothyyu I don't think we need to be concerned about the reverse process. ... I can't really understand what I should expect after SAE. Should I expect some features fusion or some features diversification?
That is something I am still looking into - how the output from the SAE layers is fed into the LSTM section. I am not yet at that stage in replicating the results of the paper, so I can't fully answer your question at this time.
@mg64ve also see https://github.com/timothyyu/wsae-lstm/issues/6
There are potential issues with how the sigma and uthresh values are used for the wavelet transform that I am looking into.
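One candidate fix being considered is to estimate sigma/uthresh on the training interval only and reuse them when denoising validate/test data, avoiding look-ahead. A sketch of that idea (not the repository's settled approach; the haar wavelet, decomposition level, and toy signal are illustrative):

```python
import numpy as np
import pywt

def denoise_with_threshold(x, uthresh, wavelet="haar", declevel=2):
    """Hard-threshold detail coefficients using a threshold supplied by the caller."""
    coeff = pywt.wavedec(x, wavelet, mode="periodization", level=declevel, axis=0)
    coeff[1:] = [pywt.threshold(c, value=uthresh, mode="hard") for c in coeff[1:]]
    return pywt.waverec(coeff, wavelet, mode="periodization", axis=0)

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 8 * np.pi, 256))
train = clean + rng.normal(0.0, 0.3, 256)
test = clean + rng.normal(0.0, 0.3, 256)

# estimate sigma and uthresh on the TRAINING interval only
coeff_train = pywt.wavedec(train, "haar", mode="periodization", level=2, axis=0)
detail = coeff_train[-1]
sigma = np.median(np.abs(detail - np.median(detail))) / 0.6745  # MAD noise estimate
uthresh = sigma * np.sqrt(2.0 * np.log(len(train)))

# reuse the train-derived threshold on the test interval: no look-ahead
test_denoised = denoise_with_threshold(test, uthresh)
```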
Partial/incomplete answer to your question about the reverse process:
If the LSTMs are trained on scaled OHLC data, then the predictions will be scaled. If the process is not reversible (even approximately, if not exactly), then the output from the LSTMs is going to be unintelligible nonsense:
There is a possibility that the LSTMs are fed scaled and denoised indicator data, while the OHLC data is denoised but not scaled - it's fairly complex to think about in terms of a pipeline:
@timothyyu I don't know if that is really necessary. If you look at Gavin Tsang's document, he "normalised based only upon the minimum/maximum values of their corresponding training set in order to eliminate any prior knowledge of overall scale as would occur in real-time prediction". If you do this, you should evaluate whether the prediction is greater/lower than the previous value.
see comment on #9: https://github.com/timothyyu/wsae-lstm/issues/9#issuecomment-511061074
Relevant removed post/comment from r/algotrading that references this paper and echoes what I've found so far in attempting to replicate the model: https://www.removeddit.com/r/algotrading/comments/cr7jey/ive_reproduced_130_research_papers_about/
The most frustrating paper:
I have true hate for the authors of this paper: "A deep learning framework for financial time series using stacked autoencoders and long-short term memory". Probably the most complex AND vague in terms of methodology and after weeks trying to reproduce their results (and failing) I figured out that they were leaking future data into their training set (this also happens more than you'd think).
The two positive take-aways that I did find from all of this research are:
- Almost every instrument is mean-reverting on short timelines and trending on longer timelines. This has held true across most of the data that I tested. Putting this information into a strategy would be rather easy and straightforward (although you have no guarantee that it'll continue to work in future).
- When we were in the depths of the great recession, almost every signal was bearish (seeking alpha contributors, news, google trends). If this holds in the next recession, just using this data alone would give you a strategy that vastly outperforms the index across long time periods.
I agree; many papers do not consider many aspects, or they contain look-ahead bias. I think they should publish their code so everybody can check whether the results are made with code containing leakage. I would like to read this paper:
https://ieeexplore.ieee.org/document/8280883
But I don't have access. Do you think it could contain bias?
@mg64ve here's the paper, I haven't had a chance to go through it yet but I'll be including it under references in future commits:
li2017.pdf
Z. Li and V. Tam, "Combining the real-time wavelet denoising and long-short-term-memory neural network for predicting stock indexes," *2017 IEEE Symposium Series on Computational Intelligence (SSCI)*, Honolulu, HI, 2017, pp. 1-8.
doi: 10.1109/SSCI.2017.8280883
Careful attention is required for proper scaling/normalization of the Panel B and Panel C indicators in relation to the OHLC data (Panel A). When visualizing the train-validate-test split with matplotlib, some of the index types show two or more lines, which shouldn't be possible - or at least that is what I thought was the case (I was very, very wrong):
It turns out the Panel C indicators and Panel B indicators for the other index types are so far out of range that they visually flatten the other features when plotted:
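A quick way to verify that - and to make the report output readable - is to give each feature its own subplot, so each series gets its own y-scale. A minimal matplotlib sketch with toy data at roughly the magnitudes discussed in this thread (the values and output file name are illustrative):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, e.g. for pdf report output
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
features = {
    "Close Price": rng.normal(25000.0, 300.0, 100),  # DJIA-like magnitude
    "MACD": rng.normal(0.0, 30.0, 100),              # tiny by comparison
    "WVAD": rng.normal(0.0, 3e8, 100),               # huge by comparison
}

# one subplot per feature: each series gets its own y-scale, so a
# small-range feature no longer looks flatlined next to a huge-range one
fig, axes = plt.subplots(len(features), 1, figsize=(8, 6), sharex=True)
for ax, (name, series) in zip(axes, features.items()):
    ax.plot(series)
    ax.set_ylabel(name)

out_path = os.path.join(tempfile.mkdtemp(), "tvt_split_per_feature.pdf")
fig.savefig(out_path)
```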