timothyyu opened this issue 5 years ago
this is directly related to #5
a dual-stage approach is probably required due to the nature of the dataset (a complex hierarchical time series with numerical features), so a simple transformation is not going to work and will make the 2nd stage of the model less accurate
Copied from #5 (dataset scaling/normalization before wavelet transform):
The author of DeepLearning_Financial decided to forgo automated scaling/normalization and instead scaled the input features/dataset manually before applying the wavelet transform:
# This is a scaling of the inputs such that they are in an appropriate range
feats["Close Price"].loc[:] = feats["Close Price"].loc[:]/1000
feats["Open Price"].loc[:] = feats["Open Price"].loc[:]/1000
feats["High Price"].loc[:] = feats["High Price"].loc[:]/1000
feats["Low Price"].loc[:] = feats["Low Price"].loc[:]/1000
feats["Volume"].loc[:] = feats["Volume"].loc[:]/1000000
feats["MACD"].loc[:] = feats["MACD"].loc[:]/10
feats["CCI"].loc[:] = feats["CCI"].loc[:]/100
feats["ATR"].loc[:] = feats["ATR"].loc[:]/100
feats["BOLL"].loc[:] = feats["BOLL"].loc[:]/1000
feats["EMA20"].loc[:] = feats["EMA20"].loc[:]/1000
feats["MA10"].loc[:] = feats["MA10"].loc[:]/1000
feats["MTM6"].loc[:] = feats["MTM6"].loc[:]/100
feats["MA5"].loc[:] = feats["MA5"].loc[:]/1000
feats["MTM12"].loc[:] = feats["MTM12"].loc[:]/100
feats["ROC"].loc[:] = feats["ROC"].loc[:]/10
feats["SMI"].loc[:] = feats["SMI"].loc[:] * 10
feats["WVAD"].loc[:] = feats["WVAD"].loc[:]/100000000
feats["US Dollar Index"].loc[:] = feats["US Dollar Index"].loc[:]/100
feats["Federal Fund Rate"].loc[:] = feats["Federal Fund Rate"].loc[:]
# REMOVED THE NORMALIZATION AND MANUALLY SCALED TO APPROPRIATE VALUES ABOVE
"""
scaler = StandardScaler().fit(feats_train)
feats_norm_train = scaler.transform(feats_train)
feats_norm_validate = scaler.transform(feats_validate)
feats_norm_test = scaler.transform(feats_test)
"""
"""
scaler = MinMaxScaler(feature_range=(0,1))
scaler.fit(feats_train)
feats_norm_train = scaler.transform(feats_train)
feats_norm_validate = scaler.transform(feats_validate)
feats_norm_test = scaler.transform(feats_test)
"""
My main issues/concerns are the following:
Thus:
More research is needed on scaling/normalization in the context of time series data for machine learning. In terms of code/practical implementation, I will most likely code multiple options for different training runs (with different scaling/normalization options), and then compare.
RobustScaler test for 'nifty 50 index data':
see commit https://github.com/timothyyu/wsae-lstm/commit/8073c426f903611f7ec22043d4b5378054b2904b
scaled data and scaled denoised data now saved in data/interim folder:
pdf output of train-validate-test split scaled + denoised in reports folder:
Excerpt from pdf output:
the train-validate-test split is showing some questionable output; I'll look into it when I get a chance
there is the possibility it's a matplotlib/pdf render output issue: https://github.com/timothyyu/wsae-lstm/blob/master/reports/djia%20index%20data%20tvt%20split%20scale%20denoise%20visual.pdf
this should not be happening:
yup, this definitely should not be happening - values shouldn't be negative with hard thresholding to the point where they throw off the rest of the features/input data:
upon closer examination, it appears that the line/feature flatlining visually is not technically wrong for the djia index dataset: the values are still there, but RobustScaler needs some adjustment
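One adjustment worth trying is RobustScaler's `quantile_range` parameter - widening it uses a larger spread estimate, which shrinks the scaled magnitude of outliers. A toy sketch (the `(5, 95)` range is illustrative, not a recommendation):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# toy series with a few large spikes, similar to Volume or WVAD outliers
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 500),
                    [50.0, -60.0, 80.0]]).reshape(-1, 1)

# default: center on the median, scale by the interquartile range (25, 75)
default_scaled = RobustScaler().fit_transform(x)
# wider quantile range -> larger spread estimate -> outliers shrink more
wide_scaled = RobustScaler(quantile_range=(5.0, 95.0)).fit_transform(x)
```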
Hi @timothyyu, I can't quite understand the structure of your repository. What do the directories subrepos and wsae_lstm stand for?
@mg64ve the subrepos directory contains deeplearning_financial for reference; wsae_lstm is the main source location for the code for my implementation. I am using the following directory structure, but with wsae_lstm instead of src:
http://drivendata.github.io/cookiecutter-data-science/#directory-structure
Generally I will test or refine my implementation in a Jupyter Notebook in the notebooks folder, and then refine code from those notebooks into python files, which go under wsae_lstm. Jupyter notebooks are not exactly reproducible - that is part of the reason why I'm not doing everything in a Jupyter environment.
@timothyyu thanks. I see there are many notebooks in the archive directory. What is the reason for that?
Archived notebooks are not "current" to the latest commit - usually anything in the archived directory has been implemented in python under wsae_lstm. My general development process uses jupyter notebooks to explore and rapidly prototype, which means things will break frequently.
I know it's not exactly ideal, but this particular type of workflow allows me to rapidly prototype and develop while leveraging the data exploration and visualization tools provided by Anaconda + Jupyter Notebook, and then refine that exploration/visualization/prototype into something that can be reproduced by anyone that clones or forks the repository (i.e. the files under wsae_lstm).
Jupyter notebooks are great for visual analysis and exploration, but terrible for reproducible results, consistency, and development. For more context, see the following articles/threads:
Ok @timothyyu, but in the active notebooks you read data from ../data/interim/cdii_tvt_split.pickle. How did you make this data? I can't understand where the starting point is.
@mg64ve The train-validate-test intervals are structured as follows:
#print(dict_dataframes_index.keys())
# [index data][period 1-24][train/validate/test]
# Train [1], Validate [2], Test [3]
Every step of the process is saved in data/interim folder: https://github.com/timothyyu/wsae-lstm/tree/master/data/interim
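Loading that pickle and indexing into it follows the nesting described in the comments above. A self-contained sketch using a toy stand-in for the real file (the real pickle holds DataFrames; plain strings stand in for them here, and the key layout is taken from the comments in this thread):

```python
import os
import pickle
import tempfile

# toy dict mirroring the nesting of cdii_tvt_split.pickle:
# [index name][period 1-24][1 = train, 2 = validate, 3 = test]
dict_dataframes_index = {
    "djia index data": {
        period: {1: f"train-{period}", 2: f"validate-{period}", 3: f"test-{period}"}
        for period in range(1, 25)
    }
}

path = os.path.join(tempfile.mkdtemp(), "cdii_tvt_split.pickle")
with open(path, "wb") as f:
    pickle.dump(dict_dataframes_index, f)

# reading it back, as the active notebooks do
with open(path, "rb") as f:
    loaded = pickle.load(f)

train_period_1 = loaded["djia index data"][1][1]  # period 1, train split
```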
Functions that are used to clean and split the dataset are in wsae-lstm/utils.py:
https://github.com/timothyyu/wsae-lstm/blob/master/wsae_lstm/utils.py
The function used to generate the report output in the reports folder for the train-validate-test split is wsae-lstm/visualize.py:
https://github.com/timothyyu/wsae-lstm/blob/master/wsae_lstm/visualize.py
https://github.com/timothyyu/wsae-lstm/tree/master/reports
Ok @timothyyu, but what you call raw data is an .xls with several indicators. Where do you take this data from? To my understanding, raw data is OHLC + volume.
The raw data in the data/raw folder is straight from the source - it is the dataset that the authors of the WSAE-LSTM model journal/paper link and use themselves. Specifically, the raw data is obtained from the following link:
https://figshare.com/articles/Raw_Data/5028110
DOI:10.6084/m9.figshare.5028110
The source journal, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory" (Bao et al., 2017) describes that their raw data is not just OHLC + volume, but an assortment of technical indicators and macroeconomic variables added to the OHLC + volume data:
Source: "Table 1. Description of the input variables." (Bao et al., 2017)
Ok @timothyyu, I did not know they made this data available. I know that document very well, and I also recommend reading https://www.researchgate.net/publication/329316403_Recurrent_Neural_Networks_for_Financial_Time-Series_Modelling which seems very interesting. What do you think of the two documents? What is the purpose of your research? Do you want to replicate their results starting from the same data, or do you want to explore whether this concept is applicable to streaming data series? In the second case you need to start from OHLC market data. Also, what is the purpose of clean_dataset.py? I see you are basically adjusting columns in the dataset, since the dataset seems to already contain no null values or gaps. Right?
@timothyyu I got similar results processing the data with R. On the left is the raw data; on the right, the same data after preprocessing with Haar wavelets and SURE shrinking, following normalization.
Interesting - scaling the indicators separately from the OHLC is something I'm going to look into once I'm further along constructing the rest of the model. Additionally, I'm almost sure values from the wavelet transform have to be saved from the train sets to apply to the validate and test sets, but there are some limitations/issues regarding that (see: https://github.com/timothyyu/wsae-lstm/issues/6#issuecomment-469413885)
Ideally, I'd like see if this kind of hybrid model is viable before applying it to a streaming series.
This is possible by saving the scaling parameters:
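A sketch of what saving the scaling parameters can look like: the fitted scaler object itself can be pickled and re-applied later (the file name and choice of RobustScaler here are illustrative):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
train = rng.normal(100.0, 5.0, size=(200, 3))
test = rng.normal(100.0, 5.0, size=(50, 3))

# parameters (center/scale) are learned on the training interval only
scaler = RobustScaler().fit(train)

# persist the fitted scaler alongside the other interim artifacts
path = os.path.join(tempfile.mkdtemp(), "scaler.pickle")
with open(path, "wb") as f:
    pickle.dump(scaler, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# the restored scaler reproduces the transform exactly, and predictions
# can be mapped back to price units via inverse_transform
same_transform = np.allclose(restored.transform(test), scaler.transform(test))
round_trip = np.allclose(restored.inverse_transform(restored.transform(test)), test)
```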
However, the same is not true for the denoise with the wavelet transform if the values for sigma are different:
# (context: this snippet is the body of a denoising function; a signature
#  along the lines of `def waveletSmooth(x, wavelet, level, declevel):` is
#  assumed here, along with `import numpy as np`, `import pywt`, and a
#  median-absolute-deviation helper `mad`)
# calculate the wavelet coefficients
coeff = pywt.wavedec(x, wavelet, mode='periodization', level=declevel, axis=0)
# estimate the noise level from the detail coefficients at the chosen level
sigma = mad(coeff[-level])
#print("sigma: ", sigma)
# universal ("VisuShrink") threshold
uthresh = sigma * np.sqrt(2 * np.log(len(x)))
# hard-threshold every detail coefficient array
coeff[1:] = (pywt.threshold(i, value=uthresh, mode="hard") for i in coeff[1:])
# reconstruct the signal using the thresholded coefficients
y = pywt.waverec(coeff, wavelet, mode='periodization', axis=0)
return y, sigma, uthresh
There is more than one way to approach this - it is a multifaceted issue that will affect the rest of the model and its results.
@timothyyu I don't think we need to be concerned about the reverse process. The following are some snapshot after the SAE process. I am using s8 wavelet and SURE thresholding. I can't really understand what I should expect after SAE. Should I expect some features fusion or some features diversification?
> @timothyyu I don't think we need to be concerned about the reverse process. ... I can't really understand what I should expect after SAE. Should I expect some features fusion or some features diversification?
That is something I am still looking into - how the output from the SAE layers is fed into the LSTM section. I am not yet at that stage in replicating the results of the paper, so I can't fully answer your question at this time.
@mg64ve also see https://github.com/timothyyu/wsae-lstm/issues/6
There are potential issues with how the sigma and uthresh values are used for the wavelet transform that I am looking into.
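One candidate fix being considered is to estimate sigma/uthresh on the training interval only and reuse them when denoising validate/test data, avoiding look-ahead. A sketch of that idea (not the repository's settled approach; the haar wavelet, decomposition level, and toy signal are illustrative):

```python
import numpy as np
import pywt

def denoise_with_threshold(x, uthresh, wavelet="haar", declevel=2):
    """Hard-threshold detail coefficients using a threshold supplied by the caller."""
    coeff = pywt.wavedec(x, wavelet, mode="periodization", level=declevel, axis=0)
    coeff[1:] = [pywt.threshold(c, value=uthresh, mode="hard") for c in coeff[1:]]
    return pywt.waverec(coeff, wavelet, mode="periodization", axis=0)

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 8 * np.pi, 256))
train = clean + rng.normal(0.0, 0.3, 256)
test = clean + rng.normal(0.0, 0.3, 256)

# estimate sigma and uthresh on the TRAINING interval only
coeff_train = pywt.wavedec(train, "haar", mode="periodization", level=2, axis=0)
detail = coeff_train[-1]
sigma = np.median(np.abs(detail - np.median(detail))) / 0.6745  # MAD noise estimate
uthresh = sigma * np.sqrt(2.0 * np.log(len(train)))

# reuse the train-derived threshold on the test interval: no look-ahead
test_denoised = denoise_with_threshold(test, uthresh)
```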
Partial/incomplete answer to your question about the reverse process:
If the LSTMs are trained on scaled OHLC data, then the predictions will be scaled. If the process is not reversible (even approximately, if not exactly), then the output from the LSTMs is going to be unintelligible nonsense:
There is a possibility that the LSTMs are fed scaled and denoised indicator data, while the OHLC data is denoised but not scaled - it's fairly complex to think about in terms of a pipeline:
@timothyyu I don't know if that is really necessary. If you look at Gavin Tsang's document, he "normalised based only upon the minimum/maximum values of their corresponding training set in order to eliminate any prior knowledge of overall scale as would occur in real-time prediction". If you do this, you should evaluate whether the prediction is greater/lower than the previous value.
see comment on #9: https://github.com/timothyyu/wsae-lstm/issues/9#issuecomment-511061074
Relevant removed post/comment from r/algotrading that references this paper and echoes what I've found so far in attempting to replicate the model: https://www.removeddit.com/r/algotrading/comments/cr7jey/ive_reproduced_130_research_papers_about/
The most frustrating paper:
I have true hate for the authors of this paper: "A deep learning framework for financial time series using stacked autoencoders and long-short term memory". Probably the most complex AND vague in terms of methodology and after weeks trying to reproduce their results (and failing) I figured out that they were leaking future data into their training set (this also happens more than you'd think).
The two positive take-aways that I did find from all of this research are:
- Almost every instrument is mean-reverting on short timelines and trending on longer timelines. This has held true across most of the data that I tested. Putting this information into a strategy would be rather easy and straightforward (although you have no guarantee that it'll continue to work in future).
- When we were in the depths of the great recession, almost every signal was bearish (seeking alpha contributors, news, google trends). If this holds in the next recession, just using this data alone would give you a strategy that vastly outperforms the index across long time periods.
I agree; many papers do not consider many aspects, or they contain look-ahead bias. I think they should publish their code so everybody can check whether the results are made with code containing leakage. I would like to read this paper:
https://ieeexplore.ieee.org/document/8280883
But I don't have access. Do you think it could contain bias?
@mg64ve here's the paper, I haven't had a chance to go through it yet but I'll be including it under references in future commits:
li2017.pdf
Z. Li and V. Tam, "Combining the real-time wavelet denoising and long-short-term-memory neural network for predicting stock indexes," *2017 IEEE Symposium Series on Computational Intelligence (SSCI)*, Honolulu, HI, 2017, pp. 1-8.
doi: 10.1109/SSCI.2017.8280883
Careful attention is required for proper scaling/normalization of the Panel B and Panel C indicators in relation to the OHLC data (Panel A). When visualizing the train-validate-test split with matplotlib, some of the index types show two or more lines, which shouldn't be possible - or at least that is what I thought was the case (I was very, very wrong):
It turns out the Panel C indicators and Panel B indicators for the other index types are so far out of range that they visually flatten the other features when plotted:
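A quick way to verify that - and to make the report output readable - is to give each feature its own subplot, so each series gets its own y-scale. A minimal matplotlib sketch with toy data at roughly the magnitudes discussed in this thread (the values and output file name are illustrative):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, e.g. for pdf report output
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
features = {
    "Close Price": rng.normal(25000.0, 300.0, 100),  # DJIA-like magnitude
    "MACD": rng.normal(0.0, 30.0, 100),              # tiny by comparison
    "WVAD": rng.normal(0.0, 3e8, 100),               # huge by comparison
}

# one subplot per feature: each series gets its own y-scale, so a
# small-range feature no longer looks flatlined next to a huge-range one
fig, axes = plt.subplots(len(features), 1, figsize=(8, 6), sharex=True)
for ax, (name, series) in zip(axes, features.items()):
    ax.plot(series)
    ax.set_ylabel(name)

out_path = os.path.join(tempfile.mkdtemp(), "tvt_split_per_feature.pdf")
fig.savefig(out_path)
```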