hellojinwoo opened this issue 5 years ago
@hellojinwoo You should use a sliding window so that you have overlapping parts, for data augmentation (it increases the amount of data you have). However, the code here would only cluster those extracted windows, not the stocks. Of course, the windows of stocks with high correlation should lie closer to each other in the latent space when you check their coordinates for the same time-frame.
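For example, a minimal sketch of the windowing I mean (the function name and parameters are just illustrative, not from this repo). Using a stride smaller than the window length gives you the overlapping, augmented windows; a stride equal to the window length gives non-overlapping slices:

```python
import numpy as np

def sliding_windows(series, window, stride):
    """Extract fixed-length windows from a 1-D series.

    stride < window  -> overlapping windows (data augmentation)
    stride == window -> non-overlapping slices
    """
    return np.stack([series[i:i + window]
                     for i in range(0, len(series) - window + 1, stride)])

# Hypothetical example: a 10-day return series, window of 4, stride of 2.
returns = np.arange(10, dtype=float)
windows = sliding_windows(returns, window=4, stride=2)
print(windows.shape)  # (4, 4)
```

Each row of `windows` is then one training sample for the VRAE; the stocks themselves are only compared afterwards, via the latent coordinates of their windows.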
By the way, I am also working on something similar (as you can see from the graph below):
This is also another interesting repository that I found, which could also be applicable to that problem: https://github.com/DerronXu/Deep-Co-Clustering
Hi @hellojinwoo ,
To answer your Q1 and Q3: I used the ECG5000 dataset in order to mask the original client data (which is under an NDA). It was just to show that this code runs end-to-end.
Please modify it as per your needs.
To answer your Q2, you can approach it in three ways:
All in all, I'm saying there's no single answer. You have to do a lot of preprocessing and traditional forecasting to achieve what you're after.
Thanks for the issue. I've included it in the FAQ section to increase visibility.
Hello @tejaslodaya, I have 3 questions regarding the code, and I would appreciate it if you could answer them.
Q1. How did you train the VRAE with 9500 data points (8500 train + 950 test) when the UCR archive has only 5000 data points in total?
According to the UCR time-series archive website, the ECG5000 dataset comprises a 500-sample train set and a 4500-sample test set. But your README.md says as follows:
Where does the difference in data points (4500 data points) between your dataset and the UCR archive come from?
Q2. For clustering purposes, what would be a good way to slice the daily_return time-series vector of a stock to train the VRAE?
cf) The daily return R_i is calculated from consecutive closing prices as R_i = (P_i - P_{i-1}) / P_{i-1}, as the picture below shows.
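In code, the computation I mean is roughly the following (a minimal sketch; the function name is mine, not from the repo):

```python
import numpy as np

def daily_returns(prices):
    """Daily returns from closing prices: R_i = (P_i - P_{i-1}) / P_{i-1}."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]

# Hypothetical closing prices over three days.
prices = [100.0, 102.0, 99.96]
print(daily_returns(prices))  # approximately [0.02, -0.02]
```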
Let's say I have daily_return time-series data for Apple stock (AAPL) from 2010/1/1 to 2018/12/31. It is roughly a 2000-dimensional 1-D vector. What would be the best way to slice this 2000-dimensional vector?
The research paper Variational Recurrent Auto-encoder tried both ways: dividing with overlapping parts and dividing without overlapping parts.
dividing without overlapping parts
dividing with overlapping parts
Since I am not interested in generation, which is the usual purpose of a VAE, I am wondering which way I should follow. My goal is to cluster stocks based on their daily_return time-series vectors. Any advice on how to slice the stock daily-return time-series data would be very much appreciated!
Q3. Why didn't you follow the author's way of slicing the time-series data?
As far as I know, the ECG5000 samples do not overlap with each other, yet you used the ECG data to train the VRAE model. Does this imply that a VRAE can be trained with non-overlapping data?
Thank you for reading the questions.