hellojinwoo opened this issue 5 years ago
@hellojinwoo You should use a sliding window so that you have overlapping parts, for data augmentation (it increases the amount of data you have). However, the code here would only cluster those extracted windows, not the stocks. Of course, the windows of stocks with high correlation should lie closer to each other in the latent space when you check their coordinates for the same time-frame.
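For example, a minimal sketch of the windowing I mean (the function name and parameters are just illustrative, not from this repo). Using a stride smaller than the window length gives you the overlapping, augmented windows; a stride equal to the window length gives non-overlapping slices:

```python
import numpy as np

def sliding_windows(series, window, stride):
    """Extract fixed-length windows from a 1-D series.

    stride < window  -> overlapping windows (data augmentation)
    stride == window -> non-overlapping slices
    """
    return np.stack([series[i:i + window]
                     for i in range(0, len(series) - window + 1, stride)])

# Hypothetical example: a 10-day return series, window of 4, stride of 2.
returns = np.arange(10, dtype=float)
windows = sliding_windows(returns, window=4, stride=2)
print(windows.shape)  # (4, 4)
```

Each row of `windows` is then one training sample for the VRAE; the stocks themselves are only compared afterwards, via the latent coordinates of their windows.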
By the way, I am also working on something similar (as you can see from the graph below):
This is also another interesting repository that I found, which could also be applicable to that problem: https://github.com/DerronXu/Deep-Co-Clustering
Hi @hellojinwoo ,
To answer your Q1 and Q3: I used the ECG5000 dataset in order to mask the original client data (which is under an NDA). It was just to show that this code runs end-to-end.
Please modify it as per your needs.
To answer your Q2, you can approach it in three ways:
All in all, I'm saying there's no single answer. You have to do a lot of preprocessing and traditional forecasting to achieve what you're after.
Thanks for the issue. I've included it in the FAQ section to increase visibility.
Hello @tejaslodaya, I have 3 questions regarding the code, and I would appreciate it if you could answer them.
Q1. How did you train the VRAE with 9500 data points (8500 train + 950 test) when the UCR archive has only 5000 data points in total?
According to the UCR time-series archive website, the ECG5000 dataset comprises a 500-sample train set and a 4500-sample test set. But your README.md says as follows:
Where does the difference in data points (4500 data points) between your dataset and the UCR archive come from?
Q2. For clustering purposes, what would be a good way to slice the daily_return time-series vector of a stock to train the VRAE?
cf) The daily return R_i is calculated from consecutive closing prices as R_i = (P_i - P_{i-1}) / P_{i-1}, as the picture below shows.
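In code, the computation I mean is roughly the following (a minimal sketch; the function name is mine, not from the repo):

```python
import numpy as np

def daily_returns(prices):
    """Daily returns from closing prices: R_i = (P_i - P_{i-1}) / P_{i-1}."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]

# Hypothetical closing prices over three days.
prices = [100.0, 102.0, 99.96]
print(daily_returns(prices))  # approximately [0.02, -0.02]
```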
Let's say I have daily_return time-series data for Apple stock (AAPL) from 2010/1/1 to 2018/12/31. It is roughly a 2000-dimensional 1-D vector. What would be the best way to slice this 2000-dimensional vector?
The research paper Variational Recurrent Auto-encoder tried both ways: dividing with overlapping parts and dividing without overlapping parts.
dividing without overlapping parts
dividing with overlapping parts
Since I am not interested in generation, which is the usual purpose of a VAE, I am wondering which way I should follow. My goal is to cluster stocks based on their daily_return time-series vectors. Any advice on how to slice the stock daily-return time-series data would be very much appreciated!
Q3. Why didn't you follow the author's way of slicing the time-series data?
As far as I know, the ECG5000 samples do not overlap with each other, yet you used the ECG data to train the VRAE model. Does this imply that a VRAE can be trained with non-overlapping data?
Thank you for reading the questions.