gpsaggese opened this issue 1 year ago
Adding @iamdhairyagandhi
Let's make some progress on the specs and then we can have a Zoom meeting
I've been reading the research papers to familiarize myself with the terms/concepts and to understand the topic better overall. However, I'm having a bit of trouble digesting the formulas described in https://alo.mit.edu/wp-content/uploads/2015/08/TradingVolume.pdf and understanding how they work.
Also, the overarching question I have is how I can go about creating models and building tools from the literature I read. (I might be jumping the gun here; I'm just having trouble wrapping my head around how to get started.)
Thanks!
Let me break down the specs into digestible tasks and then we can attack them one at a time. Often the most difficult step is the first one.
That sounds great. Thank you so much!
@dchoi127 I've added some more detailed specs
Note that this task has some commonality with #7 (besides the y variable and the predictors), so we might also coordinate with the other team. Let's start putting together a notebook reading the data and doing some exploratory analysis, and then we can review the progress together.
Adding also @iamdhairyagandhi
Added some notes on team organization here https://docs.google.com/document/d/1ELLDf7dg3nli6nLYMpQ9IxuTW5dYdN15nluNCZbZmD4/edit#heading=h.3kqeij2a9hu1
I have created a notebook reading in the data, and I noticed that there were two columns named timestamp. I was curious what the benefit is of storing the timestamp in two formats.
After fiddling around with matplotlib and Python, I managed to create a (hideous) plot of intraday trading volume by time. There are various readability issues (overall the graph is illegible and so are the xticks).
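The plotting code is roughly the following sketch (the file and column names are placeholders, not the actual schema; the date-tick formatting is one thing I'm still experimenting with to fix the readability):

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Load the 1-minute data and plot volume over time.
df = pd.read_csv("btc_usdt.csv", parse_dates=["timestamp"], index_col="timestamp")

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df.index, df["volume"], linewidth=0.5)
# Let matplotlib pick a reasonable number of date ticks and use compact labels,
# which should help with the illegible xticks.
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
ax.set_xlabel("time")
ax.set_ylabel("volume")
plt.show()
```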
Some things/problems that I wanted to mention:
Below is the (hideous) plot of intraday trading volume by time (image attached).
Although it isn't exactly progress, I just wanted to give a status update on the tasks; I plan on coming to research office hours to discuss things further and figure out how to move forward.
Good job. The first step is always the most difficult.
We'll talk about this in detail in the office hours.
I've added more explanation of the data meaning at https://docs.google.com/document/d/1SceHHhOWyusWJYibMvQhyz2_qxic9By-deGFwCHxKW8/edit#heading=h.sz1tkdpu8v2w
The next steps:
After some fiddling around with Python, I've binned the data into 15-minute bins and graphed the following plot (this is for BTC_USDT).
(I took the average of the binned data over each 15 minutes.) In addition, I calculated the overall average and median for each 15-minute bin and plotted those as well.
Currently the mean/median graph does not tell us much, since with many datapoints in each bin the mean/median are not influenced by extraneous outliers. It appears that the mean/median volume remains fairly stable throughout the day.
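The binning is roughly the following sketch (file and column names are placeholders for the actual data):

```python
import pandas as pd

# Load 1-minute data and average the volume inside each 15-minute bin.
df = pd.read_csv("btc_usdt.csv", parse_dates=["timestamp"], index_col="timestamp")
vol_15min = df["volume"].resample("15min").mean()

# Mean / median volume for each 15-minute slot of the day, across all days.
by_slot = vol_15min.groupby(vol_15min.index.time)
mean_by_slot = by_slot.mean()
median_by_slot = by_slot.median()
print(mean_by_slot.head())
print(median_by_slot.head())
```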
I plan on coming to office hours to discuss this further.
Looks good @dchoi127
Can you do a PR with the notebook so I can take a look at the details?
If you need help on how to do the PR you can ask @FirstSingularity or @Chandramani05 (their info is at https://docs.google.com/spreadsheets/d/1eRZJaj5-1g6W7w_Ay4UhJEdtAvrTTM1V94cKj6_Vwoc/edit#gid=1253964093)
I've started taking a look at task 2 and reviewing what is necessary to create the model described in https://www.frontiersin.org/articles/10.3389/frai.2019.00021/full. The paper used the tensorflow package, and I've been reading up on tensorflow's documentation for RNNs.
I was wondering what a feature vector would look like in our case, and how we would compute the ground-truth label for a given example in our test set (since we don't have a column representing the change in direction).
I plan on coming to Friday's meeting to discuss more! Thanks!
Earlier in the week I accidentally nuked my Jupyter environment and it messed everything up for a while. I had been trying to import the gluonTS module and for some reason Jupyter does not recognize the package (which led me to do some things that eventually broke my Jupyter). Eventually I recovered my Jupyter setup, but the import statement still doesn't work. Strange!
I've been using statsmodels' ARIMA model to predict the volume; here is what I have so far (3-minute sliding-window data split). I had some conflicts with timezones and had to localize the times.
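The core of it is roughly this sketch (file, column names, and the (0, 1, 1) order are placeholders; the real notebook uses the sliding-window split):

```python
import pandas as pd
import statsmodels.api as sm

# Load 1-minute data; statsmodels complained about the timezone-aware index,
# so force it to be tz-naive.
df = pd.read_csv("btc_usdt.csv", parse_dates=["timestamp"], index_col="timestamp")
df.index = pd.to_datetime(df.index, utc=True).tz_localize(None)

vol = df["volume"].asfreq("min")        # regular 1-minute frequency

train, test = vol[:-60], vol[-60:]      # hold out the last hour as a toy test set
model = sm.tsa.ARIMA(train, order=(0, 1, 1)).fit()
pred = model.forecast(steps=len(test))  # forecast over the held-out hour
```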
However, there seems to be a strange bug: after about 5 minutes of running, the code terminates with a KeyError, and I can't understand why it happens so far into the execution. (I've attached a screenshot of the bug below.)
Another thing I had a question about, but glossed over when creating this model, was the order parameter for the sm.tsa.ARMA model. From a Google search I believe general models follow an order of (0, 1, 1), but it seems that further analysis of the data needs to take place to properly find the order.
The error is much longer but this is the last part of it.
EDIT: After investigating a bit longer, it appears the error has to do with the very last train/test split.
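For the order analysis, I think the first check is the ACF/PACF of the (differenced) series, e.g. reusing the vol series from the sketch above:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Autocorrelation / partial autocorrelation of the differenced 1-minute volume,
# as a first hint for the AR / MA orders.
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(vol.diff().dropna(), lags=60, ax=axes[0])
plot_pacf(vol.diff().dropna(), lags=60, ax=axes[1])
plt.show()
```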
Today in research office hours we were reviewing the code and we observed that 500k models were far too many and that one model per month would be a better structure. Upon attempting to implement this, I had a question about the setup of the train/test sets.
What is our target prediction window? i.e., given a month of data, do we want to predict the next month's data, or maybe the next day? I believe that if we try to predict a window smaller than a month, the TimeSeriesSplit gets strange and we may need to train more than one model for each month.
Also, what should our order parameter be without extensively tuning hyperparameters? Is there an order that we can use (without testing) that won't heavily affect our model?
After fiddling around with the train/test split just to get something going, I encountered a strange issue. Screenshots attached below
It appears that somehow the true_values and predictions are different lengths. It works for all models except for the train/test split of the very last model. It always breaks on the very last model's predict, and I can't understand why, because true_values and predictions should be exactly the same length, but for some strange reason they are off by one.
Also, I was looking at the predicted values for the models that did execute properly, and it appears that towards the end of the test set all the predicted values are the same. Screenshot attached below.
> Today in research office hours we were reviewing the code and we observed that 500k models were far too many and that one model per month would be a better structure. Upon attempting to implement this, I had a question about the setup of the train/test sets.
>
> What is our target prediction window? i.e., given a month of data, do we want to predict the next month's data, or maybe the next day? I believe that if we try to predict a window smaller than a month, the TimeSeriesSplit gets strange and we may need to train more than one model for each month.
It's ok to make a prediction for the entire next month to have a baseline. Then we can crank up the frequency of retraining and see if the quality of the model increases.
> Also, what should our order parameter be without extensively tuning hyperparameters? Is there an order that we can use (without testing) that won't heavily affect our model?
There isn't a way to guess the model form here. I would use brute force to start with (https://scikit-learn.org/stable/modules/grid_search.html). If I had to choose, I'd say a purely autoregressive model is probably best.
You can also try a model with a seasonal component, since we know that there is periodicity / seasonality: https://www.statsmodels.org/stable/examples/notebooks/generated/statespace_seasonal.html
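A minimal sketch along the lines of that notebook (the file name, column names, and number of harmonics here are placeholders, not a prescription):

```python
import pandas as pd
import statsmodels.api as sm

# Daily seasonality on 15-minute bars: 96 bars make up one day.
vol_15min = pd.read_csv(
    "btc_usdt_15min.csv", parse_dates=["timestamp"], index_col="timestamp"
)["volume"]

model = sm.tsa.UnobservedComponents(
    vol_15min,
    level="local level",
    freq_seasonal=[{"period": 96, "harmonics": 4}],
)
res = model.fit(disp=False)
print(res.summary())
```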
> It appears that somehow the true_values and predictions are different lengths. It works for all models except for the train/test split of the very last model. It always breaks on the very last model's predict, and I can't understand why, because true_values and predictions should be exactly the same length, but for some strange reason they are off by one.

I'm not sure that one needs to specify max_train_size, n_splits, and test_size. I would expect that specifying the number of splits is enough. I have my own implementation of rolling cross validation, and this one is a new piece in sklearn.

In these cases one needs to look at the code / read the instructions / try smaller examples to figure out exactly how it works. To use the tools we need to understand them perfectly...
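For example, a tiny toy run like this would show exactly which train/test indices TimeSeriesSplit produces when only n_splits is given:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy example: print the indices TimeSeriesSplit produces with only n_splits.
X = np.arange(20)

tscv = TimeSeriesSplit(n_splits=4)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(
        f"split {i}: train {train_idx.min()}..{train_idx.max()} ({len(train_idx)} rows), "
        f"test {test_idx.min()}..{test_idx.max()} ({len(test_idx)} rows)"
    )
```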
It turns out the issue was with the data itself, not with the model or the train/test split. In the bitcoin data there is a duplicate row for the date 2022-12-19 00:00 (almost everything is the same except the knowledge timestamp), which throws off the predictor: the duplicate input causes it to predict only one value for both inputs, resulting in the off-by-one error. After removing duplicates, the model executes without any issues.
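A tiny reproduction of the problem and the fix (values made up):

```python
import pandas as pd

# Two rows share the 2022-12-19 00:00 timestamp, which is what threw off the predictor.
idx = pd.to_datetime(["2022-12-18 23:59", "2022-12-19 00:00", "2022-12-19 00:00"])
df = pd.DataFrame({"volume": [10.0, 12.0, 12.0]}, index=idx)

# Keep only the first row for each repeated timestamp.
df = df[~df.index.duplicated(keep="first")]
print(df)
```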
I had been specifying the train_size and test_size to split the data into months as precisely as possible.
I'll move on to trying different models as well as investigating the autocorrelation plots.
I was following @Chandramani05's implementation of the supervised machine learning model and created a model to predict volume.
Using the warm_start = True parameter allows for incremental learning, such that each fit does not override the previous learning session. I then created splits to train on 3 months' worth of data to predict 1 week, refreshing the model at every split. This gave an RMSE of 197,000 versus the original error of 1,000,000,000.
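Roughly, the loop looks like the following sketch (the actual model, features, and split sizes differ; SGDRegressor is used here only as an example of an estimator that supports warm_start, and X / y are synthetic placeholders):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import TimeSeriesSplit

# warm_start=True makes each fit() start from the previously learned weights
# instead of re-learning from scratch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # placeholder features (e.g. lagged volume)
y = rng.normal(size=1000)        # placeholder target (volume)

model = SGDRegressor(warm_start=True)
tscv = TimeSeriesSplit(n_splits=5)

rmses = []
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])   # refresh the model on each split
    pred = model.predict(X[test_idx])
    rmses.append(float(np.sqrt(np.mean((pred - y[test_idx]) ** 2))))
print(rmses)
```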
I plotted the corresponding graph for each split; however, the graphs are not very readable.
probably obsolete, moving to P1 for now
Specs are at https://docs.google.com/document/d/1ELLDf7dg3nli6nLYMpQ9IxuTW5dYdN15nluNCZbZmD4/edit#heading=h.ngpubt7e4lpq
We will keep adding more details there
Assigned to @dchoi127