Open gpsaggese opened 1 year ago
Hello
I implemented the RandomForestRegressor model using the example at: https://www.cienciadedatos.net/documentos/py27-time-series-forecasting-python-scikitlearn.html The current predictions seem to be bad. The notebook is at: https://drive.google.com/file/d/1VUqwt6vBf_A00S-ut_LRcCqbuPZtWgLe/view?usp=share_link
Please check it if you have time. I plan to implement the backtesting approach next. We can discuss further how to improve on this.
Thanks!!
Good job @Chandramani05. You did everything correctly, yet it will never work this way :-)
The problem is that time series prediction is different from classical "simple" machine learning by the book. Just to give you a few pointers:
1) you need to differentiate and predict returns (and not prices). Ideally fractionally differentiate
2) you need to refresh the model every few time steps
3) it's very difficult to predict a single asset. It's easier to predict a linear combination of them
4) returns are the most difficult to predict among the various quantities. Let's start with something simpler (e.g., volatility or spread), so we can tune the machinery
I'll organize a quick meeting with the team.
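A minimal sketch of point 1 above, assuming a fixed-window fractional differencing scheme. The function names, the synthetic prices, the window length, and `d = 0.4` are illustrative assumptions, not values from this thread. With `d = 1` the weights reduce to an ordinary first difference, which gives log returns when applied to log prices.

```python
import numpy as np
import pandas as pd


def frac_diff_weights(d: float, n_weights: int) -> np.ndarray:
    """Weights of (1 - B)^d, most recent observation first."""
    w = [1.0]
    for k in range(1, n_weights):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)


def frac_diff(series: pd.Series, d: float, window: int) -> pd.Series:
    """Fixed-window fractional differencing (hypothetical helper)."""
    w = frac_diff_weights(d, window)
    values = series.to_numpy(dtype=float)
    out = np.full(len(values), np.nan)
    for t in range(window - 1, len(values)):
        # Reverse the chunk so that w[0] multiplies the value at time t.
        chunk = values[t - window + 1 : t + 1][::-1]
        out[t] = np.dot(w, chunk)
    return pd.Series(out, index=series.index)


# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))
log_prices = np.log(prices)

log_returns = log_prices.diff()                       # d = 1: plain returns
frac_series = frac_diff(log_prices, d=0.4, window=50)  # 0 < d < 1: keeps more memory
```

A smaller `d` keeps more of the price level's memory in the series, so `d` is something to tune rather than fix up front.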
Compute the volatility as "historical" rolling.std(30)
Maybe differentiate or not (try both)
Autoregressive model
Baseline: estimate volatility in the next time stamp as the volatility in the past time stamp
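The plan above can be sketched roughly as follows, on synthetic returns (the data, the 80/20 split, and the variable names are assumptions for illustration). The autoregressive model here is the simplest possible one, an AR(1) fitted by least squares. Because consecutive 30-step rolling windows overlap almost entirely, the "previous value" baseline is expected to be hard to beat, which is why any model should be compared against it.

```python
import numpy as np
import pandas as pd

# Synthetic returns, for illustration only.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 0.01, 1000))

# "Historical" volatility as a 30-step rolling standard deviation.
vol = returns.rolling(window=30).std().dropna().reset_index(drop=True)

# Target is volatility at time t, the only feature is volatility at t-1.
data = pd.DataFrame({"y": vol, "lag_1": vol.shift(1)}).dropna()
split = int(len(data) * 0.8)  # chronological split, no shuffling
train, test = data.iloc[:split], data.iloc[split:]

# Baseline: next volatility equals the previous volatility.
baseline_pred = test["lag_1"]

# AR(1): y_t = intercept + slope * y_{t-1}, fitted on the training part.
slope, intercept = np.polyfit(train["lag_1"], train["y"], deg=1)
ar1_pred = intercept + slope * test["lag_1"]

mae_baseline = (test["y"] - baseline_pred).abs().mean()
mae_ar1 = (test["y"] - ar1_pred).abs().mean()
print(f"MAE baseline: {mae_baseline:.6f}, MAE AR(1): {mae_ar1:.6f}")
```

To try the differenced variant, the same code can be run on `vol.diff().dropna()` instead of `vol`.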
Progress Report: I computed the volatility using the formula:

    df_original['log_returns'] = np.log(df_original['close']) - np.log(df_original['open'])
    df_original['volatility'] = df_original['log_returns'].rolling(window=30).std()
I took the latest 3 months of data for the model and 1 week of data for prediction:
After many tries and a lot of hyperparameter tuning of the random forest regressor, I still can't get a good result.
I tried changing the number of lags, the number of estimators, and the depth, but nothing improved.
This is the result:
I will start learning about backtesting, incorporate it into the method, and see whether it has any effect.
Progress Report:
I also tested backtesting, which is similar to cross-validation. I used 5 folds of cross-validation and got the following result. The error seems to be slightly lower, and each window seems to be fitting to its initial input.
My next step is to learn how feature importance works here and whether I can adjust the lag values based on it.
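For reference, a rough sketch of 5-fold time-ordered backtesting with a random forest on lagged volatility. It uses scikit-learn's `TimeSeriesSplit` as a stand-in; the notebook may use a different backtesting helper, and the synthetic data, number of lags, and hyperparameters are assumptions. Each fold trains only on observations that precede its test window.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic volatility series, for illustration only.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 0.01, 600))
vol = returns.rolling(window=30).std().dropna().reset_index(drop=True)

# Lagged values of the series are the features, the current value is the target.
n_lags = 5
lagged = pd.concat({f"lag_{k}": vol.shift(k) for k in range(1, n_lags + 1)}, axis=1)
data = lagged.assign(y=vol).dropna()
X, y = data.drop(columns="y"), data["y"]

fold_mae = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Training indices always come before test indices, so there is no lookahead.
    model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[test_idx])
    fold_mae.append(mean_absolute_error(y.iloc[test_idx], pred))

# Impurity-based importances of the model fitted on the last fold.
importances = pd.Series(model.feature_importances_, index=X.columns)
print("MAE per fold:", np.round(fold_mae, 6))
print(importances.sort_values(ascending=False))
```

If one lag dominates the importances, that suggests the forest is mostly reproducing the previous value, which is worth checking against a naive baseline.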
@Chandramani05 results are weird, I think there is a problem. Can you pls do a PR with all the notebooks so I can take a look?
We can talk about it in the next meeting. Sending invite shortly.
Sure, Professor. Here is the link to the notebook: https://drive.google.com/drive/folders/1GiT9EHILVCF_7QqOZFw9k5J8jrbANIPz
Let me know what I can do to improve.
Thanks!!
Hello @gpsaggese , Please make a separate branch for the ML Model Research so that we can push the code.
Not sure I follow.
You can create a branch and then a dir like 'SorrIssueXYZ_...' under https://github.com/sorrentum/sorrentum/tree/master/sorrentum_sandbox/examples/ml_projects
If you need some help with Git you can check https://github.com/gpsaggese/umd_data605/blob/main/lectures/02.1%20-%20Git%2C%20Data%20Pipelines.pdf and / or ask @samarth9008
Also I've improved the doc https://github.com/sorrentum/sorrentum/wiki/Organization-and-procedures#how-to-organize-your-research
Let me know if things are not clear
Hi, professor, I am xiaoyan, I am working on the spread. I have two questions here.
hi @shyanne399 you need to read a slice of the data by coin or by timestamp, since it's too much data to load at once
Parquet allows you to do that. You can look at this tutorial to learn more about it https://github.com/gpsaggese/umd_data605/blob/main/tutorials/tutorial_packages/tutorial_parquet.ipynb
Hi professor, It’s very helpful. I will learn it as soon as possible.
Thanks and hope you have a nice day, Xiaoyan
probably obsolete, moving to P1 for now
We want to write a short gdoc explaining how a financial time series problem can be formulated in the classical ML paradigm of supervised learning. Many prediction problems (e.g., price, volatility, spread, volume) can be formulated in this way. Once it's formulated, it's easy to apply sklearn or a Bayesian approach to solving the problem (in the end ML is just a form of optimization).
Assigning to @Chandramani05 as team lead but anybody can / should contribute
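As a possible starting point for that doc, here is a minimal sketch of the supervised-learning formulation: features at time t are the current and lagged values of the observed columns, and the target is the quantity of interest `horizon` steps ahead. The function name, column names, and toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def make_supervised(df: pd.DataFrame, target_col: str, n_lags: int, horizon: int = 1):
    """Turn a time-indexed DataFrame into (X, y) for supervised learning.

    X at row t holds each column at t, t-1, ..., t-n_lags+1 (all known at t).
    y at row t is `target_col` at t + horizon (unknown at t).
    """
    features = {}
    for col in df.columns:
        for k in range(n_lags):
            features[f"{col}_lag{k}"] = df[col].shift(k)
    X = pd.DataFrame(features, index=df.index)
    y = df[target_col].shift(-horizon).rename("target")
    # Drop rows where a lag or the future target is not available.
    valid = X.notna().all(axis=1) & y.notna()
    return X[valid], y[valid]


# Toy data: predict next-step volatility from past volatility and spread.
df = pd.DataFrame({
    "volatility": np.arange(10, dtype=float),
    "spread": np.arange(10, dtype=float) * 0.1,
})
X, y = make_supervised(df, target_col="volatility", n_lags=2, horizon=1)
print(X.head())
print(y.head())
```

Once the data is in this `(X, y)` form, any sklearn regressor can be fitted on it, provided the train/test split stays chronological.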