Formalize a financial prediction problem as supervised ML

gpsaggese commented 1 year ago

We want to write a short gdoc explaining how a financial time series problem can be formulated in the classical ML paradigm of supervised learning. Many prediction problems (e.g., price, volatility, spread, volume) can be formulated in this way. Once it's formulated, it's easy to apply sklearn or a Bayesian approach to solving the problem (in the end ML is just a form of optimization).
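
A minimal sketch of this formulation, assuming the target at each time step is predicted from its own lagged values. The helper name `make_supervised`, the number of lags, and the synthetic series are illustrative assumptions and are not taken from the gdoc.

```python
import numpy as np
import pandas as pd


def make_supervised(series, n_lags):
    """Turn a time series into a lagged feature matrix X and a target y."""
    # Column lag_k holds the value observed k steps before the target.
    features = {f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)}
    X = pd.DataFrame(features)
    y = series.rename("target")
    # The first n_lags rows have incomplete history, so drop them.
    valid = X.notna().all(axis=1)
    return X[valid], y[valid]


# Synthetic data, for illustration only.
series = pd.Series(np.arange(10, dtype=float))
X, y = make_supervised(series, n_lags=3)
```

Once the data is in this `(X, y)` form, any sklearn regressor can be fitted on it, as long as the train / test split respects time order.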

Assigning to @Chandramani05 as team lead but anybody can / should contribute

gpsaggese commented 1 year ago

gdoc is here

Chandramani05 commented 1 year ago

Hello

I implemented a RandomForestRegressor model following the example at https://www.cienciadedatos.net/documentos/py27-time-series-forecasting-python-scikitlearn.html. The current predictions seem to be poor. The notebook is at https://drive.google.com/file/d/1VUqwt6vBf_A00S-ut_LRcCqbuPZtWgLe/view?usp=share_link

Please take a look if you have time. I plan to implement the backtesting approach next. We can then discuss how to improve on this.

Thanks!!

gpsaggese commented 1 year ago

Good job @Chandramani05. You did everything correctly, yet it will never work this way :-)

The problem is that time series prediction is different from classical "simple" machine learning by the book. Just to give you a few pointers:

1) You need to difference the series and predict returns (not prices), ideally with fractional differentiation.
2) You need to refresh the model every few time steps.
3) It's very difficult to predict a single asset. It's easier to predict a linear combination of assets.
4) Returns are the most difficult to predict among the various quantities. Let's start with something simpler (e.g., volatility or spread), so we can tune the machinery.
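
On pointer 1), a rough sketch of fixed-window fractional differencing, assuming the standard binomial expansion of (1 - B)^d, where B is the backshift operator. The function names, the window length, and the choice of d are illustrative assumptions, not something prescribed in this thread.

```python
import numpy as np


def frac_diff_weights(d, n_weights):
    """Weights of the (1 - B)^d expansion: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return np.array(weights)


def frac_diff(values, d, window):
    """Fixed-window fractional differencing of a 1-D array."""
    values = np.asarray(values, dtype=float)
    w = frac_diff_weights(d, window)
    out = np.full(values.shape, np.nan)
    for t in range(window - 1, len(values)):
        # w[0] multiplies the current value, w[1] the previous one, and so on.
        out[t] = np.dot(w, values[t - window + 1 : t + 1][::-1])
    return out
```

With d = 1 the weights reduce to [1, -1, 0, ...], i.e., an ordinary first difference. A fractional d between 0 and 1 removes less of the series' memory, and how much differencing is needed is an empirical question for each series.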

I'll organize a quick meeting with the team.

gpsaggese commented 1 year ago

Compute pct_change so you can predict returns
Learn on 3 months worth of data, predict 1 week
Compute the hit rate (% of times you correctly "guess" the sign on the OOS data, ± confidence interval)
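
The three steps above can be sketched as follows. This assumes daily data, a 90-row in-sample window standing in for "3 months", a 7-row out-of-sample window standing in for "1 week", and a normal-approximation confidence interval. The prices are synthetic and the forecast is a naive placeholder, not a real model.

```python
import numpy as np
import pandas as pd


def hit_rate(y_true, y_pred, z=1.96):
    """Fraction of correctly predicted signs, with a normal-approximation interval."""
    hits = np.sign(np.asarray(y_true)) == np.sign(np.asarray(y_pred))
    n = hits.size
    p = hits.mean()
    half_width = z * np.sqrt(p * (1 - p) / n)
    return p, (p - half_width, p + half_width)


# Synthetic prices, for illustration only.
rng = np.random.default_rng(0)
index = pd.date_range("2023-01-01", periods=100, freq="D")
log_steps = rng.normal(0, 0.01, size=100)
prices = pd.Series(100 * np.exp(np.cumsum(log_steps)), index=index)
returns = prices.pct_change().dropna()

# Roughly 3 months in-sample, then 1 week out-of-sample.
train, test = returns.iloc[:90], returns.iloc[90:97]

# Naive placeholder forecast: always predict the in-sample mean return.
pred = np.full(len(test), train.mean())
rate, (low, high) = hit_rate(test.values, pred)
```

With only 7 out-of-sample points the interval is very wide, so the hit rate becomes informative only once it is accumulated over many rolling train / predict windows.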

Switch to predict volatility

Compute the volatility as "historical" rolling.std(30)
Maybe difference the series or not (try both)
Autoregressive model
Baseline: estimate volatility in the next time stamp as the volatility in the past time stamp
- Can you beat this baseline in a statistically significant way? Use a paired t-test (read about it)
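
A sketch of the comparison described above, assuming synthetic returns, an AR(1) model fitted by least squares as the candidate, and `scipy.stats.ttest_rel` for the paired t-test on out-of-sample absolute errors. The 50/50 split and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic returns, for illustration only.
rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(0, 0.01, size=500))
vol = returns.rolling(30).std().dropna()

# Target: volatility at the next time stamp.
target = vol.shift(-1).dropna()
current = vol.loc[target.index]

# Baseline: the next volatility equals the current volatility.
baseline_pred = current

# Candidate: AR(1) fitted on the first half, evaluated on the second half.
split = len(target) // 2
slope, intercept = np.polyfit(current.iloc[:split], target.iloc[:split], deg=1)
model_pred = intercept + slope * current

# Paired t-test on the out-of-sample absolute errors of the two forecasts.
err_baseline = (target - baseline_pred).abs().iloc[split:]
err_model = (target - model_pred).abs().iloc[split:]
t_stat, p_value = stats.ttest_rel(err_model, err_baseline)
```

A negative `t_stat` with a small `p_value` would suggest that the candidate has lower error than the baseline. Errors computed from overlapping rolling windows are autocorrelated, so the p-value should be read as indicative only.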

Chandramani05 commented 1 year ago

Progress report: I computed the volatility using the formula:

df_original['log_returns'] = np.log(df_original['close']) - np.log(df_original['open'])
df_original['volatility'] = df_original['log_returns'].rolling(window=30).std()

I used the latest 3 months of data to train the model and 1 week of data for prediction:

After many attempts and a lot of hyperparameter tuning of the random forest regressor, I still can't get a good result. I tried changing the number of lags, the number of estimators, and the depth, but nothing helped. This is the result:

I will start learning about backtesting, incorporate it into the method, and see whether it has any effect.

Chandramani05 commented 1 year ago

Progress Report :

I also tested backtesting, which is similar to cross-validation. I used 5-fold cross-validation and got the following result. The error seems to be reduced a little, and it seems like each window is trying to fit to its initial input.

My next step is to learn how feature importance works here and whether I can adjust the lag values somehow.
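
For reference, a minimal sketch of time-ordered 5-fold backtesting, assuming sklearn's `TimeSeriesSplit` (which always trains on the past and tests on the future) and a synthetic series with lagged features. This is not the code from the notebook, and the model settings are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic series, for illustration only.
rng = np.random.default_rng(0)
n_samples, n_lags = 300, 5
series = rng.normal(size=n_samples + n_lags)

# Row i holds the n_lags values that precede the target y[i].
X = np.column_stack([series[k : k + n_samples] for k in range(n_lags)])
y = series[n_lags:]

errors = []
folds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Refit from scratch on each expanding training window.
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    errors.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    folds.append((train_idx, test_idx))
```

Unlike a shuffled K-fold split, every test fold here comes strictly after its training data, which avoids leaking future information into the fit.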

gpsaggese commented 1 year ago

@Chandramani05 results are weird, I think there is a problem. Can you pls do a PR with all the notebooks so I can take a look?

We can talk about it in the next meeting. Sending invite shortly.

Chandramani05 commented 1 year ago

Sure, Professor. Here is the link to the notebook: https://drive.google.com/drive/folders/1GiT9EHILVCF_7QqOZFw9k5J8jrbANIPz

Let me know what I can do to improve.

Thanks!!

Chandramani05 commented 1 year ago

Hello @gpsaggese, please make a separate branch for the ML model research so that we can push the code.

gpsaggese commented 1 year ago

Not sure I follow.

You can create a branch and then a dir like 'SorrIssueXYZ_...' under https://github.com/sorrentum/sorrentum/tree/master/sorrentum_sandbox/examples/ml_projects

If you need some help with Git you can check https://github.com/gpsaggese/umd_data605/blob/main/lectures/02.1%20-%20Git%2C%20Data%20Pipelines.pdf and / or ask @samarth9008

gpsaggese commented 1 year ago

Also I've improved the doc https://github.com/sorrentum/sorrentum/wiki/Organization-and-procedures#how-to-organize-your-research

Let me know if things are not clear

shyanne399 commented 1 year ago

Hi professor, I am Xiaoyan and I am working on the spread. I have two questions.

Dataset: I want to confirm whether I should be working with the '1-Sec Orderbook (aka bid/ask)' dataset.
Problem running this dataset: I get a 'zsh killed' error every time I run my code and try to extract the timestamp, bid price, year, etc. I am guessing that the dataset is too big for my computer (a Mac with an M1 chip) to load. Do you have any ideas about how to solve this problem?

gpsaggese commented 1 year ago

hi @shyanne399 you need to read a slice of the data by coin or by timestamp, since it's too much data

Parquet allows you to do that. You can look at this tutorial to learn more about it: https://github.com/gpsaggese/umd_data605/blob/main/tutorials/tutorial_packages/tutorial_parquet.ipynb

shyanne399 commented 1 year ago

Hi professor, It’s very helpful. I will learn it as soon as possible.

Thanks and hope you have a nice day, Xiaoyan

DanilYachmenev commented 1 year ago

probably obsolete, moving to P1 for now

Formalize a financial prediction problem as supervised ML #39
