shokru / mlfactor.github.io

Website dedicated to a book on machine learning for factor investing

Trouble replicating table 3.2 #76

Closed Bazman76 closed 12 months ago

Bazman76 commented 12 months ago

http://www.mlfactor.com/factor.html

http://www.mlfactor.com/chap_3.html

I am trying to replicate chapter 3 using Python.

Table 3.1 is slightly different in the R version of your book vs the python version of your book.

The table I get in Python matches the figures in the R version of table 3.1 exactly.

Table 3.2 is very different between your R version and your python version.

When I calculate table 3.2 I do not get the figures shown in either your R version or your python version.

The most likely cause of the difference from the R version of table 3.2 is the returns DataFrame used below:

data_FM = pd.merge(returns.iloc[:,0].reset_index(),FF_factors.iloc[:,0:7],how='left', on=['date'])

I calculated returns exactly as is done in chapter 1, but the dates used in chapter 1 do not match the dates used in cell 4 of chapter 3 to select the FF factors:

chapter 1 cell 7

http://www.mlfactor.com/chap_1.html

idx_date=data_raw.index[(data_raw['date'] > '1999-12-31') & (data_raw['date'] < '2019-01-01')].tolist()

vs

chapter 3 cell 4

min_date = '1963-07-31'
max_date = '2020-03-28'
idx_ff = df_ff.index[(df_ff['date'] >= min_date) & (df_ff['date'] <= max_date)].tolist()

Can you tell me how to calculate returns so that I can match the betas shown in table 3.2 of the R version of the book, given that I already match table 3.1 in that book?
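One way to check whether the differing date windows matter is to look at what a left merge on `date` keeps. The sketch below uses made-up data; the column names `stock_1` and `MKT_RF` and all values are hypothetical, not taken from the book's dataset. It only illustrates that `how='left'` keeps the dates of the left (returns) frame, so a wider factor window should not by itself change the merged sample, provided it covers every return date.

```python
import pandas as pd

# Hypothetical returns for one stock over a short window
returns_col = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31"]),
    "stock_1": [0.01, -0.02, 0.03],
})

# Hypothetical factor series over a wider window
ff = pd.DataFrame({
    "date": pd.to_datetime(["1999-11-30", "1999-12-31", "2000-01-31",
                            "2000-02-29", "2000-03-31", "2000-04-30"]),
    "MKT_RF": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# Left merge: one row per returns date; extra factor dates are dropped
merged = pd.merge(returns_col, ff, how="left", on="date")
print(merged)
```

If a return date had no matching factor date, the factor columns on that row would be NaN rather than the row being dropped.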

shokru commented 12 months ago

Dear Bazman, some differences can occur between the two versions. If you already have the values of Table 3.1, it's clear that it's the second pass that is problematic. In the second pass, you must regress the returns against all betas (+ constant) on a date-by-date basis. I wonder if, in the online notebooks, the .shift(-1) in the loop might be the origin of this change... Let me know.
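For readers following along, the two passes described here can be sketched as follows. This is a minimal illustration on synthetic, noise-free data, not the book's code: the function name `fama_macbeth`, the array shapes, and the data are all assumptions made for the example. The first pass regresses each asset's return series on the factors to get betas; the second pass regresses, date by date, the cross-section of returns on those betas plus a constant.

```python
import numpy as np

def fama_macbeth(returns, factors):
    """Two-pass sketch. returns: (T, N) array, factors: (T, K) array.
    Gives back betas of shape (N, K) and gammas of shape (T, K + 1)."""
    T, N = returns.shape
    # First pass: time-series regression of every asset on constant + factors
    X = np.column_stack([np.ones(T), factors])              # (T, K+1)
    coef, *_ = np.linalg.lstsq(X, returns, rcond=None)      # (K+1, N)
    betas = coef[1:].T                                      # drop intercepts
    # Second pass: for each date, regress the N returns on constant + betas
    Z = np.column_stack([np.ones(N), betas])                # (N, K+1)
    gammas, *_ = np.linalg.lstsq(Z, returns.T, rcond=None)  # (K+1, T)
    return betas, gammas.T

# Synthetic data with known betas and no noise (hypothetical sizes)
rng = np.random.default_rng(0)
T, N, K = 120, 30, 3
factors = rng.normal(size=(T, K))
true_betas = rng.normal(size=(N, K))
returns = factors @ true_betas.T

betas, gammas = fama_macbeth(returns, factors)
print(np.allclose(betas, true_betas))  # first pass recovers the betas here
```

Because the synthetic returns are exactly linear in the factors, the second-pass slopes on each date equal that date's factor values. With real data, the gammas would be noisy estimates that are then averaged over time.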

Bazman76 commented 12 months ago

Thanks for getting back so promptly.

Table 3.1 is simply the factor values as supplied by Fama French (but even here there are small differences between these values in your R version and your python version).

Table 3.2 contains the betas from the first pass and this is where the problem lies.

I will try altering the shift, but in the meantime can you confirm that the returns DataFrame used to calculate the betas is exactly the same as the one calculated in the last line of chapter 1?

returns=data_ml[is_stock_ids_short].pivot(index='date',columns='stock_id',values='R1M_Usd')

shokru commented 12 months ago

This is a tricky question. I've only coded the R version and Tony is pretty busy at the moment. The fact that we arrive at different results is not surprising: we did things slightly differently. This is common in empirical research, see: https://papers.ssrn.com/sol3/results.cfm. The idea of the book is not for you to reproduce exactly what we have done but for you to play with the code and explore the many aspects of ML & data science in the space of factor investing. I'm really sorry, I do not have time to re-run everything, as I am taken up by many other projects at the moment...

Bazman76 commented 12 months ago

You were right: I switched the .shift(-1) to .shift(+1), and it now matches the betas in table 3.2 of the R version of the book exactly.
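To illustrate why the sign of the shift can change the betas, here is a small sketch on synthetic data. It assumes, purely for illustration, that the return column is forward-looking (the row at date t holds the return realized over the following period) and that the true beta is 2; the names `factor`, `fwd_ret`, and `slope` are hypothetical and not from the notebooks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
factor = pd.Series(rng.normal(size=60), name="factor")
realized = 2.0 * factor        # hypothetical: realized return with beta = 2
fwd_ret = realized.shift(-1)   # forward return: row t holds the t+1 return

def slope(y, x):
    """OLS slope of y on x after dropping rows with missing values."""
    both = pd.concat([y, x], axis=1).dropna()
    return np.polyfit(both.iloc[:, 1], both.iloc[:, 0], 1)[0]

# shift(+1) moves each forward return back onto the date it was realized
beta_plus = slope(fwd_ret.shift(1), factor)
# shift(-1) moves it one more step away from the matching factor value
beta_minus = slope(fwd_ret.shift(-1), factor)

print(beta_plus, beta_minus)
```

Under these assumptions `.shift(+1)` lines up each return with the factor value of the same period and the regression recovers the beta, while `.shift(-1)` pairs the factor with a return two periods ahead and gives an unrelated slope.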

Thanks again for your prompt response and for producing such a well-written book supported by Jupyter notebooks. All finance books should be written this way, imho.