shokru / mlfactor.github.io

Website dedicated to a book on machine learning for factor investing

Data for ml factor analysis #2

Open MislavSag opened 4 years ago

MislavSag commented 4 years ago

I started to read your book. I have finished chapters 1 and 2. In the book, you use the following data:

This dataset comprises information on 1,207 stocks listed in the US (possibly originating from Canada or Mexico). The time range starts in November 1998 and ends in March 2019. For each point in time, 93 characteristics describe the firms in the sample.

This data is anonymized: we don't know which stock each id represents.

It would be very helpful if you could give some tips in the book on how to get data in the first place, especially for beginners in ML factor analysis (like me) who don't have any data yet. That would be the first step for anyone who wants to follow your analysis with real stocks.

I even have a subscription to Interactive Brokers, but they only provide annual financial statements, not quarterly ones.

In a nutshell, do you have any suggestions on how to obtain good data for ML factor analysis (good quality, as cheap as possible, going back as far as possible)?

shokru commented 4 years ago

Dear Mislav, sadly, good-quality data is not free.

At the very beginning of Section 5.1, we list a few data providers but their services can be quite expensive. Also, because of that, they do not allow their subscribers to download data and give it away for free. This is why the data we provide is anonymous: so that it is impossible to trace its origin - we don't want any problems!

Price data is easy to obtain via Yahoo Finance (e.g. with the quantmod package) or other providers like Alphavantage and Tiingo, for which Matt Dancho has created R interfaces. For more "exotic" information, however, I do not know of publicly available data, though the R package edgar may be one way to circumvent this issue (via SEC filings).
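For price data, a minimal sketch with quantmod could look like this (the tickers and start date are arbitrary examples, not the anonymized stocks of the book):

```r
# Minimal sketch: monthly returns from Yahoo Finance adjusted closes via quantmod
# (tickers and start date are illustrative only)
library(quantmod)

tickers <- c("AAPL", "MSFT", "XOM")
returns <- lapply(tickers, function(tk) {
  p <- getSymbols(tk, src = "yahoo", from = "1999-01-01", auto.assign = FALSE)
  monthlyReturn(Ad(p))              # monthly returns from adjusted closes
})
returns <- do.call(merge, returns)  # one xts column per ticker
colnames(returns) <- tickers
head(returns)
```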

I know of one other (bigger) dataset, which you can find here: https://dachxiu.chicagobooth.edu (search for "empirical data"), but the returns are not given and the subscription to the corresponding service is prohibitively priced.

One option is to scrape the web for quarterly statements; I've seen a few tutorials in both R and Python. One example: https://www.linkedin.com/pulse/using-r-easily-bulk-scrape-financial-statements-matt-lunkes/ with code: https://github.com/MattLunkes/R_financial_scraper/blob/master/Easy_financial_scraper_quantmod.R (but it's a bit old & I haven't tested it).

In short: I'm sorry to disappoint, but there is no simple, cheap solution.

MislavSag commented 4 years ago

Could you explain what kind of data we need for factor analysis? If I understood correctly, we need quarterly financial statements and monthly price data (from which we can additionally compute all kinds of indicators). Then we can compute all kinds of ratios like P/E, P/B, EPS, etc. But some ratios are only available on a quarterly basis (like financial leverage). So, if we want monthly data, we can use the ratios that include prices; if we want quarterly data, we can add ratios from the financial statements that don't include prices (like financial leverage). Is that correct?

I got annual data from IB, but it's not enough. I will try to contact them to see if they can provide historical quarterly data (fundamental ratios and financial statements).

How far back in the past do we have to go if we have a large stock universe (let's say 2,000 stocks)?

shokru commented 4 years ago

The more data, the better, obviously - roughly speaking. The minimum is indeed accounting data, which is released at best at a quarterly frequency. But as we say in Chapter 5, if, at the stock level, the time series are smooth (e.g. earnings, total assets, etc.), then you can always carry a past value forward for two months in a row (say you take the January value for February and March). The richness of the signal comes from differences of values in the cross-section, so this first-order approximation is really no big deal, as long as you do have some additional features that are indeed sampled monthly. Some people prefer to impute with the cross-sectional median, but I prefer past values, as long as they make sense. Again: see Chapter 5.
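A minimal sketch of this forward-fill with a data.table rolling join (the tables and column names below are made up for illustration):

```r
# Sketch: carry the latest quarterly accounting value forward to monthly dates
# (toy data; in practice 'quarterly' would hold the accounting variables per stock)
library(data.table)

quarterly <- data.table(
  stock_id = "A",
  date     = as.Date(c("2019-03-31", "2019-06-30")),
  earnings = c(1.0, 1.2)
)
monthly <- data.table(
  stock_id = "A",
  date     = as.Date(c("2019-03-31", "2019-04-30", "2019-05-31",
                       "2019-06-30", "2019-07-31", "2019-08-31"))
)

setkey(quarterly, stock_id, date)
setkey(monthly, stock_id, date)

# Rolling join: for each monthly date, take the most recent available quarterly value
monthly_panel <- quarterly[monthly, roll = TRUE]
monthly_panel$earnings   # 1.0 1.0 1.0 1.2 1.2 1.2
```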

You can also add risk measures, like volatility, which is not too hard to get; market beta likewise. One easy family is momentum, since it's price-based. Sentiment... hard to tell, and sometimes expensive. Also, using macroeconomic variables can help expand the width of the feature set.

Of course, the deeper the dataset, the better. Basically, you are going to need a minimum of 5 years to train your first model, but ideally more (10 years). And then you will roll (or expand) the window forward, for at least 10 years, to have a sufficient testing history. So my advice is 15-20 years minimum in total, preferably more. The problem is that the further you go back in the past, the scarcer the data becomes. It's a tradeoff: I prefer having a higher proportion of well-defined features (so that data imputation is less intensive) over going back 40 years in the past.

MislavSag commented 4 years ago

Thank you for such a detailed answer, it's very helpful. It wouldn't hurt to copy-paste this answer into the book (maybe as a Q&A or something similar).

As you said, I will try to get data for the span 2004-2019 at a minimum, or 1999-2019 at best. I don't think I will find data for older periods. Additionally, over a long time span the panel would probably be very unbalanced, so maybe it wouldn't be so helpful, but I'm not sure.

I am reading Chapter 4 now. I feel like I would need 10 years to read all these articles in depth :) Great review, especially because it is up to date and includes recent papers.

Maybe I will come back with additional questions. I think this can be closed.

MislavSag commented 3 years ago

@shokru ,

I am trying to compute all the data from Chapter 17 using SimFin+ subscription data.

For the Mom_11M_Usd indicator, the description says:

price momentum 12 - 1 month in USD

Does that mean the close price in, say, December minus the close price in January? Or is it a percentage change? I would say it is a percentage change, but it says it is in USD?

shokru commented 3 years ago

Hi Mislav,

First, a bit of history. The original momentum paper is called "Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency" and it has had a huge impact on both the literature and practitioners (easy to find online). The idea is that past performance has some kind of persistence. Subsequently, other researchers started digging into what "past performance" really is, that is: over which period of time you compute the returns (see "Is Momentum Really Momentum?" by Novy-Marx).

The most common (accepted) definition of momentum is the return computed as P_{t-1}/P_{t-12} - 1, where t is the current date and the data is sampled monthly (P is the price, naturally). Normally, you would expect to take P_t/P_{t-12} - 1, which is the full annual return, but because of the weak (negative) autocorrelation of the most recent month, it is considered more efficient to omit the return of the past month.

MislavSag commented 3 years ago

Thanks for the detailed explanation. So, for monthly data, that would be:

(data.table::shift(close, 1) / data.table::shift(close, 12)) - 1

Is Vol1Y_Usd computed from monthly or daily observations? Is it a rolling-window volatility?

shokru commented 3 years ago

Yes, the formula seems correct, though I guess the outer brackets are superfluous. For the volatility, you would normally take the standard deviation of daily returns over the past year if you have them; otherwise monthly returns, but that is less robust. Since the data is ultimately sampled monthly, two consecutive volatility estimates have an 11-month overlap.
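As a rough sketch with data.table and simulated prices (everything below is illustrative, not the actual Vol1Y_Usd construction):

```r
# Sketch: rolling 1-year (252 trading days) volatility from daily returns,
# then kept once per month. Simulated prices, illustrative only.
library(data.table)
set.seed(42)

daily <- data.table(
  stock_id = "A",
  date     = seq(as.Date("2017-01-02"), by = "day", length.out = 600),
  close    = 100 * cumprod(1 + rnorm(600, sd = 0.01))
)

daily[, ret := close / shift(close) - 1, by = stock_id]                             # daily returns
daily[, vol_1y := frollapply(ret, n = 252, FUN = sd, na.rm = TRUE), by = stock_id]  # rolling sd
# multiply by sqrt(252) if an annualized figure is preferred

# keep one estimate per month (last available day), e.g. to merge with a monthly panel
vol_monthly <- daily[, .SD[.N], by = .(stock_id, month = format(date, "%Y-%m"))]
```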

MislavSag commented 3 years ago

I have daily data for all stocks, so I can compute volatility from daily data.

I can send you the code when I finish. It could serve as an example of how to construct data for factor ML analysis.

immortal678 commented 3 years ago

My question is related to the uniformization of data. Unfortunately, I do not grasp the concept of regularization in the cross-section. How does it actually work? Is it that, for a specific date, we uniformize an indicator (e.g. the price-to-book ratio) across different firms? The credit spread example you gave in the solution of Chapter 4 works by grouping the data by dates. I have normalized my own collected data as per your suggestion, but dates and stock ids are also among the significant features chosen by the penalized regression. According to my understanding, that is not conceptually correct in a cross-sectional study. If you could please elaborate with a working example, that would be great!

shokru commented 3 years ago

Let's say you want to explain future returns with 2 variables: market capitalization and past returns (momentum). Your dataset columns will look like: | date | stock_id | mkt_cap | past_return | future_return |

So first of all: dates & stock ids are NOT predictors, so you should leave them out. If you keep them, you are essentially using a panel approach, which means that you allow for trends across dates and across asset ids. In a simple regression:

future_return = a + b*date + c*stock_id + d*mkt_cap + f*past_return + error (time & stock indices are omitted)

this means that date & stock_id impact future returns, which does not make any sense econometrically. Of course, you could build one model for each stock separately, but that is another issue; in the book, models are common to all stocks.

Ok so now we are in the setting of the book, where the model would be something like (in linear regression format):

future_return = a + b*mkt_cap + c*past_return + error
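Purely as an illustration, in lm() form and with made-up toy data, the two specifications could be written as follows (the first treats date and stock_id as predictors and should be avoided; the second is the setting of the book):

```r
# Toy panel, made up for illustration: 3 stocks observed at 4 month-ends
set.seed(1)
dates  <- as.Date(c("2019-01-31", "2019-02-28", "2019-03-31", "2019-04-30"))
stocks <- c("A", "B", "C")
panel  <- data.frame(
  date          = rep(dates, each = length(stocks)),
  stock_id      = rep(stocks, times = length(dates)),
  mkt_cap       = runif(12, 1e9, 2e11),   # in dollars: huge scale
  past_return   = rnorm(12, sd = 0.05),   # unitless: small scale
  future_return = rnorm(12, sd = 0.05)
)

# Panel-style specification: date & stock_id enter as predictors -- to be avoided here
bad_fit  <- lm(future_return ~ as.numeric(date) + stock_id + mkt_cap + past_return, data = panel)

# Specification of the book: one model, common to all stocks, characteristics only
good_fit <- lm(future_return ~ mkt_cap + past_return, data = panel)
```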

Now, there is a big issue, because mkt_cap and past returns really don't have the same scale: the first is measured in billions of dollars, while the second has no unit. We thus need to homogenize them, because some models (e.g. neural networks) behave much better when predictors have similar (small) scales. The simplest way to do that is to rank firms.

So, at each date, we process the data so that mkt_cap is equal to 0 for the smallest firm and to 1 for the largest firm, and the same for past returns: 0 for the smallest return and 1 for the largest one. This is equivalent to computing the rank of each stock (starting at zero) and then dividing by the number of stocks minus one. Some people normalize between -0.5 and +0.5, or -1 and +1. After this step, all variables are comparable in magnitude, so we can proceed.
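Continuing with the toy panel from the sketch above, one way to code this date-by-date rank uniformization with data.table (column names are still illustrative):

```r
# Uniformize each predictor within each date: 0 = smallest value, 1 = largest value
library(data.table)
setDT(panel)

norm_unif <- function(x) (frank(x, ties.method = "average") - 1) / (length(x) - 1)

panel[, c("mkt_cap_u", "past_return_u") := lapply(.SD, norm_unif),
      by = date, .SDcols = c("mkt_cap", "past_return")]
```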

In the exercise, the credit spread at date t is multiplied by all other predictor values; this changes the distribution of values, so we normalize again after the product. This ensures that, once more, the predictors at each date have a uniform distribution across all stocks (from 0 for the smallest value to 1 for the largest value).

MislavSag commented 3 years ago

In the TSPred package there is a function for adaptive normalization of nonstationary data (the function an). Maybe that will help.

immortal678 commented 3 years ago

Thank you for your detailed answer! I will look into the package. Thank you for the recommendation!