shokru / mlfactor.github.io

Website dedicated to a book on machine learning for factor investing

Dataset: Predictor distribution #52

Closed csrvermaak closed 3 years ago

csrvermaak commented 3 years ago

Many thanks for what seems to be a great book on the topic, I actually bought a hard copy for myself.

Question - why is the sample dataset uniformized? Most characteristics in the cross-section are more or less normally distributed, or log-normally in some instances (Market Cap), obviously with outliers. The standard approach (Grinold and Kahn) would be to z-score the feature distributions and winsorize outliers. This retains the information density in the distributions, whereas (imo) the uniformization of the predictors loses the granularity in the centre of the distribution. I am new to ML, so I may be missing something foundational.
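For clarity, the kind of treatment I have in mind is something along these lines (a rough sketch in R; the data frame and column names are placeholders):

```r
library(dplyr)

# Winsorize a raw characteristic at the 1st/99th percentiles, then z-score it
# within each date's cross-section (the Grinold & Kahn style treatment).
zscore_winsor <- function(x, p = 0.01) {
  lo <- quantile(x, p, na.rm = TRUE)
  hi <- quantile(x, 1 - p, na.rm = TRUE)
  x  <- pmin(pmax(x, lo), hi)                          # clip the outliers
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)    # cross-sectional z-score
}

data_zs <- data %>%
  group_by(date) %>%                                   # one cross-section per date
  mutate(mkt_cap_z = zscore_winsor(mkt_cap)) %>%       # placeholder column name
  ungroup()
```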

Your thoughts please?

shokru commented 3 years ago

This is a very good question. One naive answer would be that it's common practice in the ML workflow (e.g. in convolutional networks for image processing). Relatedly, it is also true that neural networks work well when the features remain in the [-1,1] interval. But this does not hold for other families of techniques, like trees. In Section 4.4.2, we mention a few recent references in asset pricing that uniformize the predictors. It is now standard practice among practitioners as well; I have seen a few more papers processing their data similarly.

A more refined answer could be that we do not want the individual distribution of each feature to impact the models, so we coerce them to uniformity because it is the simplest choice. This is convenient because at any point in time, the set of predictors will always have the same uniform marginals. Depending on macro events, this may not be true for z-scoring. For instance, recently, the largest firms have drifted further to the right of the distribution; they are outliers, and important ones. The density of market cap changes through time... is that something we want in our models? The question is open, of course.
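To illustrate the stability point, here is a minimal sketch in R (assuming a data frame `data` with a `date` column and a character vector `features` of predictor names, both placeholders):

```r
library(dplyr)

# Replace each predictor by its scaled cross-sectional rank at every date.
# Whatever the raw distribution looks like on a given date (skewed market caps,
# fat tails, etc.), the transformed feature is uniform on (0,1] in every period.
data_unif <- data %>%
  group_by(date) %>%
  mutate(across(all_of(features), ~ rank(.x) / length(.x))) %>%
  ungroup()
```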

Another possibility is that asset pricing factors are often defined by distribution thresholds. For instance: long the top 30% of the feature (size, book-to-market, etc.) and short the bottom 30% (or the other way around). Of course, the thresholds are arbitrary; they could be set at 50% or 10%. Thus, it is the location in the distribution that matters, not the raw values. This is what we get with uniformization: a processing that somehow mimics the way simple univariate factors are built.
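As an illustration (with `x_unif` a uniformized predictor, name purely illustrative), the threshold rule boils down to comparisons against fixed cutoffs:

```r
library(dplyr)

# With uniformized predictors, "top 30% / bottom 30%" portfolios are just
# comparisons against 0.7 and 0.3 -- no quantiles to re-estimate at each date.
positions <- data_unif %>%
  mutate(position = case_when(
    x_unif >= 0.7 ~  1,   # long leg: top 30% of the cross-section
    x_unif <= 0.3 ~ -1,   # short leg: bottom 30%
    TRUE          ~  0    # the bulk of the distribution is ignored
  ))
```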

But overall, I agree with you that you do lose a bit of information. My intuition is that it's not a big loss and that it is outweighed by the convenience of distributional stability in the features. To the best of my knowledge, no research has been published on that topic. Given the importance of data in the process, I agree that it is probably overlooked. Data processing is tedious and not intellectually rewarding, but it is key!

Thank you very much for your interest & don't hesitate if you have further questions.

csrvermaak commented 3 years ago

Many thanks for your reply, Guillaume. I understand what you are saying about location. Perhaps another perspective is that the normalization is, in essence, forcing the features to be robust at the expense of information density by keeping only location, in the same sense as a Spearman vs Pearson cross-sectional correlation. Would you agree?
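In code, the analogy I have in mind is simply that a Spearman correlation is a Pearson correlation computed on ranks (a toy example with simulated data):

```r
# Spearman correlation on raw values equals Pearson correlation on their ranks:
# only the position of each observation in the cross-section matters.
set.seed(42)
x <- rlnorm(1000)                  # a skewed, market-cap-like variable
y <- 0.5 * log(x) + rnorm(1000)    # noisy and monotonically related to x

cor(x, y, method = "spearman")     # rank-based correlation
cor(rank(x), rank(y))              # Pearson on ranks -- identical value
```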

If that is the case, I would assume the data is transformed from the original (as downloaded from the provider) to uniform via the percent_rank() function (rank, then divide by count).

Finally, a few comments on the labels/return data. Are the RXM_Usd columns excess returns (has the effect of the market been removed), or are they unaltered actual stock returns?

I notice that the RXM_Usd columns are not uniformized. Given the noise on the return side and the impact of return drivers not captured by the features (idiosyncratic, market, macro), I would probably have expected the labels to be uniformized as well, as it might be easier to forecast location than actual return numbers.

shokru commented 3 years ago

Yes, I think robustness is key: you don't want your model/predictions to go crazy if some inputs change for some (macroeconomic) reason. In the book, we define our own function, but yes, I think you can use percent_rank(). In the dataset, the returns are raw. Perhaps strangely, I like to keep this information unaltered (it's pretty important), and it's also the quantity that is used to assess performance (though of course there could be several versions of returns...). But you are right: if location (in the cross-section) were what mattered most, then uniformizing the labels would make sense.
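For reference, a minimal uniformizing function could look like this (hypothetical name, not necessarily the code used in the book), next to dplyr's percent_rank():

```r
library(dplyr)

# Scaled-rank transform: values land in (0,1].
to_unif <- function(v) rank(v) / length(v)

# percent_rank() rescales ranks to [0,1] instead: (rank(v) - 1) / (n - 1).
# Both keep only the position of each observation in the cross-section.
v <- c(5, 1, 3, 10, 2)
to_unif(v)         # 0.80 0.20 0.60 1.00 0.40
percent_rank(v)    # 0.75 0.00 0.50 1.00 0.25
```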

In the book, we don't spend too much time on this topic, but if you want a more technical view, I suggest you have a look at our paper https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3403009. We argue that, in fact, it's not the whole cross-section that matters, but the extreme returns. In short, we give the model only the best and worst returns (top/bottom 20%) to help the algorithm learn what matters most; returns in the bulk of the distribution carry less information.
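In spirit, the filtering step can be sketched as below (the thresholds and the return column name are placeholders; see the paper for the actual protocol):

```r
library(dplyr)

# Keep only the observations whose forward return lies in the extreme quintiles
# of its date's cross-section; the bulk of the return distribution is dropped
# from the training sample.
train_extreme <- data %>%
  group_by(date) %>%
  mutate(ret_rank = percent_rank(RXM_Usd)) %>%   # placeholder for the chosen return horizon
  ungroup() %>%
  filter(ret_rank >= 0.8 | ret_rank <= 0.2)
```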