shokru / mlfactor.github.io

Website dedicated to a book on machine learning for factor investing

Outliers #64

Open MislavSag opened 3 years ago

MislavSag commented 3 years ago

I am trying to set up data for analysis using FMP Cloud data. I am not sure how to remove outliers from my data.

In the book you recommend winsorization:

"The winsorization stage must be performed on a feature-by-feature and a date-by-date basis. However, keeping a time series perspective is also useful."

If I understand correctly, with data.table this procedure implies:

feature_set <- colnames(DT)[5:ncol(DT)]
# Winsorize here is assumed to be DescTools::Winsorize
DT <- DT[, (feature_set) := lapply(.SD, Winsorize, probs = c(0.05, 0.95)), by = .(date), .SDcols = feature_set]   # across dates (cross-section)
DT <- DT[, (feature_set) := lapply(.SD, Winsorize, probs = c(0.05, 0.95)), by = .(symbol), .SDcols = feature_set] # across time (per symbol)

But I am not sure this is the right way. For example, let's say we have a market-cap feature. There is always one firm with the highest market cap. If we winsorize, we will always replace the market cap of the biggest firm with the market cap of the firm at the 99th percentile. But this is not due to incorrect data or outliers.

A similar issue arises in the time dimension. If EPS, for example, grows through time, we would replace the highest EPS with the 99th percentile even if the data is not wrong.

shokru commented 3 years ago

Hi Mislav! I'm afraid I wasn't clear enough in the book...

cross-section: take a fixed date and a fixed characteristic, say size (market cap). What you want is to make sure you don't have a firm with a $10^20 market cap due to data errors, or a firm 100 times smaller than the second smallest firm. So winsorization is not too bad. Take Apple: imagine it has a $900B cap and the 99th percentile is at $600B (above Amazon, Google, etc.). Replacing $900B with $600B is a big mistake in absolute value, but for the models it's not a big deal. In fact, if you uniformize afterwards, the impact is marginal.
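The claim that winsorization barely matters once you uniformize can be checked with a toy sketch (not the book's code; the numbers are made up, and percent_rank is re-implemented in base R to avoid a package dependency):

```r
# Winsorization preserves the ordering of values, so rank-based
# uniformization is (almost) unchanged whether or not you winsorize first.
percent_rank_base <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)

x <- c(1:99, 1e9)                         # one huge "Apple-like" outlier
caps <- quantile(x, probs = c(0.01, 0.99))
w <- pmin(pmax(x, caps[1]), caps[2])      # manual 1%/99% winsorization

u_raw  <- percent_rank_base(x)
u_wins <- percent_rank_base(w)
all(u_raw == u_wins)                      # TRUE: the uniformized values coincide
```

Ties at the caps can shift the tail ranks slightly, but for interior values the two uniformizations are identical.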

time-series: here the purpose is to detect errors in the data. Imagine the series of Apple, which, for some dates, is divided by 1,000. This is a big deal, and it is important to make sure that market-cap figures behave relatively smoothly in time (monthly variations, except in bankruptcy scenarios and taking splits into account, should not lie outside -80% and +200%). However, you should NOT winsorize in the time series!

So in your code, the first line is ok, but not the second. Time-series outliers must be checked beforehand, and the purpose is to make sure you are confident in your values. In this case, if you have "crazy" values, you should replace them with the last "correct" value.
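A minimal sketch of such a time-series check, using data.table as in the thread (the symbol, dates, and market-cap figures are hypothetical, and the -80%/+200% band is taken from the comment above):

```r
library(data.table)

# Hypothetical monthly market caps (in $B) with one bad point: the third
# value was accidentally divided by 1,000 at the data source.
DT <- data.table(
  symbol = "AAPL",
  date   = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  mktcap = c(900, 910, 0.91, 930, 940, 950)
)

# Flag monthly variations outside the plausible [-80%, +200%] band and
# carry the last "correct" value forward. Each point is compared to the
# last accepted value, so a single error is not flagged twice.
clean <- DT$mktcap
for (i in 2:length(clean)) {
  r <- clean[i] / clean[i - 1] - 1
  if (r < -0.8 || r > 2) clean[i] <- clean[i - 1]
}
DT[, mktcap_clean := clean]
```

With more than one symbol, the same loop would run per group (e.g. inside `by = symbol`); this is only a sketch of the idea, not a production cleaner.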

MislavSag commented 3 years ago

Thanks for the insightful reply. In the end I removed outliers only across dates. I used percent_rank for uniformization because I saw you mention that function in some of the issue pages.

Then I checked the distribution (histogram) of the variables, but it doesn't look as uniform as yours for dividend yield in the book (this is an example for the P/B ratio): [histogram screenshot]

It has peaks at the beginning and at the end of the distribution. Is this normal, or should it be exactly uniform?

The code for uniformization I used:

DT <- DT[, (features_set) := lapply(.SD, function(x) percent_rank(x)), by = date, .SDcols = features_set] # across dates

shokru commented 3 years ago

Hi Mislav,

yes, this often happens when there are "atoms" at the beginning and the end of the distribution; I guess in your case they come from the prior winsorization. One thing you could do is uniformize (via percent_rank) without winsorizing at all. Uniformization will automatically crush outliers by putting them close to 0 or 1... It's not a big deal if it's not perfectly uniform.
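To see how rank-uniformization crushes outliers on its own, here is a toy example with dplyr's percent_rank (the same function used in the code above; the data are made up):

```r
library(dplyr)  # for percent_rank()

x <- c(1:99, 1e9)     # one extreme outlier, e.g. a data error
u <- percent_rank(x)  # (min_rank - 1) / (n - 1), mapped into [0, 1]

u[100]                # the outlier lands at exactly 1; its magnitude is irrelevant
```

Because only ranks matter, no prior winsorization is needed to tame the $10^9 value, and (absent ties) the resulting distribution is an even grid on [0, 1].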

MislavSag commented 3 years ago

Thanks, all clear now.