quantopian / alphalens

Performance analysis of predictive (alpha) stock factors
http://quantopian.github.io/alphalens
Apache License 2.0

ENH: zero aware quantiles #306

Closed · eigenfoo closed this 6 years ago

eigenfoo commented 6 years ago

Closes #304 and #281.

This adds the functionality to compute quantile buckets separately for positive and negative signal values. This is useful if your signal/factor is centered around zero, with zero being the separation between long and short signals.

A separate parameter, called zero_aware, has been added to the quantize_factor function. If True, alphalens will compute the quantiles while being aware of the positive-negative split.
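For reference, the core idea can be sketched roughly as follows (a simplified illustration, not the exact PR code; zero_aware_qcut is a hypothetical helper name):

import pandas as pd

def zero_aware_qcut(factor, quantiles):
    # Hypothetical sketch: bucket the negative and non-negative values
    # separately, so zero always falls on the boundary between the lower
    # and upper halves of the buckets. Assumes `quantiles` is an even
    # integer and `factor` is a pd.Series containing both signs.
    half = quantiles // 2
    neg = pd.qcut(factor[factor < 0], half, labels=False) + 1          # 1 .. half
    pos = pd.qcut(factor[factor >= 0], half, labels=False) + half + 1  # half+1 .. quantiles
    return pd.concat([neg, pos]).sort_index()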

eigenfoo commented 6 years ago

It looks like I'm relying on something that's only supported in pandas v0.20 or higher: the tests against those versions pass, but the rest don't. I'm not sure what it is yet, but I'll take a look.

luca-s commented 6 years ago

@twiecki don't you want the zero_aware flag to be exposed in get_clean_factor_and_forward_returns as well?

@eigenfoo one suggestion: since it takes almost no effort, why don't we add the zero_aware logic to the bins option too? As a user I would expect the flag to work with both bins and quantiles.
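If the flag were plumbed through, the user-facing call might look roughly like this (illustrative only; the real get_clean_factor_and_forward_returns signature takes more parameters, and whether it accepts zero_aware depends on this PR being extended):

from alphalens.utils import get_clean_factor_and_forward_returns

# Hypothetical usage assuming zero_aware gets exposed here as well;
# `factor` is a (date, asset) MultiIndex Series and `prices` a wide
# DataFrame of prices, as usual for alphalens.
factor_data = get_clean_factor_and_forward_returns(
    factor, prices, quantiles=10, zero_aware=True)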

luca-s commented 6 years ago

@twiecki @eigenfoo I am not sure if you agree, but I would expect the binning to be symmetrical on positive and negative values. So when performing the binning on the negative values, we should compute the absolute values of those, that is:

neg_quantiles = pd.qcut(x[x < 0].abs(), _quantiles // 2, labels=False) + 1

And then the quantile numbers have to be reversed.
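In other words, something along these lines (a sketch of the symmetric variant being discussed here, not code from the PR):

half = _quantiles // 2
# Bin the magnitudes of the negative values...
neg_quantiles = pd.qcut(x[x < 0].abs(), half, labels=False) + 1
# ...then reverse the labels so the most negative values end up in
# quantile 1 and the values closest to zero in quantile `half`.
neg_quantiles = half + 1 - neg_quantiles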

eigenfoo commented 6 years ago

All good points! I'll take care of those.

@twiecki can we confirm that we want the symmetric bucketing? Currently, zero is pegged to the "halfway quantile" (e.g. the 50th percentile). I like the idea of symmetric bucketing, but that would mean we'd need negative quantile numbers to distinguish between positive and negative factor values.

Also, I'm concerned about the failing tests: I suspect it has something to do with changes to pd.qcut between versions 0.19 and 0.20, although I can't really find anything. Any thoughts @luca-s?

luca-s commented 6 years ago

@eigenfoo I believe you get an error because either the positive or the negative subset is empty. The test that fails is the one where the binning is performed by group, and it seems to me the groups contain either all positive or all negative values.

luca-s commented 6 years ago

@eigenfoo Regarding the symmetric bucketing: I didn't mean that we have to keep the negative quantile numbers; we can convert them to positive ones. Actually, we have to convert them, because I don't believe Alphalens can handle negative quantile numbers. My idea was that if you provide custom ranges for quantiles or bins, you probably want them applied consistently to both positive and negative values. However, I have just realized that the code wouldn't work with custom ranges anyway.

eigenfoo commented 6 years ago

@luca-s I don't think I follow: if the quantile numbers don't reflect the sign of the factor value, then how will we distinguish between, say, an asset in the 20th positive quantile and one in the 20th negative quantile?

I also realize that the zero_aware functionality does not make sense with custom quantile/bin ranges. I will add a test to make sure that zero_aware is ignored unless quantiles or bins is an integer. Does that sound reasonable?

luca-s commented 6 years ago

@eigenfoo perfect, you can add a check at the beginning of the function and raise a ValueError if zero_aware is True and quantiles or bins is not an integer. That way we can forget about symmetric bucketing and keep the current logic.
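A minimal sketch of that guard (the exact error message is illustrative):

# At the top of quantize_factor (sketch; assumes `quantiles`, `bins` and
# `zero_aware` are the function's parameters):
if zero_aware and not (isinstance(quantiles, int) or isinstance(bins, int)):
    raise ValueError("zero_aware should only be used with integer quantiles or bins")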

eigenfoo commented 6 years ago

@luca-s All points have been resolved, except for the tests. It cannot be what you described (i.e. that either the positive or negative subset is empty), because all tests with pandas version >= 0.20 passed successfully.

Perhaps it has to do with this bug that was patched in pandas 0.20?

luca-s commented 6 years ago

@eigenfoo Before your latest change the bug was the one I described; now that you have changed the tests, the bug is the one you linked. Add more data to biased_factor and you should be able to avoid hitting it.

eigenfoo commented 6 years ago

What an interesting bug; nice catch @luca-s! Ready for review now.

luca-s commented 6 years ago

It all looks good now, thanks @eigenfoo. I will go ahead and merge this.