Open sergey-a-berezin opened 1 year ago
First, we derive the formula for beta. If we model P = beta * I + R, then:
Var(R) = Var(P - beta*I) = Var(P) + beta^2 *Var(I) - 2*beta*Cov(P, I)
where Cov(X, Y) = E[(X-E[X])*(Y-E[Y])] is the covariance.
This is a quadratic function of beta, and its minimum is at the point of zero derivative:
d(Var(R)) / d(beta) = 2*beta*Var(I) - 2*Cov(P, I) = 0
beta = Cov(P, I) / Var(I)
which is precisely the definition of beta as used in trading. And now we know where it comes from.
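To make the derivation above concrete, here is a minimal sketch on synthetic data. The helper names (`cov`, `beta`, `residual_var`) and the synthetic parameters are my own assumptions for illustration, not code from this repository. It estimates beta as Cov(P, I) / Var(I) and lets us check that moving beta in either direction only increases Var(R).

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Population covariance E[(X-E[X])*(Y-E[Y])]; cov(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def beta(p, i):
    """beta = Cov(P, I) / Var(I), the minimizer of Var(P - beta*I)."""
    return cov(p, i) / cov(i, i)


def residual_var(p, i, b):
    """Var(R) for R = P - b*I."""
    r = [x - b * y for x, y in zip(p, i)]
    return cov(r, r)


# Synthetic log-profits; true_beta and both sigmas are assumed values.
rng = random.Random(0)
true_beta = 0.8
index = [rng.gauss(0.0, 0.01) for _ in range(5000)]
stock = [true_beta * x + rng.gauss(0.0, 0.02) for x in index]

b = beta(stock, index)
print(f"estimated beta = {b:.3f}")
for delta in (-0.05, 0.0, 0.05):
    print(f"Var(R) at beta{delta:+.2f}: {residual_var(stock, index, b + delta):.6e}")
```

The population (divide-by-n) covariance is used throughout, so the n's cancel in the ratio and the estimate matches the formula exactly.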
Moreover, we know that beta is only half the story. The other half is R: its distribution, mean and variance. The latter two we can readily derive:
E[R] = E[P] - beta*E[I]
Var(R) = Var(P) + (Cov(P, I) / Var(I))^2 * Var(I) - 2 * Cov(P, I) / Var(I) * Cov(P, I)
= Var(P) - Cov(P, I)^2 / Var(I)
In particular, for any beta != 0 we have Var(R) < Var(P). This is, of course, how it should be: Var(R) = Var(P) when beta = 0, and the minimum cannot be larger than that.
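The two formulas for R can be checked numerically. The sketch below is an illustration on synthetic data with assumed parameters, and the variable names are hypothetical. It compares E[R] and Var(R) computed directly from the residuals against the closed forms. As a side note, the closed form for Var(R) is the same as Var(P) * (1 - Corr(P, I)^2), which also makes the inequality Var(R) <= Var(P) visible.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Population covariance; cov(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Synthetic index and stock log-profits (all parameters are assumptions).
rng = random.Random(1)
index = [rng.gauss(0.0005, 0.01) for _ in range(2000)]
stock = [0.001 + 0.6 * x + rng.gauss(0.0, 0.015) for x in index]

b = cov(stock, index) / cov(index, index)
resid = [p - b * i for p, i in zip(stock, index)]

var_p = cov(stock, stock)
var_r_direct = cov(resid, resid)
var_r_formula = var_p - cov(stock, index) ** 2 / cov(index, index)
mean_r_formula = mean(stock) - b * mean(index)

print(f"E[R]:   direct={mean(resid):.6e}  formula={mean_r_formula:.6e}")
print(f"Var(R): direct={var_r_direct:.6e}  formula={var_r_formula:.6e}")
print(f"Var(R)/Var(P) = {var_r_direct / var_p:.4f}")
```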
Quick observations so far, with NASDAQ Composite (^IXIC) as the reference:
- [0.125..1.46] with mean 0.76 for liquid stocks (vol > $1M).
- R is the same as the general log-profits P, that is, Student's t distribution with alpha~=3.0.
- MAD[R]/MAD[P] 90% interval is [0.8..1.0], and for Sigma[R]/Sigma[P] it is [0.825..0.997], suggesting that stocks generally don't follow the index that closely.
- R seems smaller than mean of P in general, but this needs to be compared ticker by ticker, and relative mean didn't make sense. TBD.
To do next:
- Rs from different stocks correlate. Compute the histogram of pairwise correlations.
- beta and/or E[R] predicts performance relative to the index. E.g. compute the metrics for N samples, then measure performance in the subsequent K samples. Plot a scatterplot of the results.

The distribution of R cross-correlations for the liquid stocks has a mean of 0.038 and a 90% interval of [-0.07..0.18], which is encouraging as the numbers are not very large. The sample has 10,262 tickers and 25,121,390 log-profit samples. Note that not all pairs of tickers overlap in time, since there are usually 7-8K tickers listed at any given point in time.
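For reference, a minimal sketch of how such pairwise correlations can be computed when tickers cover different date ranges. This is not the code that produced the numbers above. The data layout (`{ticker: {day: R}}`), the function names and the `min_overlap` threshold are all assumptions made for the example. Each pair is correlated only over the days both tickers have, and pairs with too little overlap are skipped.

```python
import itertools
import math
import random


def corr(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pairwise_corrs(residuals, min_overlap=100):
    """Correlate every ticker pair over shared days; skip pairs with few shared days."""
    out = []
    for a, b in itertools.combinations(sorted(residuals), 2):
        days = sorted(residuals[a].keys() & residuals[b].keys())
        if len(days) < min_overlap:
            continue
        out.append(corr([residuals[a][d] for d in days],
                        [residuals[b][d] for d in days]))
    return out


def interval90(values):
    """Empirical 5% and 95% quantiles (simple index-based estimate)."""
    s = sorted(values)
    return s[int(0.05 * (len(s) - 1))], s[int(0.95 * (len(s) - 1))]


# Hypothetical input: 30 independent synthetic tickers with different listing periods.
rng = random.Random(7)
residuals = {}
for k in range(30):
    start = rng.randrange(0, 500)
    length = rng.randrange(300, 800)
    residuals[f"T{k:02d}"] = {d: rng.gauss(0.0, 0.02)
                              for d in range(start, start + length)}

corrs = pairwise_corrs(residuals)
lo, hi = interval90(corrs)
print(f"{len(corrs)} pairs, mean={sum(corrs) / len(corrs):+.4f}, 90% interval=[{lo:+.3f}..{hi:+.3f}]")
```

The all-pairs loop is quadratic in the number of tickers, so for ~10K tickers a sampled subset of pairs is probably the more practical choice.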
Moreover, the distribution of synthetic R cross-correlations for 5,000 tickers, each of 2,500 samples, has a mean very close to 0 with the 90% range [-0.035..0.035]. This gives us an under-approximation of the confidence interval, and it is only about 3x narrower than the interval in the experiment. In other words, there is hope that the above interval is mostly due to random noise.
All synthetic correlations are computed over 2,500 samples, while in the experiment the lengths vary. Therefore, it would be useful to find out the actual distribution of sequence lengths and perhaps generate sequences of random length from a similar distribution for a more accurate confidence interval estimation.
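A rough sanity check on the synthetic range: for independent series with finite variance, the sample correlation over n samples has a standard deviation of about 1/sqrt(n), so for n = 2,500 the 90% range should be near 1.645/50 ~= 0.033, in line with the [-0.035..0.035] above. The sketch below illustrates the random-length idea. It is only an illustration: the uniform length distribution is a placeholder assumption (the actual distribution still needs to be measured), and the function names are hypothetical.

```python
import math
import random


def student_t(rng, nu):
    """Student's t sample as Z / sqrt(chi2/nu), with chi2 ~ Gamma(shape=nu/2, scale=2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)
    return z / math.sqrt(chi2 / nu)


def corr(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def synthetic_interval90(rng, lengths, nu=3.0):
    """90% interval of correlations between independent t-distributed pairs."""
    corrs = []
    for n in lengths:
        xs = [student_t(rng, nu) for _ in range(n)]
        ys = [student_t(rng, nu) for _ in range(n)]
        corrs.append(corr(xs, ys))
    corrs.sort()
    return corrs[int(0.05 * (len(corrs) - 1))], corrs[int(0.95 * (len(corrs) - 1))]


rng = random.Random(42)
n_pairs = 200  # kept small so the pure-Python loop stays fast
fixed = synthetic_interval90(rng, [2500] * n_pairs)
# Placeholder length distribution: uniform in [250, 2500].
varied = synthetic_interval90(rng, [rng.randint(250, 2500) for _ in range(n_pairs)])
print(f"fixed length 2500: [{fixed[0]:+.3f}..{fixed[1]:+.3f}]")
print(f"random lengths:    [{varied[0]:+.3f}..{varied[1]:+.3f}]")
```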
To estimate the stability of beta over time, for each ticker compute beta[t-n]/beta[t] (option: do it for all available t), then add it as a point to a histogram. The expected value should be 1.0.
The ratio beta[t-n]/beta[t] proved to be too noisy and not representative, especially (I'm guessing) for beta close to 0. I need to find a better measure of stability and/or predictiveness of beta. Or, in fact, stability of any statistic - this could be useful for many other parameters.
Let's assume a statistic s[t] computed over a window w (that is, over log-profits in [t-w+1..t]). The null hypothesis is that s is constant for a given ticker over time, but may be different for different tickers. The alternative hypothesis is that s changes over time.
To test this hypothesis, we measure s over at least two windows, preferably disjoint, and somehow measure the difference. Then compare the resulting differences across stocks with the differences in synthetic series with a constant s, and see if the synthetic random noise is comparable with the experiment.
The trick is to find the right difference measure, one which wouldn't depend on the actual expected value of s and would have the same random distribution due to sample noise.
One idea is to use a normalization similar to the one we used for the log-profit distribution itself, only normalizing for the value of s. That is, e.g. compute the ticker's s for the entire available series, adjust the entire series to normalize s, e.g. make it equal 0 or 1, whatever makes more sense (for beta, it makes sense to normalize to 1 by dividing each log-profit by the average beta), and then compute diff=s1-s2 for the now normalized s.
In case of beta, it also makes sense to ignore tickers whose beta is too close to 0, e.g. |beta|<0.1.
In fact, we don't need to normalize the entire series; it is sufficient to compute beta[subrange] / beta - 1, still ignoring tickers with too small an average beta.
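A minimal sketch of the beta[subrange] / beta - 1 measure, assuming consecutive disjoint windows as the subranges. The function `beta_relative_shifts` and its parameters are hypothetical names for this example, not an existing API in the project. Tickers with |beta| below the threshold produce no points.

```python
import random


def cov(xs, ys):
    """Population covariance; cov(xs, xs) is the variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def beta(p, i):
    return cov(p, i) / cov(i, i)


def beta_relative_shifts(p, i, window, min_abs_beta=0.1):
    """Return beta[subrange]/beta - 1 over consecutive disjoint windows.

    Returns an empty list when the full-series |beta| is below min_abs_beta.
    """
    full = beta(p, i)
    if abs(full) < min_abs_beta:
        return []
    shifts = []
    for start in range(0, len(p) - window + 1, window):
        sub = beta(p[start:start + window], i[start:start + window])
        shifts.append(sub / full - 1.0)
    return shifts


# Synthetic data with a constant beta (all parameters are assumptions).
rng = random.Random(5)
index = [rng.gauss(0.0, 0.01) for _ in range(2500)]
stock = [0.8 * x + rng.gauss(0.0, 0.01) for x in index]
noise_only = [rng.gauss(0.0, 0.005) for _ in index]  # true beta is 0

shifts = beta_relative_shifts(stock, index, window=250)
print([round(s, 3) for s in shifts])
print(beta_relative_shifts(noise_only, index, window=250))
```

With a constant beta the points should scatter around 0, and their spread gives the noise level to compare the real tickers against.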
I think it makes sense to implement a generic method in experiments for generating data for a TimeShiftPlot from a log-profit series. It can then be used to estimate stability of other statistics such as mean, MAD, etc.
Note that this plot only makes sense in comparison with a similar plot for synthetic data with a known and constant parameter under study. If the distribution of the experimental values is similar to the synthetic one, then we cannot reject the null hypothesis that the parameter is indeed constant and any changes are random noise. TO DO: come up with a way to estimate the confidence of rejection / non-rejection.
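To illustrate the comparison, here is a sketch of a generic normalized difference for an arbitrary statistic. It is not the experiments / TimeShiftPlot code. `normalized_diff`, `mad` and `interval90` are hypothetical helpers, MAD is assumed to mean the mean absolute deviation, and dividing by the full-series value is an assumed normalization that only suits scale-like statistics. The series with a scale that doubles halfway is a stand-in for a ticker whose parameter is not constant.

```python
import random


def mad(xs):
    """Mean absolute deviation around the mean (assumed meaning of MAD)."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def normalized_diff(series, stat, w):
    """(s1 - s2) / s: s1, s2 over the first and last w samples, s over the whole series."""
    if len(series) < 2 * w:
        raise ValueError("need at least 2*w samples for two disjoint windows")
    return (stat(series[:w]) - stat(series[-w:])) / stat(series)


def interval90(values):
    """Empirical 5% and 95% quantiles (simple index-based estimate)."""
    s = sorted(values)
    return s[int(0.05 * (len(s) - 1))], s[int(0.95 * (len(s) - 1))]


rng = random.Random(3)
w, n_series = 500, 200

# Null hypothesis holds: the scale is constant over the whole series.
constant = [[rng.gauss(0.0, 1.0) for _ in range(2 * w)] for _ in range(n_series)]
# Null hypothesis violated: the scale doubles in the second half.
shifted = [[rng.gauss(0.0, 1.0) for _ in range(w)] +
           [rng.gauss(0.0, 2.0) for _ in range(w)] for _ in range(n_series)]

const_diffs = [normalized_diff(s, mad, w) for s in constant]
shift_diffs = [normalized_diff(s, mad, w) for s in shifted]
print("constant scale:", interval90(const_diffs))
print("doubling scale:", interval90(shift_diffs))
```

For a location-like statistic such as the mean, subtracting the full-series value rather than dividing by it would be the more natural normalization.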
Study cross-correlation between stocks, and most importantly, correlations with indexes. In particular:
- Model P as P = beta*I + R, where I is the log-profit of some reference index, R is a random variable, and beta is a constant factor minimizing R's standard deviation.
- beta:
  - beta? Test on synthetic data.
- R:
  - R's distribution? Is it also Student's t? What's the parameter a?
  - MAD(R) with MAD(P).
  - Rs of different stocks are correlated? Can we assume their independence for modeling portfolios?