stockparfait / experiments

Statistical experiments with financial data

Implement "beta" experiment #93

Open · sergey-a-berezin opened this issue 1 year ago

sergey-a-berezin commented 1 year ago

Study cross-correlation between stocks, and most importantly, correlations with indexes. In particular:

sergey-a-berezin commented 1 year ago

First, we derive the formula for beta. If we model the stock's log-profit P as P = beta * I + R, where I is the index's log-profit and R is the residual, then:

Var(R) = Var(P - beta*I) = Var(P) + beta^2 * Var(I) - 2*beta*Cov(P, I)

where Cov(X, Y) = E[(X-E[X])*(Y-E[Y])] is covariance.

This is a quadratic function of beta, and its minimum is at the point of zero derivative:

d(Var(R)) / d(beta) = 2*beta*Var(I) - 2*Cov(P, I) = 0
beta = Cov(P, I) / Var(I)

which is precisely the definition of beta as used in trading. And now we know where it comes from.

Moreover, we know that beta is only half the story. The other half is R, its distribution, mean and variance. The latter two we can readily derive:

E[R] = E[P] - beta*E[I]
Var(R) = Var(P) + (Cov(P, I) / Var(I))^2 * Var(I) - 2 * Cov(P, I) / Var(I) * Cov(P, I)
  = Var(P) - Cov(P, I)^2 / Var(I)

In particular, it's obvious that for any beta != 0, we have Var(R) < Var(P), which, of course, is how it should be, since Var(R) = Var(P) when beta = 0, and the minimum can only be smaller.
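
To sanity-check the derivation numerically, here is a minimal Python sketch (using numpy; the helper name `beta_decomposition` is made up, not something from the repo). It estimates beta from paired log-profit samples and verifies the closed form for Var(R):

```python
import numpy as np

def beta_decomposition(p, i):
    """Return (beta, E[R], Var(R)) for the model P = beta*I + R."""
    p, i = np.asarray(p, dtype=float), np.asarray(i, dtype=float)
    c = np.cov(p, i, ddof=1)            # 2x2 sample covariance matrix
    var_p, cov_pi, var_i = c[0, 0], c[0, 1], c[1, 1]
    beta = cov_pi / var_i               # beta = Cov(P, I) / Var(I)
    r = p - beta * i                    # residual series R
    var_r = var_p - cov_pi**2 / var_i   # Var(R) = Var(P) - Cov(P, I)^2 / Var(I)
    assert np.isclose(np.var(r, ddof=1), var_r)
    return beta, r.mean(), var_r

# Quick check on synthetic data with a known beta.
rng = np.random.default_rng(0)
index = rng.normal(0.0005, 0.01, size=2500)        # synthetic index log-profits
stock = 1.3 * index + rng.normal(0.0, 0.02, 2500)  # true beta = 1.3
print(beta_decomposition(stock, index))            # beta comes out close to 1.3
```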

sergey-a-berezin commented 1 year ago

Quick observations so far, with NASDAQ Composite (^IXIC) as the reference:

sergey-a-berezin commented 1 year ago

To do next:

sergey-a-berezin commented 1 year ago

The distribution of R cross-correlations for the liquid stocks has a mean of 0.038 and a 90% interval of [-0.07..0.18], which is encouraging, as the numbers are not very large. The sample has 10,262 tickers and 25,121,390 log-profit samples. Note that not all tickers will intersect, since there are usually 7-8K tickers listed at any given point in time.

Moreover, the distribution of synthetic R cross-correlations for 5,000 tickers of 2,500 samples each has a mean very close to 0 with a 90% range of [-0.035..0.035]. This gives us an under-approximation of the confidence interval, and it is only about 3x narrower than the experimental one. In other words, there is hope that the above interval is mostly due to random noise.

All synthetic correlations are computed over 2,500 samples, while in the experiment the lengths vary. Therefore, it would be useful to find out the actual distribution of sequence lengths and perhaps generate sequences of random length from a similar distribution for a more accurate confidence interval estimation.
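
As a rough sketch of that noise-only baseline (assuming iid normal log-profits, with `lengths` standing in for the yet-to-be-measured distribution of sequence lengths), one could sample pairs of independent series, correlate each pair over its overlap, and read off the 90% interval:

```python
import numpy as np

rng = np.random.default_rng(1)

def null_corr_interval(lengths, n_pairs=10_000, quantiles=(0.05, 0.95)):
    """90% interval of cross-correlations between independent random series
    whose lengths are drawn from the observed length distribution."""
    corrs = np.empty(n_pairs)
    for k in range(n_pairs):
        n = min(rng.choice(lengths), rng.choice(lengths))  # overlap length
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        corrs[k] = np.corrcoef(x, y)[0, 1]
    return np.quantile(corrs, quantiles)

# With a fixed length of 2,500 samples this reproduces roughly [-0.033, 0.033];
# feeding in the real length distribution should widen it toward the experiment.
print(null_corr_interval(np.array([2500])))
```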

sergey-a-berezin commented 1 year ago

To estimate the stability of beta over time, for each ticker compute beta[t-n]/beta[t] (option: do it for all available t), then add it as a point to a histogram. The expected value should be 1.0.
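
A minimal sketch of that computation (hypothetical helpers; `p` and `i` are the ticker's and the index's aligned log-profit series): beta over the window of length `w` ending at `t-n`, divided by beta over the window ending at `t`:

```python
import numpy as np

def window_beta(p, i, end, w):
    """Beta of ticker log-profits p vs. index i over the window [end-w+1 .. end]."""
    pw, iw = p[end - w + 1:end + 1], i[end - w + 1:end + 1]
    c = np.cov(pw, iw, ddof=1)
    return c[0, 1] / c[1, 1]

def beta_ratio(p, i, t, n, w):
    """One histogram point per ticker: beta[t-n] / beta[t]."""
    return window_beta(p, i, t - n, w) / window_beta(p, i, t, w)
```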

sergey-a-berezin commented 1 year ago

The ratio beta[t-n]/beta[t] proved to be too noisy and not representative, especially (I'm guessing) for beta close to 0. I need to find a better measure of stability and/or predictiveness of beta. Or, in fact, stability of any statistic - this could be useful for many other parameters.

Let's assume a statistic s[t] computed over a window w (that is, over log-profits in [t-w+1..t]). The null hypothesis is that s is constant for a given ticker over time, but may be different for different tickers. The alternative hypothesis is that s changes over time.

To test this hypothesis, we measure s over at least two windows, preferably disjoint, and somehow measure the difference. Then compare the resulting differences across stocks with the differences in synthetic series with a constant s, and see if the synthetic random noise is comparable with the experiment.

The trick is to find the right difference measure, one that doesn't depend on the actual expected value of s and whose distribution of sample noise is the same across tickers.

One idea is to use a normalization similar to the one we did for the log-profit distribution itself, only normalizing for the value of s. That is, compute the ticker's s over the entire available series, adjust the series so that s becomes a fixed value such as 0 or 1, whichever makes more sense (for beta, it makes sense to normalize to 1 by dividing each log-profit by the average beta), and then compute diff = s1 - s2 for the now-normalized s.

In case of beta, it also makes sense to ignore tickers whose beta is too close to 0, e.g. |beta|<0.1.
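
Putting these pieces together, here is a sketch of the normalized difference for beta (assumptions: two disjoint half-series windows, 0.1 as the cutoff; the helper names are made up):

```python
import numpy as np

MIN_ABS_BETA = 0.1   # skip tickers whose full-series beta is too close to zero

def series_beta(p, i):
    c = np.cov(p, i, ddof=1)
    return c[0, 1] / c[1, 1]

def beta_stability_diff(p, i):
    """diff = s1 - s2 for s = beta, after normalizing the full-series beta to 1."""
    b = series_beta(p, i)
    if abs(b) < MIN_ABS_BETA:
        return None                          # ignore near-zero-beta tickers
    p = np.asarray(p, dtype=float) / b       # now the full-series beta equals 1
    half = len(p) // 2                       # two disjoint windows
    s1 = series_beta(p[:half], i[:half])
    s2 = series_beta(p[half:], i[half:])
    return s1 - s2
```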

sergey-a-berezin commented 1 year ago

In fact, we don't need to normalize the entire series; it is sufficient to compute beta[subrange] / beta - 1, still ignoring tickers with too small an average beta.

I think it makes sense to implement a generic method in experiments for generating data for a TimeShiftPlot from a log-profit series. It can then be used to estimate stability of other statistics such as mean, MAD, etc.

Note that this plot only makes sense in comparison with a similar plot for synthetic data with a known and constant parameter under study. If the distribution of the experimental values is similar to the synthetic one, then we cannot reject the null hypothesis that the parameter is indeed constant and that any changes are random noise. TODO: come up with a way to estimate the confidence of rejection / non-rejection.
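
A sketch of what such a generic generator could look like (the actual TimeShiftPlot interface is assumed, not taken from the repo; the statistic is any function of a log-profit slice, so beta would either close over the aligned index series or be applied after the normalization above):

```python
import numpy as np

def time_shift_points(log_profits, stat, window, step, min_abs_stat=0.0):
    """Data points for a TimeShiftPlot: stat(window) / stat(full series) - 1
    for a window sliding through the series in increments of `step`."""
    s_full = stat(log_profits)
    if abs(s_full) < min_abs_stat:   # e.g. 0.1 for beta, as discussed above
        return []
    return [stat(log_profits[start:start + window]) / s_full - 1.0
            for start in range(0, len(log_profits) - window + 1, step)]

# Example: stability of MAD over 252-day (roughly 1-year) windows.
def mad(x):
    return np.mean(np.abs(x - np.mean(x)))

rng = np.random.default_rng(2)
series = rng.normal(0.0005, 0.02, size=2500)
points = time_shift_points(series, mad, window=252, step=252)
```

The same generator applied to synthetic series with a constant parameter then gives the reference distribution to compare the experimental points against.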