jonathancornelissen / highfrequency

The highfrequency package contains an extensive toolkit for the use of high-frequency financial data in R. It contains functionality to manage, clean and match high-frequency trades and quotes data. Furthermore, it enables users to easily calculate various liquidity measures, estimate and forecast volatility, and investigate microstructure noise and intraday periodicity.

rKernelCov generating negative result for 15 minutes intraday open-close log return (demeaned) #67

Closed stucash closed 3 years ago

stucash commented 3 years ago

I understand the package is mainly meant for tick data, but I believe minute-interval data falls into the high-frequency category as well. I therefore fed 15-minute intraday demeaned log returns (open-to-close) to the rKernelCov function, with alignBy = "minutes" and alignPeriod = 15. However, the resulting variance was negative.

Demeaned Log Return (testv):

2020-11-11 09:00:00 -0.157552500
2020-11-11 09:15:00 -0.478726694
2020-11-11 09:30:00  0.635347473
2020-11-11 09:45:00 -0.831432746
2020-11-11 10:00:00  0.563903981
2020-11-11 10:15:00 -0.257693776
2020-11-11 10:30:00 -0.130999035
2020-11-11 10:45:00  0.236480735
2020-11-11 11:00:00  0.494014365
2020-11-11 11:15:00  0.005768435
2020-11-11 11:30:00  0.235998660
2020-11-11 11:45:00  0.010471418
2020-11-11 12:00:00  0.005796665
2020-11-11 12:15:00 -0.078381046
2020-11-11 12:30:00 -0.101878036
2020-11-11 12:45:00 -0.224006167
2020-11-11 13:00:00 -0.034176174
2020-11-11 13:15:00 -0.036550801
2020-11-11 13:30:00  0.179403324
2020-11-11 13:45:00 -0.116285994
2020-11-11 14:00:00  0.005726209
2020-11-11 14:15:00  0.022165163
2020-11-11 14:30:00 -0.076530706
2020-11-11 14:45:00 -0.123691149
2020-11-11 15:00:00  0.259176342
2020-11-11 15:15:00 -0.177654222
2020-11-11 15:30:00  0.329952389
2020-11-11 15:45:00 -0.158646114

Demeaned Log Return for Beta calculation (testgs):

2020-11-11 09:00:00 -0.07162338
2020-11-11 09:15:00 -0.31874177
2020-11-11 09:30:00  1.02229152
2020-11-11 09:45:00  0.05719529
2020-11-11 10:00:00 -0.10690077
2020-11-11 10:15:00  0.36487529
2020-11-11 10:30:00 -0.58934476
2020-11-11 10:45:00  0.23208744
2020-11-11 11:00:00 -0.38503252
2020-11-11 11:15:00 -0.05803151
2020-11-11 11:30:00 -0.69171630
2020-11-11 11:45:00  0.48175186
2020-11-11 12:00:00 -0.21121323
2020-11-11 12:15:00  0.19400404
2020-11-11 12:30:00 -0.20441684
2020-11-11 12:45:00 -0.26556115
2020-11-11 13:00:00  0.24824897
2020-11-11 13:15:00 -0.43623240
2020-11-11 13:30:00  0.03808637
2020-11-11 13:45:00  0.27068605
2020-11-11 14:00:00 -0.11921513
2020-11-11 14:15:00  0.25276255
2020-11-11 14:30:00  0.26710518
2020-11-11 14:45:00  0.06136138
2020-11-11 15:00:00 -0.02353558
2020-11-11 15:15:00  0.51702493
2020-11-11 15:30:00 -0.29313620
2020-11-11 15:45:00 -0.37943148

Original OHLCV (a tibble, data.frame, or data.table normally, called v):

datetime                open     high      low    close   volume   symbol
2020-11-11 09:00:00 213.5300 213.6100 213.1500 213.1516   4960      V
2020-11-11 09:15:00 213.1500 213.4500 211.9800 212.0900   4108      V
2020-11-11 09:30:00 212.1700 213.6750 212.0100 213.4800 446770      V
2020-11-11 09:45:00 213.5350 213.8600 211.6700 211.7250 201760      V
2020-11-11 10:00:00 211.7050 212.9764 211.7050 212.8600 120891      V
2020-11-11 10:15:00 212.9000 213.0100 211.9250 212.3100  91855      V
2020-11-11 10:30:00 212.3400 212.8500 211.9584 212.0200 124010      V
2020-11-11 10:45:00 212.0800 212.6600 211.9400 212.5400  80901      V
2020-11-11 11:00:00 212.5497 213.7400 212.4300 213.5600 103031      V
2020-11-11 11:15:00 213.5700 213.8050 213.4800 213.5400 103072      V
2020-11-11 11:30:00 213.5700 214.1300 213.5700 214.0322 141786      V
2020-11-11 11:45:00 214.0300 214.2800 213.8500 214.0100  92270      V
2020-11-11 12:00:00 214.0000 214.1899 213.7900 213.9700  82362      V
2020-11-11 12:15:00 213.9600 213.9700 213.5800 213.7500  40243      V
2020-11-11 12:30:00 213.7800 213.8800 213.4200 213.5200  51736      V
2020-11-11 12:45:00 213.5300 213.6700 212.8900 213.0100  85892      V
2020-11-11 13:00:00 213.0500 213.2700 212.9300 212.9350  65135      V
2020-11-11 13:15:00 212.9500 213.1000 212.7700 212.8300  84031      V
2020-11-11 13:30:00 212.8800 213.5400 212.8700 213.2200  93369      V
2020-11-11 13:45:00 213.2200 213.3600 212.9300 212.9300  78724      V
2020-11-11 14:00:00 212.9300 213.2600 212.8700 212.9000  68217      V
2020-11-11 14:15:00 212.8700 213.0100 212.4850 212.8750 113408      V
2020-11-11 14:30:00 212.8750 212.9710 212.3682 212.6700  88292      V
2020-11-11 14:45:00 212.6850 212.8000 212.2200 212.3800 158603      V
2020-11-11 15:00:00 212.3960 213.0000 212.3385 212.9050 139843      V
2020-11-11 15:15:00 212.9000 212.9000 212.2100 212.4800 143600      V
2020-11-11 15:30:00 212.4800 213.1400 212.4500 213.1400 183770      V
2020-11-11 15:45:00 213.1200 213.6900 212.5400 212.7400 467470      V

The function I used to obtain the demeaned log return from original open-close prices is here:

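library(dplyr)  # provides %>%, mutate() and select()

# Demeaned open-to-close log return for each bar
oc.logret <- function(open, close) {
  oc.log <- log(close) - log(open)
  oc.log - mean(oc.log)  # return the demeaned series explicitly
}

vv <- v %>% mutate(lret = oc.logret(open, close)) %>% dplyr::select(datetime, lret)
testv <- xts::xts(vv$lret, order.by = as.POSIXct(vv$datetime))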
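
A quick scale check on these returns, as a minimal sketch. It recomputes the raw open-to-close log returns for the first three bars of the OHLCV table above and compares them with the first three posted testv values. The variable names (raw, posted, offset) are made up for this illustration.

```r
# First three bars of v, copied from the OHLCV table above
open  <- c(213.53, 213.15, 212.17)
close <- c(213.1516, 212.09, 213.48)

# Raw (not demeaned) open-to-close log returns
raw <- log(close) - log(open)

# First three posted testv values
posted <- c(-0.157552500, -0.478726694, 0.635347473)

# If the posted series is 100 * raw minus a constant mean,
# this difference should be nearly the same for every bar
offset <- posted - 100 * raw
print(raw)
print(offset)
```

The raw returns are on the order of 1e-3, while the posted values are roughly 100 times larger, with a nearly constant offset. That pattern would be consistent with the posted series being in percent and demeaned over the full day. This is an inference from the numbers, not something the thread states.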

Then I run rKernelCov and rBeta on the return series:

> rKernelCov(testv, alignBy = "minutes", alignPeriod = 15)
[1] -0.6718775
> rBeta(testv, testgs, RCOVestimator = "rKernelCov")
Error in seq.default(start(ts), end(ts), by = tby) : 
  'from' must be of length 1
> rBeta(testv, testgs, RCOVestimator = "rCov")
[1] 0.0760269
> rBeta(testv, testgs, RCOVestimator = "rAVGCov")
Error in setnames(x, value) : 
  Can't assign 2 names to a 1 column data.table
> rBeta(testv, testgs, RCOVestimator = "rBPCOv")
numeric(0)
> rBeta(testv, testgs, RCOVestimator = "rBPCov")
[1] 0.1112035
> rBeta(testv, testgs, RCOVestimator = "rHYCov")
Error in seq.default(start(ts), end(ts), by = tby) : 
  'from' must be of length 1
> rBeta(testv, testgs, RCOVestimator = "rOWCov")
[1] 0.07655626
> rBeta(testv, testgs, RCOVestimator = "rRTSCov")
[1] "when using rRTSCov, rTSCov, introduce price data - transformation to price data done"
[1] "when using rRTSCov, rTSCov, introduce price data - transformation to price data done"
Error in rRTSCov(list(rData, rIndex)) : 
  Two time scale estimator uses returns based on prices that are K ticks aways.
           Please provide a timeseries of at least 10*K
> rBeta(testv, testgs, RCOVestimator = "rTSCov")
[1] "when using rRTSCov, rTSCov, introduce price data - transformation to price data done"
[1] "when using rRTSCov, rTSCov, introduce price data - transformation to price data done"
Error in rTSCov(list(rData, rIndex)) : 
  Two time scale estimator uses returns based on prices that are K ticks aways.
           Please provide a timeseries of at least 10*K

What baffles me most is that the variance from rKernelCov was negative, and I suspect this affected the rBeta calculation as well. I can't tell whether I fed numerically unstable values into the functions, but the results don't seem right. I am hoping the problem is in my demeaned log return calculation, since that would be easy to fix (though I can't see anything wrong with it...).

I am running MX Linux 19, R version is 4.0.4 (2021-02-15) -- "Lost Library Book"

stucash commented 3 years ago

Hi, could you take a look at whether the function needs some tuning or whether I did something fundamentally wrong?

emilsjoerup commented 3 years ago

I am currently digging into it. It may take some time; this seems really weird :)

stucash commented 3 years ago

> I am currently digging into it. It may take some time; this seems really weird :)

Thanks for getting back :) Sorry, I got stuck on this for some days, lol. I was so obsessed.

emilsjoerup commented 3 years ago

I was unable to replicate the negative variance with other data.

Changing the kernel tuning parameter, kernelParam, seems to fix the problem.

I am tempted to just call it numerical instability, which is not really that surprising, since the kernel covariance estimator is not guaranteed to be positive semidefinite. In the multivariate case we can therefore also get covariance estimates that would theoretically allow the construction of portfolios with negative variance. Not good.
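
To illustrate why an estimator of this kind can go negative, here is a minimal sketch of a generic kernel-weighted realized variance, gamma_0 + 2 * sum_h w_h * gamma_h, where gamma_h is the lag-h sum of cross products of returns. It is not the package's implementation: the function name kernel_var, the lag count H and the two weighting schemes are assumptions made for this example.

```r
# Generic kernel-weighted variance: gamma_0 + 2 * sum_{h=1}^{H} w_h * gamma_h
# (hypothetical helper, not part of highfrequency; requires H < length(r))
kernel_var <- function(r, H = 1, kernel = c("rectangular", "bartlett")) {
  kernel <- match.arg(kernel)
  n <- length(r)
  # Lag-h realized autocovariance (unnormalized sum of cross products)
  gamma <- function(h) sum(r[(h + 1):n] * r[1:(n - h)])
  est <- gamma(0)
  for (h in seq_len(H)) {
    # Rectangular weights are all 1; Bartlett weights decay linearly in h
    w <- if (kernel == "rectangular") 1 else 1 - h / (H + 1)
    est <- est + 2 * w * gamma(h)
  }
  est
}

# Strongly mean-reverting toy returns: +1, -1, +1, -1, ...
r <- rep(c(1, -1), 5)

kernel_var(r, H = 1, kernel = "rectangular")  # 10 + 2 * (-9)       = -8
kernel_var(r, H = 1, kernel = "bartlett")     # 10 + 2 * 0.5 * (-9) =  1
```

With equal weights, a sufficiently negative first-order autocovariance outweighs the sum of squared returns, so the estimate drops below zero. Down-weighting the lags removes the problem in this toy case, which fits the observation that changing the kernel tuning may help.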

We automatically map covariance matrices to a positive semidefinite matrix, but in the univariate case we can't really do that in a nice way IMO.

I have added a warning that is raised when the estimated variance is negative; it hints that changing the kernel parameter may help.
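
The thread does not say how that mapping is done. One common approach, assumed here purely for illustration, is to clip negative eigenvalues to zero and rebuild the matrix. The helper name clip_to_psd is hypothetical and not part of the package.

```r
# Map a symmetric matrix to a positive semidefinite one by eigenvalue clipping
# (illustrative assumption, not necessarily what highfrequency does)
clip_to_psd <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  lambda <- pmax(e$values, 0)  # replace negative eigenvalues with 0
  e$vectors %*% diag(lambda, nrow = length(lambda)) %*% t(e$vectors)
}

# A symmetric "covariance" matrix with eigenvalues 3 and -1
S <- matrix(c(1, 2, 2, 1), nrow = 2)
S_psd <- clip_to_psd(S)
print(S_psd)                                   # every entry is 1.5
print(eigen(S_psd, symmetric = TRUE)$values)   # no negative eigenvalues remain
```

Applied to a single negative variance, the same clipping would simply return 0, which is presumably why it is not a satisfying fix in the univariate case.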

Thanks a lot for reporting this.

stucash commented 3 years ago

Thanks for digging into it; I was guessing the same. One interesting thing: I initially ran the function against a few stock symbols (multivariate input), and I remember that is when I first saw a negative variance value. I then tried to single out the error. I'll see if I can reproduce it.

emilsjoerup commented 3 years ago

Please do, that sounds very weird.