jonathancornelissen / highfrequency

The highfrequency package contains an extensive toolkit for the use of high-frequency financial data in R. It provides functionality to manage, clean, and match high-frequency trades and quotes data. Furthermore, it enables users to easily calculate various liquidity measures, estimate and forecast volatility, and investigate microstructure noise and intraday periodicity.

Interpreting the result from `aggregateQuotes` #92

Closed · stucash closed this issue 2 years ago

stucash commented 2 years ago
```r
library(highfrequency)
library(data.table)
library(dplyr)

> singleDaySampleQData <- sampleQData[month(DT)==1 & yday(DT)==2]
> aggregateQuotes(singleDaySampleQData, alignPeriod=1)  # aggregate into 1-minute buckets; alignBy defaults to "minutes"
# output
                      DT SYMBOL    BID    OFR OFRSIZ BIDSIZ EX MIDQUOTE
  1: 2018-01-02 09:30:00    XXX 158.39 158.50     18      1  N  158.445
  2: 2018-01-02 09:31:00    XXX 158.40 158.51    139    239  N  158.455
  3: 2018-01-02 09:32:00    XXX 158.50 158.67    183    328  N  158.585
  4: 2018-01-02 09:33:00    XXX 158.60 158.75    129    332  N  158.675
  5: 2018-01-02 09:34:00    XXX 158.88 159.00   1528    447  N  158.940
 ---                                                                   
387: 2018-01-02 15:56:00    XXX 156.79 156.81   1260   1046  N  156.800
388: 2018-01-02 15:57:00    XXX 156.79 156.80   1239    893  N  156.795
389: 2018-01-02 15:58:00    XXX 156.82 156.83   1589   1196  N  156.825
390: 2018-01-02 15:59:00    XXX 156.89 156.91   2528   1185  N  156.900
391: 2018-01-02 16:00:00    XXX 157.02 157.03   7624   2395  N  157.025
```
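
For reference, the same call with the alignment arguments spelled out. The argument names follow the package documentation; the market-session values shown are the documented defaults, so treat them as an assumption if your version differs:

```r
# Explicit equivalent of aggregateQuotes(singleDaySampleQData, alignPeriod = 1)
aggregateQuotes(singleDaySampleQData,
                alignBy     = "minutes",   # unit of the aggregation grid
                alignPeriod = 1,           # one minute per bucket
                marketOpen  = "09:30:00",  # session bounds (documented defaults)
                marketClose = "16:00:00")
```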

```r
# using dplyr here to be more explicit
> singleDaySampleQData %>% 
  mutate(HM = strftime(DT, "%H:%M", tz="US/Eastern") ) %>% 
  group_by(HM) %>% 
  summarise( across( .cols=c(OFRSIZ, BIDSIZ), .fns = sum) )
# output
# A tibble: 390 × 3
   HM    OFRSIZ BIDSIZ
   <chr>  <int>  <int>
 1 09:30    157    240
 2 09:31    183    328
 3 09:32    129    332
 4 09:33   1528    447
 5 09:34    450    596
 6 09:35    239    389
 7 09:36    339    221
 8 09:37    311    340
 9 09:38    423    178
10 09:39    425    429
# … with 380 more rows
```
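
The `strftime` trick above labels each minute by its start, i.e. left-closed buckets [09:30:00, 09:31:00). A sketch of the same grouping with `lubridate::floor_date`, which makes that left-labelling explicit (`lubridate` is an extra dependency assumed here, not used elsewhere in this thread):

```r
library(lubridate)

# floor_date() maps every tick in [09:30:00, 09:31:00) to 09:30:00,
# so the sums match the strftime version exactly.
singleDaySampleQData %>%
  mutate(bucket = floor_date(DT, unit = "minute")) %>%
  group_by(bucket) %>%
  summarise(across(.cols = c(OFRSIZ, BIDSIZ), .fns = sum))
```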

```r
# singleDaySampleQData itself
> singleDaySampleQData
# output
                            DT SYMBOL    BID    OFR OFRSIZ BIDSIZ EX MIDQUOTE
    1: 2018-01-02 09:30:00.115    XXX 158.39 158.50     18      1  N  158.445
    2: 2018-01-02 09:30:00.125    XXX 158.39 158.50     18      1  N  158.445
    3: 2018-01-02 09:30:00.145    XXX 158.39 158.58      1      1  N  158.485
    4: 2018-01-02 09:30:00.176    XXX 158.39 158.58      1      1  N  158.485
    5: 2018-01-02 09:30:00.243    XXX 158.39 158.58      1      1  N  158.485
   ---                                                                       
24473: 2018-01-02 15:59:59.359    XXX 157.02 157.03     34      8  N  157.025
24474: 2018-01-02 15:59:59.640    XXX 157.02 157.03     85      8  N  157.025
24475: 2018-01-02 15:59:59.710    XXX 157.02 157.03     86      6  N  157.025
24476: 2018-01-02 15:59:59.950    XXX 157.02 157.03     50      3  N  157.025
24477: 2018-01-02 15:59:59.980    XXX 157.02 157.03     52      3  N  157.025
```
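
As a sanity check on the raw table (the `US/Eastern` timezone is an assumption, carried over from the `strftime` call above), the totals for the first minute can be pulled straight out of the `data.table`; this should reproduce the 157 / 240 figures from the `dplyr` table:

```r
# All raw quote updates before 09:31:00, i.e. the dplyr 09:30 bucket.
singleDaySampleQData[DT < as.POSIXct("2018-01-02 09:31:00", tz = "US/Eastern"),
                     .(n = .N, OFRSIZ = sum(OFRSIZ), BIDSIZ = sum(BIDSIZ))]
```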

I understand that `aggregateQuotes` always reports the first observation on its own (here it is the first row, timestamped 09:30:00.115); what I didn't understand was that the sum for the first interval does not include that first row. If we add up the first two rows of the `aggregateQuotes` result, we get the same totals as in the `dplyr` version (OFRSIZ: 18 + 139 = 157, BIDSIZ: 1 + 239 = 240). In my experience the first observation is usually disregarded in actual analysis, so this arrangement would lead to incorrect calculations downstream (the difference may well be trivial, I know).
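
To make the arithmetic concrete, here is a minimal `data.table` sketch of the labelling the output implies (my reading of the behaviour, not the package's internal code; the `US/Eastern` timezone is again an assumption): the first tick stands alone as the 09:30:00 row, and each later tick is summed into the bucket that ends at its ceiling minute, so the 09:31:00 row covers (09:30:00, 09:31:00].

```r
# Label every tick with the minute bucket that ends at or after it
# (right-closed intervals, labelled by their right endpoint).
q <- copy(singleDaySampleQData)
q[, bucket := as.POSIXct(ceiling(as.numeric(DT) / 60) * 60,
                         origin = "1970-01-01", tz = "US/Eastern")]

# Drop the opening snapshot (row 1), then sum the 09:31:00 bucket:
# this reproduces the 139 / 239 row from aggregateQuotes above.
q[-1][bucket == as.POSIXct("2018-01-02 09:31:00", tz = "US/Eastern"),
      .(OFRSIZ = sum(OFRSIZ), BIDSIZ = sum(BIDSIZ))]
```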

If this is intentional, may I ask why (that is, why did the team choose to leave the first row out of the sum for the first interval)?

Thanks a lot!

stucash commented 2 years ago

Never mind, I guess I've done too much coding and need a break.