rcyeh / cfem2013

Cornell Financial Engineering Manhattan 2013 Project

SOI Parser Performance Improvement #13

Open zc238 opened 10 years ago

zc238 commented 10 years ago

This issue is for SOI parser performance enhancement. The code is isolated in RScripts/parserPerformanceTest.R.

Source parser.R first to load the required functions, then simply run every line of the file.

Two main functions are called inside the loops to determine which combination of bucket size, time bin, and threshold for excluding trades achieves the highest R^2.

Side note: for SPY, I had to add the following line (you can check the memory limit R uses by calling memory.limit() on a Windows machine):

invisible(utils::memory.limit(7987))

to R/etc/Rprofile.site in order to obtain enough memory. I am not sure what the equivalent construct is on a Mac.
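For reference, the same limit can be checked and raised interactively first on a Windows machine, before deciding what to put in Rprofile.site (this is just the memory.limit call mentioned above, used from the console):

    utils::memory.limit()            # query the current limit, in MB
    utils::memory.limit(size = 7987) # request a higher limit for this session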

Appreciate the help!

rcyeh commented 10 years ago

On my computer, calc_OI_by_time_buckets takes 15-16 seconds for AMZN. Since you scanned a 3-dimensional parameter space, you had 675 runs, which I can see would take about 3 hours just for AMZN.

When I tried to run the same for SPY, I saw the process memory jumping around 1 GB, and it took too long.

I did:

> Rprof(memory.profiling=TRUE, line.profiling=TRUE)
> SOI_buckets_delta_prices <- calc_OI_by_time_buckets(time_bin,trades[which(trades$size>thres1),],bucket_size, F, L, T)     
> Rprof(NULL)
> summaryRprof(lines="both")

and the top results were:

$by.self
                        self.time self.pct total.time total.pct
"[.data.frame"               4.00    25.77       5.02     32.35
"[<-.data.frame"             3.72    23.97       3.82     24.61
parser.R#286                 1.58    10.18       5.40     34.79
"as.POSIXct.POSIXlt"         0.62     3.99       1.20      7.73
"structure"                  0.62     3.99       0.86      5.54
"strptime"                   0.48     3.09       1.36      8.76
...
$by.total
                          total.time total.pct self.time self.pct
"calc_OI_by_time_buckets"      15.52    100.00      0.00     0.00
parser.R#286                    5.40     34.79      1.58    10.18
"["                             5.20     33.51      0.00     0.00
"[.data.frame"                  5.02     32.35      4.00    25.77
"[<-.data.frame"                3.82     24.61      3.72    23.97
"[<-"                           3.82     24.61      0.00     0.00
parser.R#324                    3.46     22.29      0.06     0.39
"-"                             2.84     18.30      0.16     1.03
"-.POSIXt"                      2.68     17.27      0.04     0.26
"difftime"                      2.64     17.01      0.12     0.77
"strptime"                      1.36      8.76      0.48     3.09
parser.R#245                    1.36      8.76      0.00     0.00
parser.R#230                    1.34      8.63      0.06     0.39
"as.POSIXct"                    1.30      8.38      0.10     0.64
parser.R#247                    1.28      8.25      0.04     0.26
parser.R#246                    1.26      8.12      0.04     0.26
"as.POSIXct.POSIXlt"            1.20      7.73      0.62     3.99
...

I'm surprised at all the calls related to time. Reading the code, I see you are doing:

    time <- strptime(trades[i+q,"time"],"%H:%M:%OS")
    volume <- as.numeric(trades[i+q,"size"])
    price <- as.numeric(trades[i+q,"price"])

in every parsing function. In R, it is more efficient to process whole vectors than one element at a time. At the very least, you are doing this 675 times more often than you need to. I changed these to a one-time:

    a[["time"]] <- strptime(a[["time"]],"%H:%M:%OS")

when you first read in the file. Further, size and price are already double-precision floating-point columns --- there's no need to as.numeric them again. I re-ran profiling and discovered that there appears to be no improvement:

$by.total
                          total.time total.pct self.time self.pct
"calc_OI_by_time_buckets"      14.02    100.00      0.02     0.14
...

and there continue to be many calls to as.POSIXct and the like. Let's go one step further:

    a[["time"]] <- as.integer(as.POSIXct(strptime(a[["time"]],"%H:%M:%OS")))

since I don't think you use the microsecond resolution in that column. The result is better, most probably due to memory caching:

$by.total
                          total.time total.pct self.time self.pct
"calc_OI_by_time_buckets"       8.12    100.00      0.00     0.00
"["                             4.00     49.26      0.02     0.25
"[.data.frame"                  3.70     45.57      2.80    34.48
parser.R#286                    3.06     37.68      1.04    12.81
"[<-.data.frame"                2.04     25.12      1.96    24.14
"[<-"                           2.04     25.12      0.00     0.00
parser.R#245                    1.08     13.30      0.12     1.48
parser.R#230                    0.84     10.34      0.06     0.74
parser.R#246                    0.82     10.10      0.08     0.99
parser.R#247                    0.80      9.85      0.08     0.99
parser.R#324                    0.42      5.17      0.12     1.48
"%in%"                          0.38      4.68      0.16     1.97
parser.R#228                    0.26      3.20      0.06     0.74

Next, I'm not sure what to do with all of the "[" and data.frame costs --- I think those are due to the element-by-element access, some of which may be unavoidable because the VPIN algorithm needs to maintain state. You'd have to do a wholesale rewrite, adding columns to track the running volume to be assigned to the next bucket.
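As a very rough sketch of that direction (illustrative only --- this just assigns whole trades to volume buckets by cumulative size, and does not split a trade that straddles a bucket boundary the way the current code does around line 286):

    # Sketch: vectorized bucket assignment by cumulative volume.
    # Assumes `trades` and `bucket_volume_size` as in parser.R.
    cum_vol       <- cumsum(trades$size)
    bucket_id     <- floor((cum_vol - 1) / bucket_volume_size)  # 0-based bucket index
    bucket_volume <- tapply(trades$size, bucket_id, sum)        # per-bucket volume totals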

Let's put that aside for now. The next most-expensive area is line 286, which is:

        trades[i+q,"size"] <- volume_count + volume - bucket_volume_size

That looks innocuous, but I found that by rewriting every one of your trades[i+q,"col"] constructs as trades$col[i+q], I saved another 40% of time. I'm now at:

$by.self
                  self.time self.pct total.time total.pct
"$<-.data.frame"       1.76    53.33       1.80     54.55
parser.R#286           0.52    15.76       2.32     70.30
parser.R#247           0.18     5.45       0.18      5.45
parser.R#230           0.12     3.64       0.12      3.64
parser.R#245           0.08     2.42       0.08      2.42
...

$by.total
                          total.time total.pct self.time self.pct
"calc_OI_by_time_buckets"       3.30    100.00      0.00     0.00
parser.R#286                    2.32     70.30      0.52    15.76
"$<-.data.frame"                1.80     54.55      1.76    53.33
"$<-"                           1.80     54.55      0.00     0.00
"$"                             0.22      6.67      0.00     0.00
parser.R#247                    0.18      5.45      0.18     5.45
parser.R#230                    0.12      3.64      0.12     3.64
parser.R#245                    0.08      2.42      0.08     2.42
"var"                           0.08      2.42      0.02     0.61
"assign_buy"                    0.08      2.42      0.00     0.00

which I have committed below at aa4e93cbb2759468706854cc8e5b80a3a3d70f24.
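For anyone who wants to see the indexing difference in isolation, here is a tiny base-R timing sketch on a made-up data frame (illustrative only; the exact numbers will differ by machine):

    df <- data.frame(size = runif(1e5), price = runif(1e5))
    system.time(for (i in 1:1e4) df[i, "size"] <- 1)   # dispatches to `[<-.data.frame`
    system.time(for (i in 1:1e4) df$size[i] <- 1)      # `$<-` plus plain vector assignment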

Regarding parser.R lines 40-54: it seems that the sense of the comparison in lines 40-46 is the opposite of that in lines 48-54. I would guess that lines 41 and 43 are wrong --- would you check?

Regarding parser.R lines 233-239: just to check, you normally don't call this function with just trades, right? Otherwise, the assign_buy function later won't have a quote to use. I think you can combine the i and q variables. (By the way, q is also the name of the function that quits R.)

Regarding parser.R lines 300-306: instead of checking the signed variable on every pass through the loop, you can just run abs() on the OI_buckets variable at the end. This only saves space, though; it doesn't save time.

Regarding parser.R line 326: you don't need to check that i == length(trades[,1]) because you discard the information calculated anyway. Removing that check saves about a quarter second.

rcyeh commented 10 years ago

Running the new parserPerformanceTest.R, I get about 45 results per minute. Then 675 runs should take about 15 minutes, which is about ten times faster than the three hours we estimated before. Hope this helps!

source("Desktop/cfem2013/RScripts/parserPerformanceTest.R") [1] "Fri Sep 20 16:40:03 2013 1000 30 1000 : 2.90622393109192e-05" [1] "Fri Sep 20 16:40:05 2013 2000 30 1000 : 5.52681657258449e-05" [1] "Fri Sep 20 16:40:07 2013 3000 30 1000 : 0.000695759671252027" [1] "Fri Sep 20 16:40:08 2013 4000 30 1000 : 0.000174900071201524" [1] "Fri Sep 20 16:40:09 2013 5000 30 1000 : 8.13670670774862e-05" [1] "Fri Sep 20 16:40:10 2013 6000 30 1000 : 0.000615204149652122" [1] "Fri Sep 20 16:40:11 2013 7000 30 1000 : 0.00041035187667894" [1] "Fri Sep 20 16:40:12 2013 8000 30 1000 : 0.000103289822491547" [1] "Fri Sep 20 16:40:13 2013 9000 30 1000 : 0.000211292538982498" [1] "Fri Sep 20 16:40:14 2013 10000 30 1000 : 0.000933927294985178" [1] "Fri Sep 20 16:40:15 2013 11000 30 1000 : 0.000101560393780263" [1] "Fri Sep 20 16:40:16 2013 12000 30 1000 : 0.000752362122897854" [1] "Fri Sep 20 16:40:17 2013 13000 30 1000 : 0.00651831350790621" [1] "Fri Sep 20 16:40:18 2013 14000 30 1000 : 0.00255901017453809" [1] "Fri Sep 20 16:40:18 2013 15000 30 1000 : 0.0158898870803131" [1] "Fri Sep 20 16:40:22 2013 1000 30 2000 : 0.000170173146627243" [1] "Fri Sep 20 16:40:24 2013 2000 30 2000 : 9.65272318007709e-05" [1] "Fri Sep 20 16:40:25 2013 3000 30 2000 : 0.000520496166736829" [1] "Fri Sep 20 16:40:26 2013 4000 30 2000 : 0.000146542099323157" [1] "Fri Sep 20 16:40:28 2013 5000 30 2000 : 0.00425882670557528" [1] "Fri Sep 20 16:40:29 2013 6000 30 2000 : 0.0010560735331781" [1] "Fri Sep 20 16:40:30 2013 7000 30 2000 : 0.000112645860289389" [1] "Fri Sep 20 16:40:31 2013 8000 30 2000 : 0.0040908733564638" [1] "Fri Sep 20 16:40:32 2013 9000 30 2000 : 0.000144320323078425" [1] "Fri Sep 20 16:40:33 2013 10000 30 2000 : 0.00029788406747829"

zc238 commented 10 years ago

Thanks so much for the extremely detailed analysis. I got three important things out of it: first, using Rprof; second, performing operations on vectors whenever possible, which can save a lot of time; and third, that $col has such a performance advantage over [ ] (however indexing is implemented in R). It is very helpful.

Regarding lines 40-54, you are correct; this is a bug. Fortunately, the else branch is not reached in any of the results we presented, but I will correct it. The tips on the other sections of the code you outlined are very helpful, too. I'll close the issue shortly. Thanks!

zc238 commented 10 years ago

The additional pre-processing (using only the nearest quote points) is in filter_trades_quotes <- function(a, volume_limit=10000). This function returns trades plus quotes, so if there are X trades it will return roughly 2*X entries: for each trade it also keeps the entry one index before it. Since trades are relatively sparse compared with quotes, the returned data always contains the latest quote entry prior to each trade.

As for the EMA, the function I used is filter_trades_quotes_EMA <- function(a, time_decay, volume_limit=10000). It takes an additional decay parameter, and its output has the same shape as that of filter_trades_quotes. The only difference is that instead of retrieving just the one quote prior to each trade entry, it considers all the quotes within the specified time decay, computes the EMA of their midpoints, and then combines those EMA values with all the trade entries. This process is very slow.

I'll upload the actual code very shortly.

zc238 commented 10 years ago

After sourcing parser.R, you can still use parserPerformanceTest.R lines 1-30. For later runs we will not optimize over the threshold, and we will reduce the optimization range for the bin and bucket size as well; but for now, I think this segment of code can serve as a good performance test scenario.

rcyeh commented 10 years ago

OK, I see the filter_trades_quotes function, which seems to take 3 seconds for AMZN on my computer.

I think there are some major logic problems with this function.

I am going to guess that you meant to keep just trades up to the volume limit and without certain condition codes, and to try to get the prior array element (which is assumed, but not guaranteed, to be a quote).

Rprof says:

$by.self
               self.time self.pct total.time total.pct
parser.R#5          1.16    32.77       1.48     41.81
"match"             0.52    14.69       2.22     62.71
"seq.default"       0.24     6.78       0.46     12.99
"qsplit"            0.22     6.21       1.70     48.02
parser.R#7          0.16     4.52       2.72     76.84
"FUN"               0.16     4.52       2.62     74.01
"/"                 0.16     4.52       0.16      4.52
"vapply"            0.12     3.39       0.14      3.95
"lapply"            0.10     2.82       2.72     76.84
"seq"               0.10     2.82       0.58     16.38
"[.data.frame"      0.10     2.82       0.12      3.39
"%in%"              0.08     2.26       2.32     65.54
parser.R#400        0.08     2.26       0.68     19.21
"pmin"              0.08     2.26       0.22      6.21
"c"                 0.08     2.26       0.08      2.26
"floor"             0.06     1.69       0.06      1.69

$by.total
                       total.time total.pct self.time self.pct
"filter_trades_quotes"       3.54    100.00      0.00     0.00
"which"                      2.74     77.40      0.00     0.00
parser.R#394                 2.74     77.40      0.00     0.00
parser.R#7                   2.72     76.84      0.16     4.52
"lapply"                     2.72     76.84      0.10     2.82
"hasq"                       2.72     76.84      0.00     0.00
"unlist"                     2.72     76.84      0.00     0.00
"FUN"                        2.62     74.01      0.16     4.52
"%in%"                       2.32     65.54      0.08     2.26
"match"                      2.22     62.71      0.52    14.69
"qsplit"                     1.70     48.02      0.22     6.21
parser.R#5                   1.48     41.81      1.16    32.77
parser.R#400                 0.68     19.21      0.08     2.26
"seq"                        0.58     16.38      0.10     2.82
"seq.default"                0.46     12.99      0.24     6.78

I rewrote to:

filter_trades_quotes2 <- function(a, volume_limit=10000){ #designed to reduce the # of quotes necessary for processing data
  indd <- which(a$type == 'T' & a$size <= volume_limit);
  indd <- indd[! (hasq(32, a$quals[indd]) | hasq(59, a$quals[indd]))];
  ind = unlist(as.list(rbind(indd - 1, indd)));
  ind = ind[ind > 0];
  trades_quotes <- a[ind,]
  return (trades_quotes)
}

And it takes under 0.45 seconds instead of over 3.5 seconds. (Committed at 3b48431feb9b0592c4c5f5a79c6b8ff4cb16ea72.) Rprof still says that the condition-checking is taking the bulk of the time. Perhaps it would be worth reducing the files by filtering out all unwanted quotes and trades in one pass.

Now, stepping back, you are still making a basic assumption that the message prior to each wanted trade is a quote. (It could also be an unwanted trade, which this code doesn't exclude.) If you want to get the most recent quote prior to each trade, you can just copy the data following the technique at https://stat.ethz.ch/pipermail/r-help/2012-March/305223.html

  # extending the values forward 
  values <- c(0, a$bid[a$bid > 0])
  ind <- cumsum(a$bid > 0) + 1
  a$bid <- values[ind]

This took under 0.1 seconds for a single column.
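The same forward-fill can be wrapped in a small helper so it is easy to apply to the other quote columns as well (column names are examples; like the snippet above, this assumes zeros mark missing values and that the column contains no NAs):

    # Forward-fill the last positive value; 0 stands in before the first valid one.
    locf_positive <- function(x) {
      values <- c(0, x[x > 0])
      values[cumsum(x > 0) + 1]
    }
    # e.g. a$bid <- locf_positive(a$bid); a$ask <- locf_positive(a$ask)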

rcyeh commented 10 years ago

This version takes 0.2 seconds, less than half the time of the earlier one. It's based on doing the calculations from qsplit and hasq on whole vectors instead of element-wise, and on removing the which:

filter_trades_quotes3 <- function(a, volume_limit=10000) {
  q1 <- a$quals %% 256
  q2 <- floor(a$quals / 256) %% 256
  q3 <- floor(a$quals / 256 / 256) %% 256
  q4 <- floor(a$quals / 256 / 256 / 256)
  keep <- a$type == 'T' & a$size <= 2000 & q1 != 32 & q2 != 32 & q3 != 32 & q4 != 32 & q1 != 59 & q2 != 59 & q3 != 59 & q4 != 59
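  # note: the 2000 above looks hard-coded; presumably it should be volume_limit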
  trades_quotes <- a[keep | c(keep[2:length(keep)], FALSE),]
  return (trades_quotes)
}

Committed at 19d9b5cb77faab5ca65077fb408a523ed81948f5

zc238 commented 10 years ago

Thanks. Jia will comment on this a bit later, too. But one thing has been puzzling me: this function at most doubles the number of entries the parser has to process (< 2X, since some trades are adjacent to each other), yet the processing time of calc_OI_by_time_buckets roughly tripled on my machine (8 s vs. 25 s for AMZN).

zc238 commented 10 years ago

I see the logic error you were referring to --- you were right!

rcyeh commented 10 years ago

Regarding the EMA, I have the following comments:

1 = (constant) weight of each observation
x = value of each new observation
t = time delta of each new observation relative to the previous observation

a = exp(-t / decay)
running sum = previous running sum * a + x
running weight = previous running weight * a + 1

value = running sum / running weight

Please implement this in R. Try to use the cumsum and diff functions.
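For illustration, one cumsum-based formulation of the recursion above could look like this sketch (names are placeholders; note that the exponential weights grow with time, which is the overflow issue discussed further down in this thread):

    # Sketch of a cumsum-based EMA. `times` and `x` are numeric vectors sorted by
    # time; `decay` is in the same units as `times`. The un-normalized weights
    # exp((times - times[1]) / decay) overflow for long spans or fast decays.
    ema_sketch <- function(times, x, decay) {
      w <- exp((times - times[1]) / decay)   # un-normalized weights, grow with time
      cumsum(x * w) / cumsum(w)              # running sum / running weight
    }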

zc238 commented 10 years ago

will do.

jx226 commented 10 years ago

Thanks for the comment! The prolonged parsing seems to be related to the delay_quotes_xms() function: after applying it, the parsing somehow slows down. We'll try to address this problem.

rcyeh commented 10 years ago

@jx226 , I just looked at delay_quotes_xms(). For AMZN, it took over 0.5 seconds.

delay_quotes_xms <- function(data_a, delay_time){
  #options(digits.secs=6)
  #data_a$time <- strptime(data_a$time,"%H:%M:%OS")
  data_a[data_a$type == 'Q',]$time <- data_a[data_a$type == 'Q',]$time + delay_time
  return(data_a[with(data_a, order(time)), ])
}

delay_quotes_xms2 <- function(data_a, delay_time){
  #options(digits.secs=6)
  #data_a$time <- strptime(data_a$time,"%H:%M:%OS")
  data_a$time[data_a$type == 'Q'] <- data_a$time[data_a$type == 'Q'] + delay_time
  return(data_a[order(data_a$time), ])
}

(Committed at 967cd008b8e558ef015ee8cb40bf0d52086a0795.) delay_quotes_xms2 takes under 0.16 seconds, about 1/3 as long. I would encourage everyone to review the use of indexing everywhere in the code.

But you mentioned that this function slows down the parsing. I can see that in jia.R, line 18 is unnecessary because you've already sorted by time in the delay_quotes_xms function. Are you saying that the function itself seems to change the data in a way that makes the rest of the processing slow?

Also, instead of using a$time <- a$time - a$latency*0.001, why not just use a$exchange_time (milliseconds)? Sure, you give up microsecond resolution, but that shouldn't be a big deal here.

jx226 commented 10 years ago

@rcyeh Thanks for the reply.

My observation from processing the data was that the delay_quotes_xms function had slowed down the processing. But I will pay attention to this in future analysis.

Also, I will use exchange_time instead of adjusting the time with latency in the code.

I finished writing the function to calculate the quotes EMA in parser.R. It takes 1.28 seconds for AMZN. I think the result below is satisfactory. Please take a look and comment.

$by.self
                self.time self.pct total.time total.pct
"FUN"                0.76    59.38       1.06     82.81
"mean.default"       0.30    23.44       0.30     23.44
"apply"              0.06     4.69       1.16     90.62
"*"                  0.04     3.12       0.04      3.12
"plot.xy"            0.04     3.12       0.04      3.12
"lapply"             0.02     1.56       0.04      3.12
"axis"               0.02     1.56       0.02      1.56
"data.frame"         0.02     1.56       0.02      1.56
"plot.new"           0.02     1.56       0.02      1.56

$by.total
                  total.time total.pct self.time self.pct
"cal_quotes_EMA"        1.28    100.00      0.00     0.00
"standardGeneric"       1.24     96.88      0.00     0.00
"apply"                 1.16     90.62      0.06     4.69
4                       1.16     90.62      0.00     0.00
"FUN"                   1.06     82.81      0.76    59.38
"mean.default"          0.30     23.44      0.30    23.44
"plot.default"          0.06      4.69      0.00     0.00
"plot"                  0.06      4.69      0.00     0.00
6                       0.06      4.69      0.00     0.00
"*"                     0.04      3.12      0.04     3.12
"plot.xy"               0.04      3.12      0.04     3.12
"lapply"                0.04      3.12      0.02     1.56
"unlist"                0.04      3.12      0.00     0.00
"axis"                  0.02      1.56      0.02     1.56
"data.frame"            0.02      1.56      0.02     1.56
"plot.new"              0.02      1.56      0.02     1.56
"Axis.default"          0.02      1.56      0.00     0.00
"Axis"                  0.02      1.56      0.00     0.00
"lines.default"         0.02      1.56      0.00     0.00
"lines"                 0.02      1.56      0.00     0.00
"localAxis"             0.02      1.56      0.00     0.00
3                       0.02      1.56      0.00     0.00
5                       0.02      1.56      0.00     0.00
7                       0.02      1.56      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 1.28

rcyeh commented 10 years ago

@jx226 I see cal_quotes_EMA and here are my comments:

  1. I think the code is correct.
  2. Good choice to divide the millisecond exchange times by 1e5 --- this avoids overflow problems. I would even suggest dividing more, in case someone inputs an alpha = 0.1 or 0.01. A day has about 2e4 seconds, and exp(1e3) returns Inf; so we can't expect to use a time decay faster than 20 seconds. Hmm, that may be a problem.
  3. I have a feeling that 1.28 seconds (2.94 seconds on my five-year-old Pentium Dual-Core) to calculate an EMA on 100,000 points is way too slow. So, I think it's good we have time during this project to practice and get a sense of what we should expect. In the profiling results, you see lots of calls to mean and apply, and we shall fix those. Replacing:
  mid_quotes <- apply(data.frame(a$bid[a$type == "Q"],a$ask[a$type == "Q"]),1,mean)

with

  mid_quotes <- 0.5 * (a$bid[a$type == "Q"] + a$ask[a$type == "Q"])

cuts the time for the whole function to 0.08 seconds on my old laptop, which I have committed at a86a7952fbc14d297ad6493611532650599c0d90.

It's not that apply is slow --- actually, it's at least as fast as any for loop processing element-wise. But I suspect mean of two elements is slow. Further, mean may do special checking for NA values, which the arithmetic version doesn't do. (A quick timing check is sketched after this list.)

  4. I just git pulled the latest version of the repository and see that you have added steps to sort a, add the ema_mid column to a, and to filter_trades_quotes3. I suggest you sort a on loading from h5read, not when you calculate the EMA, because you might calculate the EMA multiple times. Also, for organizational purposes, I suggest you have a single function that calculates the ema, and a separate one that puts it into the data frame and filters the data frame.
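Here is the quick timing check referred to in point 3, using only base R (illustrative; the numbers will vary by machine):

    x <- c(1.0, 2.0)
    system.time(for (i in 1:1e6) mean(x))             # generic dispatch plus checks on every call
    system.time(for (i in 1:1e6) 0.5 * (x[1] + x[2])) # plain arithmetic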
jx226 commented 10 years ago

@rcyeh Thanks for commenting and improving the function. I reorganized the EMA function as you suggested in the post.

I also checked the AMZN trading hours: the data actually spans 9.1 hours (about 3e4 seconds). The largest calculable exp(x) is around exp(700), so we can't even expect to use a time decay faster than about 40 seconds. I suggest we split the data into two parts if we need a decay quicker than 40 seconds.
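A rough sketch of what that splitting could look like while keeping the EMA exact across the boundary --- the function, its name, and the block size are illustrative, not code from the repo; it just evaluates the running-sum/running-weight recursion block by block:

    # t = times (sorted, same units as decay), x = values (e.g. quote midpoints).
    ema_blocked <- function(t, x, decay, block = 1000) {
      n <- length(x)
      out <- numeric(n)
      rs <- 0          # running sum, as in the recursion above
      rw <- 0          # running weight
      t_ref <- t[1]    # time the carried state is expressed at
      for (s in seq(1, n, by = block)) {
        e <- min(s + block - 1, n)
        u <- exp((t[s:e] - t_ref) / decay)   # exponent limited to one block's time span
        cs <- rs + cumsum(x[s:e] * u)        # running sum, re-based at t_ref
        cw <- rw + cumsum(u)                 # running weight, re-based at t_ref
        out[s:e] <- cs / cw                  # EMA values for this block
        scale <- exp(-(t[e] - t_ref) / decay)
        rs <- cs[e - s + 1] * scale          # carry state forward, re-based at t[e]
        rw <- cw[e - s + 1] * scale
        t_ref <- t[e]
      }
      out
    }

Splitting the data into two parts corresponds to block = ceiling(length(x) / 2); smaller blocks just extend the same idea.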

I think this span indicates that there are pre-market/after-market trades included in the data. Given the format of the time field, it is hard to find the exact actual time of the trades, as you can see in the trade data below. So I looked at the first ten and last ten trades on 20130423.

I found trades from various exchanges (D, Q, P, K) in both parts. I thought exchanges only trade from 9:30 am to 4:00 pm. Could you please explain this for me? Also, what does the variable sub_market mean?

First ten trades:

       time                       latency symbol exchange exchange_time seq_no  price size volume     quals
128    2013-10-01 12:05:36.207671       2   AMZN        D  1.366719e+12   1748 265.00 2500   2800 371064980
6180   2013-10-01 12:31:22.492084       3   AMZN        Q  1.366720e+12   2266 264.50  300   3100        30
6944   2013-10-01 12:37:21.318047       4   AMZN        Q  1.366721e+12   2403 264.30  100   3200   1971968
6946   2013-10-01 12:37:21.384664       3   AMZN        P  1.366721e+12   2404 264.40  100   3300   1971968
6952   2013-10-01 12:40:13.243370       4   AMZN        P  1.366721e+12   2469 264.00  100   3400        30
6956   2013-10-01 12:40:57.665597       3   AMZN        Q  1.366721e+12   2481 263.80  100   3500   1971968
6963   2013-10-01 12:45:17.005688       3   AMZN        Q  1.366721e+12   2549 263.98  100   3600   1971968
7638   2013-10-01 13:00:52.189858       4   AMZN        K  1.366722e+12   2757 263.97  100   3700        30
7885   2013-10-01 13:14:22.975349       4   AMZN        D  1.366723e+12   3083 264.00  100   3800        30
7889   2013-10-01 13:16:27.974699       3   AMZN        P  1.366723e+12   3141 264.00  100   3900        30
     market_status instrument_status thru_exempt sub_market line type
128             66                66          NA          Q  513    T
6180            66                66          NA          D  513    T
6944            66                66          NA          D  513    T
6946            66                66          NA          D  513    T
6952            66                66          NA          D  513    T
6956            66                66          NA          D  513    T
6963            66                66          NA          D  513    T
7638            66                66          NA          D  513    T
7885            66                66          NA          Q  513    T
7889            66                66          NA          D  513    T

Last ten trades:

       time                       latency symbol exchange exchange_time  seq_no    price size  volume
106871 2013-10-01 20:30:31.989252       3   AMZN        Q  1.366749e+12 1229992 268.7900  400 2266216
106877 2013-10-01 20:35:10.068124       4   AMZN        D  1.366749e+12 1230151 268.9000  604 2266820
106898 2013-10-01 20:45:39.616956       3   AMZN        D  1.366750e+12 1230791 268.8000  100 2266920
106899 2013-10-01 20:46:20.167681       3   AMZN        Q  1.366750e+12 1230805 268.8000  100 2267020
106905 2013-10-01 20:50:03.085789       3   AMZN        P  1.366750e+12 1231012 269.1900  200 2267220
106908 2013-10-01 20:50:03.087305       3   AMZN        P  1.366750e+12 1231013 269.1900  200 2267420
106909 2013-10-01 20:50:07.584389       4   AMZN        P  1.366750e+12 1231193 269.1900  100 2267520
106981 2013-10-01 21:11:09.936298       4   AMZN        D  1.366751e+12 1245609 268.8999 2195 2269715
106985 2013-10-01 21:11:35.663446       3   AMZN        K  1.366751e+12 1245689 269.2000  199 2269914
106990 2013-10-01 21:12:45.091304       4   AMZN        K  1.366752e+12 1245930 269.2000  100 2270014
           quals market_status instrument_status thru_exempt sub_market line type
106871   1971968            82                82          NA          D  513    T
106877 371064832            82                82          NA          Q  513    T
106898        30            82                82          NA          Q  513    T
106899   1971968            82                82          NA          D  513    T
106905   1971968            82                82          NA          D  513    T
106908   1971968            82                82          NA          D  513    T
106909   1971968            82                82          NA          D  513    T
106981 371064832            82                82          NA          Q  513    T
106985        30            82                82          NA          D  513    T
106990        30            82                82          NA          D  513    T

In case you need to run this on your laptop, I attached the code:

a <- h5read("ticks.20130423.h5", "/ticks/AMZN", bit64conversion='double')
options(digits.secs=6)
a$time <- strptime(a$time, "%H:%M:%OS")
cols <- unlist(strsplit("time|latency|symbol|exchange|exchange_time|seq_no|price|size|volume|quals|market_status|instrument_status|thru_exempt|sub_market|line|type", "\\|"))
trades <- a[a$type == 'T', cols]
max(trades$time) - min(trades$time)
trades[1:10,]
n <- dim(trades)[1]
trades[(n-9):n,]

rcyeh commented 10 years ago

@jx226 I see there is this overflow problem. Note that we want to decay quoted mid-market prices, not trades; but the same problem remains.

I don't understand what you mean that it is hard to tell what time each message occurred. I agree that the exchange_time field (in milliseconds) is hard to read when displayed with scientific notation; but you have the time field (in UTC).

There are pre-market and after-market trades on many exchanges --- hence the "Extended hours" condition code, which you can filter out. As for sub-market, see the second reference at the bottom of https://github.com/rcyeh/cfem2013/blob/master/reference_data/tick_file_formats.txt, which should lead you to https://www.nasdaqtrader.com/content/technicalsupport/specifications/utp/utdfspecification.pdf. I don't think it is a useful field.

If I do a non-overflowing cal_quotes_EMA3:

cal_quotes_EMA3 <- function(a, alpha) {
  dt <- c(1e3, diff(a$exchange_time[a$type == 'Q']))
  iw <- exp(-dt * alpha)
  mid <- 0.5 * (a$bid[a$type == 'Q'] + a$ask[a$type == 'Q'])
  emamid <- mid
  rs <- 0
  rw <- 0
  for (i in 1:length(emamid)) {
    rs <- rs * iw[i] + mid[i]
    rw <- rw * iw[i] + 1
    emamid[i] <- rs / rw
  }
  return(emamid)
}

then it's slow again, but the for loop is still 2x-3x as fast as:

rs <<- 0
rw <<- 0
emamid <- apply(data.frame(iw, mid), 1, function (x) { rs <<- rs * x[1] + x[2]; rw <<- rw * x[1] + 1; return (rs/rw); })

I don't have anything faster at this time.

zc238 commented 10 years ago

Hi, I am happy with the current performance. It is not the fastest possible, but I think we can work with it. As for the decay rate parameter: even though we are now considering all historical quotes instead of just a pre-specified interval, it should serve the general purpose.

As a precursor to the draft report: we find that regressing the SOI*cross term against EMA(mid quotes) yields great results for both concurrent and one-step-ahead prediction, but when applied to the quotes, the results are not as satisfactory.

jx226 commented 10 years ago

@rcyeh Thanks for the prompt reply.

By saying that I did not understand the time, I meant that the time for the first trade is given as "2013-10-01 12:05:36.207671", which confused me. I thought I should use this time to separate trading-hour from non-trading-hour trades.

Also, thanks for explaining the "Extended hours" condition code to me. Since we have hasq() to filter out non-trading-hour trades, I think it is no longer necessary to figure out the right display of the time variable for identification purposes.

rcyeh commented 10 years ago

@zc238 I don't follow exactly. What is the cross term? What are you applying to quotes? What are great results and unsatisfactory results?

@jx226 12:05:36.... UTC is 8:05:36... EDT. 9:30 EDT is 13:30 UTC.

zc238 commented 10 years ago

@rcyeh So lm(log(EMA[QMid(t)]/EMA[QMid(t-1)]) ~ SOI(t) * Var(bucket prices(t))) and lm(log(EMA[QMid(t+1)]/EMA[QMid(t)]) ~ SOI(t) * Var(bucket prices(t))) give good R^2 for certain stocks. But if we do buy/sell classification based on EMA(QMid(t)) and regress QMid(t) against SOI(t) * Var(bucket prices) --- i.e., the old approach except that the buy/sell classification rule is based on the EMA of mid quotes --- it does not improve the results by much. More details will be in the report.
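For concreteness, the two regressions could be sketched as follows --- the column names (ema_ret, ema_ret_next, soi, bvar) and the buckets data frame are hypothetical placeholders, not the project's actual variables:

    # ema_ret      = log(EMA[QMid(t)] / EMA[QMid(t-1)])   (one row per bucket t)
    # ema_ret_next = the same quantity one bucket ahead
    # soi = SOI(t), bvar = Var(bucket prices(t))
    fit_concurrent <- lm(ema_ret      ~ soi * bvar, data = buckets)  # `*` = main effects + interaction
    fit_one_ahead  <- lm(ema_ret_next ~ soi * bvar, data = buckets)
    summary(fit_concurrent)$r.squared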