rcyeh / cfem2013

Cornell Financial Engineering Manhattan 2013 Project

Build code to analyze trading venue factors #11

Open jf434 opened 11 years ago

jf434 commented 11 years ago

Build code to analyze how time, quoting exchange, spread, and tick size (numerical factors available in the existing data set) affect the distribution of trading across venues.

rcyeh commented 11 years ago

I think some of the factors are longer-term or structural and others are shorter-term in nature. For example, primary listing exchange usually does not change from day to day; but the exchange quote that forms part of the NBBO may change from quote to quote.

To analyze structural factors, I would aggregate shares traded over the entire day for each symbol. (Or alternatively, put shares traded in the first and last 10 minutes in one group, and between 9:40 and 15:50 in another group.) I might then re-aggregate by primary listing exchange, average price range, and median spread.
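As a minimal sketch of the two-step aggregation described above (a Python illustration with hypothetical records and column layout; the project's own code is in R):

```python
from collections import defaultdict

# Hypothetical trade records: (symbol, primary listing exchange, shares traded).
trades = [
    ("CSCO", "Q", 100), ("CSCO", "Q", 200),
    ("AMZN", "Q", 300), ("AMZN", "Q", 50), ("AMZN", "Q", 150),
]

# Step 1: aggregate shares traded over the entire day for each symbol.
by_symbol = defaultdict(int)
for symbol, listing, size in trades:
    by_symbol[symbol] += size

# Step 2: re-aggregate by primary listing exchange.
by_listing = defaultdict(int)
for symbol, listing, size in trades:
    by_listing[listing] += size

print(dict(by_symbol))   # {'CSCO': 300, 'AMZN': 500}
print(dict(by_listing))  # {'Q': 800}
```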

I think the exchange distribution based on shorter-term factors may be harder to do, partly because we have only supplied the NBBO quotes. (Our data from September 2011 through February 2013, in a different format and not yet provided, include top-of-exchange quotes instead of NBBO quotes and may be more helpful for this.)

jf434 commented 11 years ago

When aggregating the data by time, it seems that there are no shares traded in the first and last 10 minutes. That leaves us with three groups: morning/mid-day/afternoon. There seem to be some small contributions from price range and primary exchange, but those are based on only two tickers (CSCO/AMZN); we may need to expand to a larger set of tickers to validate this.

I would consider speed/latency to be another factor. Intuitively, traders will opt for the exchange with a smaller delay given the same price level, and an exchange with more processing power will be at an advantage. Is the "latency" provided in our data set the trading delay in the system, or the delay for the quote?

kz93 commented 11 years ago

Latency is defined as (time - exchange_time) in milliseconds. I guess it is the delay for transmitting the order to the exchange? I agree it would be a good factor to consider.

rcyeh commented 11 years ago

To amplify, the latency field is the delay between when the quote disseminator generated the message and when we received the message. Different market participants will have slightly different delays in transmitting orders, and this is not related to the latency data column.
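A tiny sketch of that definition, with made-up timestamps (the variable names are illustrative, not the data set's column names):

```python
from datetime import datetime

received_time = datetime(2013, 4, 23, 13, 30, 0, 125000)  # when we received the message
exchange_time = datetime(2013, 4, 23, 13, 30, 0, 100000)  # when the disseminator generated it

# latency = (time - exchange_time) in milliseconds, per the field definition above
latency_ms = (received_time - exchange_time).total_seconds() * 1000.0
print(latency_ms)  # 25.0
```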

I am surprised that there are no shares traded in the first and last 10 minutes. Can you put the code you used to perform the aggregation into git so that I can take a look?

I do not understand your comment "There seems to be some small contributions from price range and primary exchange. ... based on two tickers only." Can you describe your model (also commit to git). When I suggested price range versus tick size, I was meaning to compare across tickers, not within a single ticker, because I thought the effect would be clearer when seen across a larger price range. I didn't expect the effect would be observable within the price range of a single ticker but would be interested to know if you can resolve that.

jf434 commented 11 years ago

For now, we are only looking at a few tickers, one at a time, to see if there is any pattern of interest before committing to fully developing the code across all tickers. I wouldn't call it a model: we aggregate the data by time into 4 groups, and then for each group we look at the size of the trades against the exchange for different tickers. I am looking to expand this to all tickers to remove the ambiguity.

What we naively observe so far (based on a few tickers):

  1. There is a wider distribution of prices for a ticker traded on its primary exchange.
  2. There are some fluctuations in trading exchange across the time groupings, but they seem minor.

The latest code is committed under /RScripts/ along with some plots generated by the code. The aggregation code file is "Venue_Analytics.R". There is also a Python parser that reads in the tickers and then calls the Yahoo Finance API to look up the primary exchange. I wonder if there is an easier way to accomplish this?

For aggregation, quoting starts at 8:00 am EDT and ends at around 17:15 EDT, which corresponds to 12:00 UTC and 21:15 UTC. From there, we develop the aggregation using the converted UTC time. I select all times before 12:10 UTC (first 10 minutes of trading), 12:10 to 13:40 (morning), 13:40 to 19:50 (mid-day), 19:50 to 21:05 (afternoon), and after 21:05 (last 10 minutes). I wonder if I made a mistake somewhere.


rcyeh commented 11 years ago

Thanks for pointing me to the code. I saw it earlier but didn't read it carefully enough. I would have thought to do something like:

aggregate(trades$size, by=list(exchange=trades$exchange, timegrp=trades$timegrp, ...), FUN=sum)

Primary exchange: get < ftp://ftp.nasdaqtrader.com/SymbolDirectory/otherlisted.txt > and look in the third column "Exchange". The codes are the same as in the quotes: N = New York, A = Amex, P = Arca, Z = BATS. Nasdaq-listed stocks are at < ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt >. This is a current snapshot, which is good enough to get started, because during the period of the data, there have not been many exchange changes.
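A short sketch of reading the symbol-to-listing mapping from that file (the file is pipe-delimited; the header names below are an assumption, so check the downloaded file; the real file also ends with a "File Creation Time" footer row to skip):

```python
import csv
import io

# A small inline sample in the pipe-delimited format of otherlisted.txt
# (header names are an assumption; verify against the downloaded file).
sample = """ACT Symbol|Security Name|Exchange|CQS Symbol|ETF|Round Lot Size|Test Issue|NASDAQ Symbol
A|Agilent Technologies, Inc.|N|A|N|100|N|A
SPY|SPDR S&P 500 ETF Trust|P|SPY|Y|100|N|SPY
"""

# Map each symbol to its listing exchange code (N/A/P/Z as above).
listing = {}
reader = csv.DictReader(io.StringIO(sample), delimiter="|")
for row in reader:
    listing[row["ACT Symbol"]] = row["Exchange"]

print(listing)  # {'A': 'N', 'SPY': 'P'}
```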

In the US equity market, market hours are from 9:30 to 16:00 (13:30 to 20:00 UTC during the summer, which is the period covered in the data you have). The volumes prior to 9:30 and after 16:00 are very small and should be excluded because the market has many fewer participants outside of market hours. They are possibly interesting, but we can narrow our scope to focus on the volumes between 9:30 and 16:00. I would define the first 10 minutes to be 13:30-13:40 and the last 10 minutes to be 19:50-20:00.
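A sketch of that session bucketing on UTC "HH:MM" strings (group labels are my own; summer hours assumed, per the data period):

```python
def session(hhmm: str) -> str:
    """Bucket a UTC HH:MM timestamp into the market-hour groups above
    (summer hours: 13:30-20:00 UTC). String comparison works because
    HH:MM sorts lexicographically in time order."""
    if hhmm < "13:30" or hhmm >= "20:00":
        return "outside"   # pre-/post-market: exclude from the analysis
    if hhmm < "13:40":
        return "first10"   # 13:30-13:40
    if hhmm < "19:50":
        return "midday"    # 13:40-19:50
    return "last10"        # 19:50-20:00

print(session("13:35"), session("15:00"), session("19:55"), session("08:00"))
# first10 midday last10 outside
```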

For your point 2, on the fluctuation of trading exchange for each time grouping, are you saying that you have a distribution of volumes by trading exchange, and that the distribution does not change much from one time period to the next, for those two stocks? What would be a good way to quantify the distribution and the change therein?

jf434 commented 11 years ago

Cool, thanks for the tip on getting the primary exchange. That is a lot faster than manually calling the Yahoo API one ticker at a time; even converted to a threaded version running in parallel, the Python script was quite slow.

For the time, I got the times wrong in the previous aggregation. Now the program no longer comes back empty for the first 10 minutes of trading.

Sorry about the confusion on point 2. Now that I have changed the code to the correct time aggregation, there are some slight changes: what I noticed is a distribution that does not change from one stock to another. In the early trading period (13:30-13:40), more trades go through exchange Q; intraday (13:40-19:50), the trades are more or less evenly distributed across exchanges; near the end of day (19:50-20:00), more trades go through exchange D. But what is D? D is the trade display facility, so it is not an exchange and we can exclude it. The remaining trades go through the exchanges equally. To sum up, the distribution does change from one time period to another, but it shows similar changes across the three tickers. There could be bias, though, as all three tickers are technology stocks (MSFT/CSCO/YHOO).

rcyeh commented 11 years ago

That's a very interesting observation!

When you say "more trades", do you mean the count of T events or the sum of shares? If the latter, does Nasdaq (Q) still dominate if you exclude the opening cross trade (with qualifier condition 53 (Opening Prints))?

For Nasdaq- and Arca-listed stocks, the opening cross trade will come soon after 9:30:00.000. For NYSE- and Amex-listed stocks, the opening cross trade can come late --- in principle at any time --- but 9:40 is a good cutoff.

D is the trade-reporting facility for off-exchange prints. Don't exclude this --- it's significant. What's the size of the difference?

jf434 commented 11 years ago

I meant both; interestingly, they agree with each other. What I did before was select qualifiers 0, 6, 23, 33, 53, 58, 59. Now, if I exclude condition 53, the result is that exchange D dominates, with Q coming second for the first 10 minutes (D is roughly 1.5-2X the T count and shares of Q). The midday and last 10 minutes are also dominated by D (accounting for 40-50% of the T count and shares). This observation holds for all three tickers.
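The qualifier filtering plus re-aggregation can be sketched like this (Python illustration with made-up prints; the qualifier codes are the ones listed above):

```python
from collections import defaultdict

# Hypothetical prints: (exchange, qualifier condition, shares).
# Qualifier 53 is the Opening Prints condition mentioned above.
prints = [
    ("Q", 53, 5000), ("Q", 0, 100),
    ("D", 0, 300), ("D", 6, 200), ("D", 23, 400),
    ("Z", 0, 150),
]

KEEP = {0, 6, 23, 33, 53, 58, 59}

shares = defaultdict(int)
for exchange, qualifier, size in prints:
    if qualifier in KEEP and qualifier != 53:  # exclude the opening cross
        shares[exchange] += size

print(dict(shares))  # D leads once the opening print is excluded
```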

The difference between the first-place (Q) and second-place exchange (D or Z, depending on the time) is significant: around 1.5-2X for the first 10 minutes, and roughly 2-3X for midday and the last 10 minutes.

rcyeh commented 11 years ago

Wait, I didn't entirely understand your last message. In your first paragraph, you said that D dominates. In the second paragraph, you said that Q is in first place. Is that a contradiction?

jf434 commented 11 years ago

D should dominate. You are absolutely right; it was a mistake in the second paragraph. It should be D.

rcyeh commented 11 years ago

OK, thanks. Let's expand the analysis to all stocks for that one day.

jf434 commented 11 years ago

The analysis was expanded to all stocks listed on Q. In the first 10 minutes, the flow is dominated by D (D is roughly 3.5X Q on shares). Intraday, the flow is strongly dominated by D (D makes up 50% of all share flow, roughly 2X Q on shares). The last 10 minutes look the same as the first 10 minutes.

For stocks listed on other exchanges, the trends are mostly similar, with D dominating and the second-busiest venue being the primary exchange. The result file is available under /RScript/Trade_Flow_summary.

There were some errors when reading all the tickers listed on Q. I am not sure about the exact message, but RStudio just said it "encountered a fatal error". It took some time to figure out that "QQQ" and "TQQQ" are the tickers that cause the problem: if I read those two tickers from the h5 file, the program errors out, and my suspicion is that the data for those two are corrupted. Since those two are not really equities, I took them out, along with the test tickers (e.g., ZXZZT, ZVZZT), when performing the above analysis.

Now I will break the D flows into D1 and D2. I am thinking of defining D1 as D trade flows that trade at +/-5% of the mid of the previous quote, and D2 as the rest of the D flows.


jf434 commented 11 years ago

Upon analyzing the D trade flows, it seems the trade prices are all quite close to the mid. A 5% boundary doesn't separate any flows, as the trades are generally within 0.01% of the previous quote mid. So D1 is now defined as +/-0.02% of the mid of the previous quote. I wonder if that is too tight a bound to be insightful.

rcyeh commented 11 years ago

When you say percentage, are you calculating it as a percentage of the mid price, or as a percentage of the spread? I would use spread, because spreads vary widely, from 1 to 100 (and more) basis points of the mid price.

I think the threshold should be: D1 = trades where price = mid = 0.5 × (prevailing bid price + prevailing ask price). I don't think you need a tolerance. If you do want to supply a tolerance, try using mid +/- $0.0001 (absolute), not a relative fraction.
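That rule can be sketched in a few lines (Python illustration; the function name and default tolerance follow the suggestion above):

```python
def classify_d(price: float, bid: float, ask: float, tol: float = 0.0001) -> str:
    """D1 if the trade printed at the prevailing mid, within an absolute
    $0.0001 tolerance to absorb floating-point rounding; otherwise D2."""
    mid = 0.5 * (bid + ask)
    return "D1" if abs(price - mid) <= tol else "D2"

print(classify_d(10.005, 10.00, 10.01))  # D1: printed exactly at the mid
print(classify_d(10.00,  10.00, 10.01))  # D2: printed at the bid
```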

rcyeh commented 11 years ago

I will take a look at the QQQ and TQQQ.

jf434 commented 11 years ago

I meant percentage of mid. I redid the calculation using the prevailing bid and prevailing ask to get the mid. The result is that D1 comprises about 10-20% of all D flows, regardless of the time. I set the tolerance to +/-$0.0001 absolute, as you suggested, since there is a small rounding error (~0.0000001) in the computation.

rcyeh commented 10 years ago

Regarding QQQ and TQQQ, on my Windows computer I was able to do:

library(rhdf5)
a <- h5read("C:/data/h5/ticks.20130423.h5", "/ticks/QQQ", bit64conversion="double")
colnames(a)
a$hhmm = substring(a$time, 0, 5)
head(a$hhmm)
table(a$hhmm)
a$session <- ifelse(a$hhmm < "13:30", "pre-mkt", "930-940")
a$session <- ifelse(a$hhmm < "13:40", a$session, "940-1550")
a$session <- ifelse(a$hhmm < "19:50", a$session, "1550-1600")
a$session <- ifelse(a$hhmm < "20:00", a$session, "aft-mkt")
table(a$session)
aggregate(a$size, by=list(exchange=a$exchange, session=a$session), FUN=sum)

without any errors. QQQ and TQQQ do have significant quoting activity, so these data frames are large. Can you figure out what else might be causing the error? Is it possible that your process ran out of memory?