quotesCleanUp to deal with quotes data from SIP source

stucash commented 2 years ago

I am trying to clean up quotes data requested from a SIP source (e.g., Polygon.io: https://polygon.io/docs/stocks/get_v3_quotes__stockticker), which comes directly from the exchanges.

The package expects a single EX (exchange) column, whereas the SIP source gives me separate AskExchange and BidExchange columns; this is going to cause trouble during clean-up.

sampleQDataRaw:

                             DT EX    BID BIDSIZ    OFR OFRSIZ SYMBOL
     1: 2018-01-02 04:04:13.125  P 156.57      1 158.85      1    XXX
     2: 2018-01-02 04:05:05.979  P 156.55      1 158.85      1    XXX
     3: 2018-01-02 04:05:44.750  P 156.58      1 158.85      1    XXX
     4: 2018-01-02 04:05:45.190  P 156.58      1 158.85      1    XXX
     5: 2018-01-02 04:08:44.904  P 156.61      1 158.86      1    XXX
    ---                                                              
131397: 2018-01-03 16:00:45.029  T 155.80      2 157.74      1    XXX
131398: 2018-01-03 16:02:16.849  P 157.01      4 157.59      4    XXX
131399: 2018-01-03 16:05:18.740  N 157.26      1 157.28      1    XXX
131400: 2018-01-03 16:19:31.470  K 157.25      2 157.66      3    XXX
131401: 2018-01-03 19:57:31.880  K 157.15      1 157.66      3    XXX

My sample SIP data:

                        DateTime Symbol AskEx AskPrice AskSize BidEx BidPrice BidSize Conditions Tape
      1: 2022-05-11 08:00:00.130     GS     K   328.00       1     P   305.40       1      ['R']    A
      2: 2022-05-11 08:00:44.471     GS     K   328.00       1     P   302.35       1      ['R']    A
      3: 2022-05-11 08:00:44.705     GS     K   328.00       1     P   301.66       1      ['R']    A
      4: 2022-05-11 08:00:44.822     GS     K   328.00       1     K   300.00       1      ['R']    A
      5: 2022-05-11 08:01:42.138     GS     P   318.69       4     K   300.00       1      ['R']    A
     ---                                                                                             
1149996: 2022-05-11 17:22:54.851    BAC     N    35.93      48     N    35.92      30      ['R']    A
1149997: 2022-05-11 17:22:54.888    BAC     N    35.93      48     N    35.92      45      ['R']    A
1149998: 2022-05-11 17:22:54.888    BAC     N    35.93      60     N    35.92      45      ['R']    A
1149999: 2022-05-11 17:22:54.951    BAC     N    35.93      60     N    35.92      30      ['R']    A
1150000: 2022-05-11 17:22:54.959    BAC     N    35.93      60     N    35.92      51      ['R']    A

Please ignore the irrelevant columns; the only purpose of the sample is to show the two exchange columns.

Am I missing something before approaching the clean-up? Or does it not really matter which one I pick, AskExchange or BidExchange, in the case of SIP data? The issue for me is that AskExchange and BidExchange are not identical, so the choice changes what remains after the clean-up.
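To make the concern concrete, here is a minimal sketch in plain Python using the first five GS rows of the SIP sample above. The helper `to_single_ex` and the "keep exchange P" filter are hypothetical illustrations, not part of highfrequency; the filter only stands in for any clean-up step that selects a single exchange.

```python
# First five rows of the SIP sample above (Conditions and Tape omitted).
sip_rows = [
    {"DateTime": "2022-05-11 08:00:00.130", "Symbol": "GS", "AskEx": "K",
     "AskPrice": 328.00, "AskSize": 1, "BidEx": "P", "BidPrice": 305.40, "BidSize": 1},
    {"DateTime": "2022-05-11 08:00:44.471", "Symbol": "GS", "AskEx": "K",
     "AskPrice": 328.00, "AskSize": 1, "BidEx": "P", "BidPrice": 302.35, "BidSize": 1},
    {"DateTime": "2022-05-11 08:00:44.705", "Symbol": "GS", "AskEx": "K",
     "AskPrice": 328.00, "AskSize": 1, "BidEx": "P", "BidPrice": 301.66, "BidSize": 1},
    {"DateTime": "2022-05-11 08:00:44.822", "Symbol": "GS", "AskEx": "K",
     "AskPrice": 328.00, "AskSize": 1, "BidEx": "K", "BidPrice": 300.00, "BidSize": 1},
    {"DateTime": "2022-05-11 08:01:42.138", "Symbol": "GS", "AskEx": "P",
     "AskPrice": 318.69, "AskSize": 4, "BidEx": "K", "BidPrice": 300.00, "BidSize": 1},
]


def to_single_ex(row, side):
    """Map one SIP row to the DT/EX/BID/BIDSIZ/OFR/OFRSIZ/SYMBOL layout
    of sampleQDataRaw, taking EX from the bid or the ask side (hypothetical helper)."""
    if side not in ("bid", "ask"):
        raise ValueError("side must be 'bid' or 'ask'")
    return {
        "DT": row["DateTime"],
        "EX": row["BidEx"] if side == "bid" else row["AskEx"],
        "BID": row["BidPrice"],
        "BIDSIZ": row["BidSize"],
        "OFR": row["AskPrice"],
        "OFRSIZ": row["AskSize"],
        "SYMBOL": row["Symbol"],
    }


# Rows where the two sides of the quote come from different exchanges.
mismatched = [r for r in sip_rows if r["AskEx"] != r["BidEx"]]

# Build EX from either side, then apply the same single-exchange filter.
by_bid = [to_single_ex(r, "bid") for r in sip_rows]
by_ask = [to_single_ex(r, "ask") for r in sip_rows]
kept_via_bid = [q for q in by_bid if q["EX"] == "P"]
kept_via_ask = [q for q in by_ask if q["EX"] == "P"]

print(len(mismatched), len(kept_via_bid), len(kept_via_ask))
```

In these five rows the two sides disagree on four quotes, and the same filter keeps three rows when EX is taken from BidEx but only one when it is taken from AskEx.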

Thanks for your help Team!

onnokleen commented 2 years ago

Hi stucash, I don't know your data provider, so I can't give detailed feedback. Depending on your problem at hand, it might be better to focus on only one exchange anyway. But your mileage may vary depending on your application.
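One possible reading of "focus on only one exchange" for data with two exchange columns is to keep only quotes whose bid and ask both come from the chosen exchange. This is an assumption about how to apply the advice, not something the package prescribes; the helper `same_venue` is hypothetical, and the rows are (Symbol, AskEx, BidEx) triples taken from the SIP sample earlier in the thread.

```python
from collections import Counter

# (Symbol, AskEx, BidEx) for a few rows of the SIP sample quoted earlier.
quotes = [
    ("GS", "K", "P"),
    ("GS", "K", "K"),
    ("GS", "P", "K"),
    ("BAC", "N", "N"),
    ("BAC", "N", "N"),
]


def same_venue(rows, exchange):
    """Keep quotes whose ask and bid are both posted on `exchange` (hypothetical helper)."""
    return [r for r in rows if r[1] == exchange and r[2] == exchange]


# How often each (AskEx, BidEx) combination occurs; useful for seeing
# how much data a single-exchange restriction would discard.
pair_counts = Counter((ask_ex, bid_ex) for _, ask_ex, bid_ex in quotes)

n_only = same_venue(quotes, "N")
k_only = same_venue(quotes, "K")
print(pair_counts.most_common(), len(n_only), len(k_only))
```

Counting the pairs first shows how restrictive this choice is: any quote with mixed venues is dropped regardless of which exchange is selected.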

onnokleen commented 2 years ago

@kboudt: I think you can close this issue because this particular question of data cleaning is more related to how @stucash's data is constructed.

stucash commented 2 years ago

@onnokleen Thanks for getting back to me; is Polygon.io the data provider you don't know? The Securities Information Processor (SIP) is the data distributor used across the US markets (or exchanges). A list of exchanges that use the SIP can be found here: https://www.ctaplan.com/index

And Polygon.io gives SIP data directly to the user.

I understand that I'll eventually have to settle on one exchange, but my problem is that I don't know which exchange column to pick (AskExchange or BidExchange).

I also read the paper dedicated to highfrequency, and I can see that the data used there came from Wharton Research Data Services (WRDS); does that mean the tick data from WRDS should resemble the raw tick data from the exchanges (and that it is SIP data as well)?

Could you share some reliable data sources that have the same data structure as WRDS tick data? Or is there a heuristic I could use to pick the exchange for quotes data? At the end of the day, what I want is to use the package to run analyses in a reliable way, and if I can't clean the data properly, it's probably a no-go :(.

Besides the exchange question, I noticed that sampleTDataRaw has a column called CORR, which, according to your paper, is the correlation indicator (it appears to take the values 0/1). Correct me if I am wrong, but this column is fairly unlikely to show up in the raw data of many data feeds; my gut feeling is that we have to generate this column ourselves?

Thanks a lot!

onnokleen commented 1 year ago

CORR stands for corrected trades. Hence, if you want to include all trades, then just add a CORR column with ones in it.

onnokleen commented 1 year ago

@kboudt Can you close this issue?

jonathancornelissen / highfrequency

quotesCleanUp to deal with quotes data from SIP source #90