
Nanosecond timestamps on Websockets #41

Closed: joshlang closed this issue 2 years ago

joshlang commented 4 years ago

The REST API for historic trades provides timestamps with nanosecond precision - https://polygon.io/docs/#get_v2_ticks_stocks_trades__ticker___date__anchor

The Websocket API for live trades provides timestamps with only millisecond precision - https://polygon.io/sockets

It complicates some scenarios a bit. We can work around the issues, but why not just provide the live stream with nanosecond precision anyway?

The main scenario requiring workarounds in our case is when feeding a stream of historical/live data into an algorithm. Sometimes, only millisecond data exists (from the websocket feed). When restoring a lost connection, we need to request historical data for the time we've been disconnected. It comes from the REST API. This requires that we multiply the timestamp by 1,000,000 for the request. It also requires that we do extra comparisons and maintain extra state in order to filter out duplicates (trades within the same millisecond).
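
For illustration, a rough sketch of that workaround in Python (the helper and parameter names here are hypothetical, not Polygon's client API): convert the last millisecond timestamp from the websocket to nanoseconds for the REST query, then drop trades in the boundary millisecond whose trade IDs were already seen.

    def backfill_after_disconnect(last_ws_ts_ms, seen_trade_ids, fetch_historic_trades):
        # The websocket gave milliseconds; the historic trades REST endpoint wants nanoseconds.
        start_ns = last_ws_ts_ms * 1_000_000
        for trade in fetch_historic_trades(timestamp_gte=start_ns):
            # Trades in the boundary millisecond may already have arrived over the
            # websocket before the disconnect, so filter them out by trade ID ("i").
            in_boundary_ms = trade["t"] // 1_000_000 == last_ws_ts_ms
            if in_boundary_ms and trade["i"] in seen_trade_ids:
                continue
            yield trade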

pps83 commented 4 years ago

I also don't get this quirk with the Polygon API: why are the live and historical market data so disconnected, with barely anything in common? Why not make them as identical as possible? Using historical data for backtesting should behave the same as using live data.

dbregman commented 4 years ago

Not a polygon employee, just a user of the feed. I would be against this suggestion. My thoughts:

joshlang commented 4 years ago

You’re right in what you say. But from another perspective:

• An alternate stream could be opted into, so old code isn’t broken

• The value of nanoseconds is only as questionable for live trades as it is for historical ones

• Filtering trades isn’t done because they’re in the same millisecond. Filtering is needed because if I lose the connection at a certain millisecond, I can’t request “the next trade”. For example, suppose 5 trades occur within a single millisecond and I’m disconnected after the first. I can retrieve historical data from the next millisecond (missing 4 trades) or from the current millisecond (getting 5 trades, one of which is a duplicate from before I lost the connection).

Anyway, we have workarounds and the data we get is just fine - it’s just an enhancement suggestion :)

degree73 commented 4 years ago

Nanosecond timestamps would be huge for me because they allow correlation between the live feed and the historical feed. This is especially important since the live feed does not necessarily send events in the same sequence as the historical feed. Depending on the ticker, many events can occur within the same millisecond with no way to know the sequence. As algorithms become more and more prevalent, this problem will only grow. I have the resend/duplicate filter challenge as well.

I have also noticed that the millisecond/nanosecond discrepancy has made it more difficult for the Polygon.io team to troubleshoot when issues have come up in the past.

I would be fine if a new version of the stream is created for nanosecond timestamps or if the existing stream is updated. Either way it wouldn't be much hassle for me.

dbregman commented 4 years ago

The live feed often has latencies well over 100ms, so it is completely unsuitable for any kind of high frequency applications to begin with. There is no way that the ordering of events within 1ms has any practical relevance to a consumer of this feed.

The use case of patching lost data from the historical API does not require or benefit from nanosecond timestamps. It's very simple: after a disconnection, just replace all records in the time interval where you think data was dropped with data from the historic API.
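
A minimal sketch of that interval-replacement approach, assuming trades carry a millisecond "t" timestamp as in the websocket messages quoted later in this thread (function and variable names are made up):

    def patch_gap(buffered_trades, gap_start_ms, gap_end_ms, historic_trades):
        # Throw away everything buffered inside the suspect window...
        kept = [t for t in buffered_trades if not (gap_start_ms <= t["t"] <= gap_end_ms)]
        # ...and substitute whatever the historic REST API returns for that window.
        patched = kept + list(historic_trades)
        patched.sort(key=lambda t: t["t"])
        return patched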

joshlang commented 4 years ago

The live feed often has latencies well over 100ms, so it is completely unsuitable for any kind of high frequency applications to begin with. There is no way that the ordering of events within 1ms has any practical relevance to a consumer of this feed.

I'm sorry to hear you cannot find a use for the data.

The use case of patching lost data from the historical API does not require or benefit from nanosecond timestamps. It's very simple: after a disconnection, just replace all records in the time interval where you think data was dropped with data from the historic API.

Yes, it's what we do. The process of doing this would be simplified were this enhancement to be implemented.

degree73 commented 4 years ago

@dbregman, you clearly haven't had the same issues I've had working with Polygon.io to get the live feed as accurate as possible. It's really difficult to correlate discrepancies when there is no correlation mechanism. Having nanosecond timestamps would have saved weeks of effort and resulted in fixes much sooner (which benefits everyone). In addition, are you now the sole judge of what has practical relevance?

pps83 commented 4 years ago
  • the feed cannot be changed without breaking client code that depends on the timestamp field being milliseconds.

An alternative API v2 feed could be added. Obviously, the old feed should be left as-is.

qrpike commented 4 years ago

We are currently exploring this as an option. One of the factors is bandwidth. The desired end goal is to have the streaming data models be identical to the historical data models. With the newer data models there are up to 3 nanosecond-precision timestamps (SIP timestamp, Exchange timestamp, and TRF timestamp), which would greatly increase bandwidth usage, since compressing nanosecond timestamps gives little size savings.
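
For a rough sense of the size difference, here is a small Python comparison (sample messages only, not Polygon's actual schema):

    import json, zlib

    # One millisecond timestamp versus three nanosecond timestamps on the same trade.
    one_ms = {"ev": "T", "sym": "AAPL", "p": 223.001, "s": 100, "t": 1547787608999}
    three_ns = dict(one_ms, t=1547787608999125800,
                    y=1547787608999125800, f=1547787608999125800)

    for label, msg in (("1 ms timestamp", one_ms), ("3 ns timestamps", three_ns)):
        raw = json.dumps(msg, separators=(",", ":")).encode()
        print(label, len(raw), "bytes raw,", len(zlib.compress(raw)), "bytes compressed")
    # The low digits of nanosecond timestamps are effectively random, so generic
    # compression recovers little of the extra size.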

We definitely want to make this an option, and we will. Figuring out a way to do this (either via protobufs, a lower-level TCP interface, or simply adding more ISPs to our infrastructure) is where we currently are.

If you have any thoughts on this, please do comment.

joshlang commented 4 years ago

For example, here's a trade message with all three nanosecond timestamps, comparing the JSON size to a simple binary layout:
    {
      "T": "AAPL",
      "t": 1547787608999125800,
      "y": 1547787608999125800,
      "f": 1547787608999125800,
      "q": 23547,
      "i": "00MGON",
      "x": 11,
      "s": 100,
      "c": [ 1, 2, 3 ],
      "p": 223.001,
      "z": 1
    }

JSON ~ 145-150 bytes

Ticker = (AAPL) - 4 bytes + 1 for length
3 Timestamps = 8 bytes x 3  
Sequence = 4 bytes (int32)
Trade ID = (00MGON) - 6 bytes + 1 for length
Exchange ID = 1 byte
Size = 4 bytes (int32)
Conditions = 3 bytes + 1 for length
Price = ...well, maybe string? "223.001" = 7 bytes + 1 for length
Tape = 1 byte

binary = 58 bytes or so, and less processing power to encode/decode. On the downside, custom serializers are needed (which are definitely no problem on our end). We're considering storing data like this anyway, for speedier processing and more efficient storage, so we'd be thrilled to get data in this format :D
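
A sketch of that layout using Python's struct module (the field order and length-prefixed string encoding are assumptions for illustration, not an actual Polygon wire format):

    import struct

    def encode_trade(ticker, ts_sip, ts_exch, ts_trf, seq, trade_id,
                     exchange, size, conditions, price, tape):
        out = bytearray()
        t = ticker.encode()
        out += struct.pack("B", len(t)) + t                           # ticker: length + bytes
        out += struct.pack("<QQQ", ts_sip, ts_exch, ts_trf)           # 3 x 64-bit ns timestamps
        out += struct.pack("<I", seq)                                 # sequence: int32
        i = trade_id.encode()
        out += struct.pack("B", len(i)) + i                           # trade ID: length + bytes
        out += struct.pack("B", exchange)                             # exchange ID: 1 byte
        out += struct.pack("<I", size)                                # size: int32
        out += struct.pack("B", len(conditions)) + bytes(conditions)  # conditions: length + 1 byte each
        p = price.encode()
        out += struct.pack("B", len(p)) + p                           # price kept as a string
        out += struct.pack("B", tape)                                 # tape: 1 byte
        return bytes(out)

    msg = encode_trade("AAPL", 1547787608999125800, 1547787608999125800,
                       1547787608999125800, 23547, "00MGON", 11, 100,
                       [1, 2, 3], "223.001", 1)
    print(len(msg))  # 58 bytes, versus ~145-150 bytes for the JSON above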

joshlang commented 4 years ago

FYI: We allocate string pools so that we don't have 10 million copies of the ticker "AAPL" in memory, which is what would happen with default deserialization (or protobufs). We work around it with custom JSON deserializers so that we don't do memory allocations, but I'm not sure protobuf has that option. Anyway, just something to keep in mind if you're considering efficiencies.

Personally we're also fine with a larger JSON payload.
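
A minimal sketch of the pooling idea in Python (the comment above concerns a .NET custom deserializer; here a plain dict stands in for the string pool):

    import json

    _ticker_pool = {}

    def pooled(s):
        # Return one canonical string object per distinct ticker.
        return _ticker_pool.setdefault(s, s)

    def parse_trade(raw):
        trade = json.loads(raw)
        trade["sym"] = pooled(trade["sym"])
        return trade

    a = parse_trade('{"ev":"T","sym":"AAPL","p":223.001,"s":100,"t":1643130166199}')
    b = parse_trade('{"ev":"T","sym":"AAPL","p":223.05,"s":200,"t":1643130166200}')
    assert a["sym"] is b["sym"]  # both trades share a single "AAPL" string object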

pps83 commented 4 years ago

We are currently exploring this as an option. One of the factors is bandwidth. ... If you have any thoughts on this, please do comment.

@qrpike some quick thoughts below.

A custom number compression scheme could give perhaps a 10x size improvement over anything json+zip, and would also be around 10x faster. I'd even use such a compression scheme to store the ticks in your own server-side databases. It would require a lot of complex coding, though.

For example, if you have 50K tick structs, each with N fields, for SPY for a day, you can "unwrap" them into N flat arrays of 50K entries each (one array per field, with entries of the same type). Then compress each flat array with code suited to its entry type. For example, an array of 50K 64-bit timestamps should be compressed with TurboPFor's p4nenc64/p4nzenc64 64-bit number compressor (possibly with zigzag coding). If the timestamps are increasing (they should be), you may as well delta-code them from the 0th element. You'd be shocked how much better and faster it compresses, and how much faster you'll be able to read chunks from your tick store thanks to the smaller stored size. This is all C/C++, though; you'd need custom modules/code to convert it to a readable format for other languages (e.g. to convert the custom storage data to JSON or some other format after transmission). Even though client-side decompression and format conversion would be required, it will still be MUCH faster: if you use specialized compressors tuned for decompression, the data decompresses at basically the speed of memcpy.
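
A rough illustration of the columnar delta idea in Python, with zlib standing in for a real integer codec like TurboPFor (which is C and far faster and tighter); the timestamps are synthetic:

    import json, random, struct, zlib

    random.seed(0)
    base_ns = 1_643_130_166_199_000_000
    timestamps, t = [], base_ns
    for _ in range(50_000):
        t += random.randint(0, 2_000_000)        # mostly-increasing nanosecond timestamps
        timestamps.append(t)

    as_json = json.dumps(timestamps).encode()                              # row-style baseline
    as_column = b"".join(struct.pack("<Q", x) for x in timestamps)         # flat 64-bit column
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    as_deltas = b"".join(struct.pack("<Q", d) for d in deltas)             # delta-coded column

    for label, blob in (("json", as_json), ("column", as_column), ("deltas", as_deltas)):
        print(label, len(blob), "->", len(zlib.compress(blob)))
    # The deltas are small numbers, so they compress far better than raw 19-digit values.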

I don't think protobuf is good for the job, although they have client code for most languages.

gms008 commented 4 years ago

I would like to +1 the request for nanosecond timestamps in the live feed (or better yet, a separate binary feed). We are currently on the Polygon trial because we were interested in getting raw quote data. Currently using TDA (aggregated) and IB (raw data for some symbols but it isn't great).

We spent some time looking over the historical data and were pleased, so we began getting data over websockets and were very surprised to find that the timestamps were different. Why would a live stream carry data that differs from its historical version? As another user mentioned above, it doesn't make sense to use different data for live and historical modes. @qrpike you mention bandwidth concerns, but at the same time this is a JSON feed. I also don't see how the fact that there are three timestamps is a blocker to making the single existing timestamp a nanosecond timestamp. Is the existing millisecond timestamp not one of the three timestamps in the historical data? Sure, at this point maybe the single timestamp can't be changed without breaking things for people, but why was it like this in the first place? This issue isn't a deal-breaker for us, but it just seems very strange.

An aside on the JSON feed: I'm not familiar with the typical Polygon user, but as someone who has worked at multiple HFTs this is the first time I've had to deal with a JSON feed, and I don't really understand the design decision behind it. I would guess that most people with at least some experience in HFT or quantitative trading would be comfortable working with a binary protocol. I get that there are probably many users who are new to the space and more comfortable with JSON, but why not just provide a parsing library on top of a binary feed, or offer more than one protocol, since it's over websockets anyway? After all, you are getting binary feeds from the SIPs, so inflating them to JSON seems strange if bandwidth is a concern. I'm new to Polygon and generally pleased with it so far, so I would appreciate some enlightenment if I am completely off base here.

Edit: I have a similar opinion to Josh's above - at the end of the day I'm fine with larger JSON messages or a binary protocol.

RobinhoodFR commented 4 years ago

+1 and +1

RobinhoodFR commented 3 years ago

@qrpike Hello!

Almost one year later, could you tell us what has been done on this important topic?

Many thanks++

gms008 commented 2 years ago

@qrpike can you please provide an update here? It still doesn't make sense that your live and historical feeds would have different timestamps. Can you please explain why you guys have made this decision? And at least please answer the above question "Is the existing millisecond timestamp not one of the three timestamps in the historical data?". I haven't been able to find an answer for this.

gms008 commented 2 years ago

@qrpike @jrbell19 bump

gms008 commented 2 years ago

@qrpike @jrbell19

And at least please answer the above question "Is the existing millisecond timestamp not one of the three timestamps in the historical data?". I haven't been able to find an answer for this.

jrbell19 commented 2 years ago

Hi @gms008 Apologies for the radio silence.

Are you referring to the single nanosecond timestamp that is streamed via websocket? If so, yes: this is the SIP timestamp that is retrieved through the historic ticks endpoint.

Let me know if I'm misunderstanding the question.

gms008 commented 2 years ago

@jrbell19

Are you referring to the single nanosecond timestamp that is streamed via websocket?

What nanosecond timestamp? The websocket is giving me a millisecond timestamp: [{"ev":"T","sym":"ALK","i":"52983544040775","x":10,"p":51.6,"s":100,"t":1643130166199,"q":1004533,"z":1}]

Your documentation also says it is a millisecond timestamp: https://polygon.io/docs/stocks/ws_stocks_t

Am I completely missing something here? Wasn't this entire thread created because the streamed websocket timestamp is a millisecond timestamp and not a nanosecond timestamp?
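
For what it's worth, a quick digit-count check on the message above (assuming Unix-epoch timestamps):

    # 13 digits -> milliseconds since the epoch; 19 digits -> nanoseconds.
    print(len(str(1643130166199)))         # 13: the websocket "t" value above
    print(len(str(1547787608999125800)))   # 19: a nanosecond SIP timestamp from the REST API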

jrbell19 commented 2 years ago

@gms008 You're correct. That was my mistake, apologies for the confusion.

gms008 commented 2 years ago

@jrbell19 @qrpike Can you please provide an update on this situation? What is being worked on, and what are the blockers? I think everyone would agree that a nanosecond timestamp makes far more sense than a millisecond timestamp, and nobody from your team has given a good reason why this change can't be made. @qrpike mentioned that the inclusion of 3 nanosecond timestamps would "greatly increase bandwidth", but I don't think anyone on this page is asking for 3 nanosecond timestamps. We're asking for the single SIP timestamp, which is already a nanosecond timestamp on the historical feed, to also be a nanosecond timestamp on the live feed. I really don't think this is too much to ask, and I think we deserve a proper update after all this time.

Having worked with market data feeds from exchanges around the globe I can say it really isn't that uncommon for exchanges to do market protocol updates. An update changing a timestamp's precision would be considered relatively minor. Or, just create a v2 protocol. There are ways to get this done, and it would only benefit you guys to get it done sooner rather than later. I'm happy to take this offline if that would speed things up. The progress here has been really slow.

joshlang commented 2 years ago

abandoning.