iris-edu / mseed3-evaluation

A repository for technical evaluation and implementation of potential next generation miniSEED formats

Low latency transmission #16

Closed crotwell closed 7 years ago

crotwell commented 7 years ago

The Header/Footer separation is intended to make it more efficient to transmit low-latency mseed data, for example for early warning, where the time to fill a 512-byte or even smaller record is too large for the latency needs of the system.

I am not convinced that simply moving the number of samples, CRC and extras to the footer really handles the needs of low-latency users, and I suspect that even trimmed-down mseed3 records may still be too heavy for early-warning-type needs. In other words, using the footer is fine if it actually solves the low-latency needs and is used. But if it doesn't solve the issue and ends up not being used, it is added complexity with no gain for the rest of the users.

I do not have a good feel for this, so I am just creating this issue so we can track any experimentation or ideas on whether the footer is useful.

andres-h commented 7 years ago

The main problem is Steim2, which puts the reverse integration constant at the beginning. We need a better encoding, such that at least individual 64-byte frames can be sent.
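A minimal sketch of that dependency, assuming Steim's usual frame-0 layout (the function name is illustrative; this is not a real codec):

```python
# Why Steim2 blocks streaming: the first 64-byte frame embeds BOTH the
# record's first sample (forward integration constant) and its LAST sample
# (reverse integration constant), so frame 0 cannot be finalized until the
# whole record has been compressed.
import struct

def steim_frame0_constants(first_sample: int, last_sample: int) -> bytes:
    """Words 1 and 2 of Steim frame 0 as big-endian 32-bit integers."""
    return struct.pack(">ii", first_sample, last_sample)

samples = [-1, 2, 4, -3, -1025]
# Emitting frame 0 already requires samples[-1] -- unavailable mid-stream:
words_1_2 = steim_frame0_constants(samples[0], samples[-1])
```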

In that case even a 4k record size would not be a problem -- you just send the headers and then keep sending the frames embedded in transient packets.

You could suggest a completely different protocol for transmission, but I like the idea that the data is checksummed in the digitizer and that I can check the integrity in the archive at any time later. Organizations like CTBTO may even need a SHA hash to make sure that the data has not been tampered with.
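A sketch of that integrity model (the record bytes are a placeholder, and `zlib.crc32` merely stands in for whichever CRC the spec actually mandates):

```python
import hashlib
import zlib

record = b"...serialized mseed3 record..."  # placeholder bytes

crc = zlib.crc32(record) & 0xFFFFFFFF        # computed once, in the digitizer
digest = hashlib.sha256(record).hexdigest()  # tamper evidence (CTBTO-style)

# Any time later, the archive recomputes both over the stored bytes and
# compares them with the values carried alongside the record.
```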

"Footer" is an irregularity that I don't like. It's just another chunk...

crotwell commented 7 years ago

> Section 5: Definition of transient data payload sub-header
>
> Is this really necessary? This should IMHO be delegated to some lower-level protocol. TCP can do multiplexing, as can HTTP/2, and who knows what will be out in 10 years.

@krischer Did you mean a separate socket per channel, or multiplexing over a single socket? Do you have any links on how this would work over a single socket, if that is what you meant? Would this work if the logger used UDP for data transmission? It is relatively easy if an entire record is sent at once and fits into a single UDP packet, but once a record is split it becomes complex.
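One way single-socket multiplexing could work, sketched with a made-up framing (channel id plus length prefix; none of this is from any spec):

```python
import struct

def pack_frame(channel_id: int, payload: bytes) -> bytes:
    # 2-byte channel id, 4-byte payload length, then the payload itself
    return struct.pack(">HI", channel_id, len(payload)) + payload

def unpack_frames(buf: bytes):
    """Yield (channel_id, payload) from a stream buffer; stop at a partial frame."""
    offset = 0
    while offset + 6 <= len(buf):
        channel_id, length = struct.unpack_from(">HI", buf, offset)
        if offset + 6 + length > len(buf):
            break  # incomplete frame; wait for more bytes to arrive
        yield channel_id, buf[offset + 6 : offset + 6 + length]
        offset += 6 + length
```

Over UDP this only stays simple while one frame fits in one datagram; once a record is split across datagrams, reassembly and loss handling make it complex, as noted above.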

@andres-h Does the checksum and/or hash need to be per mseed record, or on each subsegment as it is sent?

Within the "footer" system, how does a receiving system make use of a partially transmitted record? If the data is compressed and the receiver does not know the final number of samples, would it be able to decompress the partial data and tell the difference between a data value that is zero because there will be no more data in the record and a data value that is zero because the sample at that time was actually zero?

More detailed example: say a channel generates a count sequence of -1 2 4 -3 -1025 but, for low latency, wants to send the samples as they come in. In the existing system, it has to pick a byte size in advance (header field 9). Let's say it picked a value that would hold 5 more samples as long as they compress nicely. It sends -1, 2, 4, -3. Now it wants to send the -1025, but -1025 is too big and doesn't fit in the remaining bytes. So it must fill out the compression frame with zeros and then send the footer with the number of samples so the receiver knows when to stop decompressing. But there is a period of time after the zeros are sent but before the "footer" arrives when the receiver can't tell if a zero is a real value or just padding to fill the compression frame. The data could just as well really have been -1 2 4 -3 0, so the receiver has to decide how to deal with partial data based not only on the header information but also on the actual data values. Not impossible, but it seems complex and painful.
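A sketch of the receiver's dilemma (plain integers stand in for decompressed Steim samples; `decode_partial` is hypothetical):

```python
def decode_partial(decoded, sample_count=None):
    """sample_count comes from the footer, which may not have arrived yet."""
    if sample_count is None:
        # -1 2 4 -3 0 0 0: real zeros, or frame padding? The data alone
        # cannot say, so the partial record is uninterpretable.
        return decoded, "ambiguous"
    return decoded[:sample_count], "exact"

print(decode_partial([-1, 2, 4, -3, 0, 0, 0]))     # (..., 'ambiguous')
print(decode_partial([-1, 2, 4, -3, 0, 0, 0], 4))  # ([-1, 2, 4, -3], 'exact')
```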

And I think this could happen even with non-compressed data if the logger is forced to close a record early (a clock jerk or a stop-acquisition signal). It has already sent the byte size of the data, but actually needs to send fewer samples than intended, so the variable-length data section must be padded with a non-value.

Maybe this can be made to work, but I would like to see a mock send/receive system put together to validate the ideas, as this type of partial-send scheme can be very tricky. I am wondering if some type of "transient" sub-payload may be needed to send a few samples along with their byte size and number of samples in a small package. Maybe the system sends an empty record to get the header stuff, then repeatedly sends something as simple as:
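(hypothetically, with illustrative field names and widths:)

```python
import struct

def pack_transient(num_samples: int, encoded: bytes) -> bytes:
    # byte size of the encoded data, number of samples, then the data itself
    return struct.pack(">HH", len(encoded), num_samples) + encoded
```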

which the receiver can then correctly append to the current record. More information would be needed if multiplexing is not taken care of by the underlying protocol. Also, currently no extra headers would arrive until the record is closed, so things like time problems or leap seconds might be delayed in arriving.

andres-h commented 7 years ago

On Wednesday 2017-07-05 21:09, Philip Crotwell wrote:

> @andres-h Does the checksum and/or hash need to be per mseed record, or on each subsegment as it is sent?

Record checksum would be per record. Maybe the transient packet also needs some checksum or hash, but that's another topic.

> And I think this could happen even with non-compressed data if the logger is forced to close a record early (a clock jerk or a stop-acquisition signal). It has already sent the byte size of the data, but actually needs to send fewer samples than intended, so the variable-length data section must be padded with a non-value.

Yeah, that's why I wrote in the other post that sub-chunk transfer might not be possible. The encoding must support it.

I think it would be enough if data is sent frame-by-frame (or chunk-by-chunk)...

chad-earthscope commented 7 years ago

With the significant differences between what is needed for very low-latency, lean data packets versus archive/non-realtime-exchange usage, I do not think we should design the format expecting it to be perfect for streaming; that is not the priority. Given that, should we accept any complexities added to the format to accommodate stream-ability? I would say yes; at least minimal complexity to avoid blocking stream-ability is worth it.

> ... More detailed example: say a channel generates a count sequence of -1 2 4 -3 -1025 but, for low latency, wants to send the samples as they come in. In the existing system, it has to pick a byte size in advance (header field 9). Let's say it picked a value that would hold 5 more samples as long as they compress nicely. It sends -1, 2, 4, -3. Now it wants to send the -1025, but -1025 is too big and doesn't fit in the remaining bytes. So it must fill out the compression frame with zeros and then send the footer with the number of samples so the receiver knows when to stop decompressing. But there is a period of time after the zeros are sent but before the "footer" arrives when the receiver can't tell if a zero is a real value or just padding to fill the compression frame. The data could just as well really have been -1 2 4 -3 0, so the receiver has to decide how to deal with partial data based not only on the header information but also on the actual data values. Not impossible, but it seems complex and painful.

Agreed, that is complex and painful. The problem originates with having to specify the total size of the data payload up front. That could be solved if we go with @andres-h's "chunk" concept for waveform data, like:
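(a hypothetical sketch; the designation codes and field widths are made up for illustration:)

```python
import struct

BLOCK_WAVEFORM = 0x01     # made-up designation codes
BLOCK_TERMINATION = 0xFF

def pack_block(designation: int, payload: bytes) -> bytes:
    # designation byte + payload length + payload: each block is
    # self-describing, so the total payload size is never declared up front
    return struct.pack(">BH", designation, len(payload)) + payload

stream = (pack_block(BLOCK_WAVEFORM, b"\x01\x02")    # frames sent as ready
          + pack_block(BLOCK_WAVEFORM, b"\x03\x04")
          + pack_block(BLOCK_TERMINATION, b""))      # closes the record
```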

This adds complexity for the sake of streaming, and some inefficiency for the archiving/exchange usage due to the extra block designation and length bytes.

> Also, currently no extra headers would arrive until the record is closed, so things like time problems or leap seconds might be delayed in arriving.

This and the other details you've brought up are good illustrations of the subtle complexities of trying to make a streaming protocol, and I'm afraid that if we try to address all of them it will be at the cost of the archive/exchange usability.

crotwell commented 7 years ago

> Given that, should we accept any complexities added to the format to accommodate stream-ability? I would say yes; at least minimal complexity to avoid blocking stream-ability is worth it.

I am in favor, but I think we need to make sure the complexities we add actually help stream-ability. I worry that, unless we are very careful, what we choose might add to the archive complexity yet be insufficient, and hence unused, in the streaming case.

> This and the other details you've brought up are good illustrations of the subtle complexities of trying to make a streaming protocol, and I'm afraid that if we try to address all of them it will be at the cost of the archive/exchange usability.

My fear exactly. So this needs hard and careful thought. The safest choice may be to say that streaming should be a separate proposal with separate, but maybe related, data structures. The streaming objects/protocol would be set up in a way that the receiver can easily construct mseed records, but maybe records would not be directly sent over the wire. Just a thought.

chad-earthscope commented 7 years ago

"Footer" is an irregularity that I don't like. It's just another chunk...

Fair enough. For what it's worth, I really dislike the word "chunk"; its use is almost always colloquial in English. Can we agree on "block" for these things?

andres-h commented 7 years ago

On 07/06/2017 01:55 AM, Chad Trabant wrote:

> For what it's worth, I really dislike the word "chunk"; its use is almost always colloquial in English. Can we agree on "block" for these things?

OK with me. "chunk" is just a placeholder anyway.

"block" is still rather generic, though. I wish we had a specific word for SEED. Calling it "blockette" would be fine with me too.

crotwell commented 7 years ago

Thought of two more potential problems that I just want to document. Both are due to the total number of bytes being sent early.

1) The logger needs to insert an extra header, but doesn't have room. Imagine the logger has set the number of bytes in the record and has sent most of them upstream when it needs to insert an event detection or time exception, but there are not enough bytes remaining for the extra header to fit.

2) The logger has sent some samples, but something happens that stops data acquisition. It can end the data section and set the number of samples in the footer with extra headers, but it would need to pad the output to reach the declared number of bytes. How would the padding be done so that the receiver would interpret it correctly?

chad-earthscope commented 7 years ago

> Thought of two more potential problems that I just want to document. Both are due to the total number of bytes being sent early.

In the 20170622 draft it is only the data payload length that must be set up front, not the total record length...

> The logger needs to insert an extra header, but doesn't have room. Imagine the logger has set the number of bytes in the record and has sent most of them upstream when it needs to insert an event detection or time exception, but there are not enough bytes remaining for the extra header to fit.

... and the extra headers are in the "termination blockette", whose length is not known until it arrives, so the above is not a problem ...

> The logger has sent some samples, but something happens that stops data acquisition. It can end the data section and set the number of samples in the footer with extra headers, but it would need to pad the output to reach the declared number of bytes. How would the padding be done so that the receiver would interpret it correctly?

... but this one is still a problem; padding would be needed, and that's problematic, as you have illustrated.

More generally: I've gathered a number of changes to the first draft from our conversation and am ready to make another, and I think we can address the above.

krischer commented 7 years ago

I don't have time to answer in detail - but why is even the current MiniSEED not suitable for low-latency things? It could just be fairly short records that are sent one or more times per second - each record would still be below the usual MTU size, so the network latency could not be decreased any further either by the streaming proposal or by the footer.

chad-earthscope commented 7 years ago

> I don't have time to answer in detail - but why is even the current MiniSEED not suitable for low-latency things?

The current issue is that the record length is fixed and relatively big.

> It could just be fairly short records that are sent one or more times per second - each record would still be below the usual MTU size, so the network latency could not be decreased any further either by the streaming proposal or by the footer.

Variable record length is probably the most important change for low-latency applications because you can make small records. While it would be workable to send a record once per second, assuming you could spare the bandwidth, that would be really poor data to archive, as you could easily end up with more header than data. The streaming concept that I believe we are driving towards would allow larger records that could be sent incrementally, so suitable for streaming and (hopefully) efficient enough for archiving.

crotwell commented 7 years ago

I am not sure, as I have not done any low-latency work, but I think the issue is that you have two choices. Either use small records (256 or 128 bytes) and fill them, in which case your latency comes from waiting to get enough samples to fill the record.

Or, if you choose to send records that are not completely filled to get subsecond latency, then the overhead of the record header can overwhelm the data volume. Sending a tenth of a second of data at 100 sps means only 10 samples, so likely 40 header bytes for only ~10 data bytes, a bloat factor of 5. That can be a problem if you are already close to a bandwidth limit.
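The arithmetic, spelled out (the header size and compression ratio are rough assumptions from the example above):

```python
sps = 100
window_s = 0.1
header_bytes = 40                # assumed fixed header size
samples = int(sps * window_s)    # 10 samples per record
data_bytes = samples * 1         # ~1 byte/sample with Steim-like compression
bloat = (header_bytes + data_bytes) / data_bytes
print(bloat)                     # 5.0 -- headers dominate the stream
```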

The biggest latency is not on the network, but rather in the data collection.

chad-earthscope commented 7 years ago

In draft 20170708, records may be variable length, and data blocks may be variable length and transmitted as they are generated. These characteristics overcome the major issues with streaming low-latency data.