iris-edu / mseed3-evaluation

A repository for technical evaluation and implementation of potential next generation miniSEED formats

Maximum Record Length #13

krischer opened this issue 7 years ago

krischer commented 7 years ago

Discussion branched off #2. Concerns DRAFT20170622.

@crotwell

Field 9, consider UINT32. It is really nice for processing data to be able to store a long continuous time series as a single record like SAC and 65K is kind of small for that. I have no problem with a recommendation that data loggers only generate small (~512 or 4096) or data centers choose a maximum for acceptance or internal storage. The header allows UINT32 samples but not enough bytes to put them in.
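
For a rough sense of the scale involved, a quick sketch (the 2 bytes per compressed sample and the 200 Hz rate are illustrative assumptions, not anything fixed by the format):

# Rough scale of what each record length field width allows. The 2 bytes per
# compressed sample and the 200 Hz rate are assumptions for illustration only.
BYTES_PER_SAMPLE = 2
SAMPLING_RATE_HZ = 200

for field_bits in (16, 32):
    max_record_bytes = 2 ** field_bits - 1
    max_samples = max_record_bytes // BYTES_PER_SAMPLE
    duration_days = max_samples / SAMPLING_RATE_HZ / 86400.0
    print(f"UINT{field_bits}: {max_record_bytes} bytes per record, "
          f"~{max_samples} samples, ~{duration_days:.3f} days at 200 Hz")
# UINT16: ~32767 samples, ~0.002 days; UINT32: ~2147483647 samples, ~124 days.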

chad-earthscope commented 7 years ago

A while back @krischer wrote a nice paragraph describing the downsides of huge records. I can't find it at the moment, perhaps he can dig it up, it was a good summary of reasons why not to do huge records. Dave Ketchum mentioned the problem of sample time drift over long arrays, but Lion had other points.

I do not think we are trying to compete with SAC; the goals are very different. My 2 cents.

Also to consider: if we blocked/chunked the data payload such that the header does not need to contain the length then, heck, there are no limits, you can make them as big as you want.
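
One way such a chunked payload could look, purely as a sketch (the 2-byte length prefix and the zero-length terminator are made up here, not taken from any draft): each block carries its own length, so a reader can walk the blocks without the header ever storing a total record length.

import io
import struct

def read_chunked_payload(fh):
    """Read a payload stored as length-prefixed blocks.

    Hypothetical layout: each block is a 2-byte little-endian length followed
    by that many bytes; a zero length terminates the payload.
    """
    chunks = []
    while True:
        (length,) = struct.unpack("<H", fh.read(2))
        if length == 0:          # terminator: no total length needed up front
            break
        chunks.append(fh.read(length))
    return b"".join(chunks)

# Example: three blocks followed by a terminator.
buf = io.BytesIO()
for chunk in (b"a" * 10, b"b" * 65535, b"c" * 7):
    buf.write(struct.pack("<H", len(chunk)) + chunk)
buf.write(struct.pack("<H", 0))
buf.seek(0)
print(len(read_chunked_payload(buf)))   # 65552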

crotwell commented 7 years ago

I feel the opposite: it is painful to have to deal with multiple file formats for saving seismic data. Everybody has to read miniseed, but then, once they have done their processing, save in a different format? All I am asking for here is 2 extra bytes in the record size header. I agree a datalogger or a datacenter should not do really long records for raw data, but it is very useful for the end user to save one big float or int array instead of being forced to break it up after you have already made the decision that the timing is good enough. The SAC file format is painful for lots of other reasons, and with these 2 extra bytes miniseed could quickly become the only file format processing systems need to support. That would be a real benefit to seismology, for the cost of only 2 measly bytes.

chad-earthscope commented 7 years ago

In draft 20170708 there is no hard limit on record length. There remains a 65k limit on an individual data block, but there is no limit on the number of blocks that can be included in a record.

krischer commented 7 years ago

it is very useful for the end user to save one big float or int array instead of being forced to break it up after you have already made the decision that the timing is good enough.

I can see that argument but I don't see why it is a big problem. Some library will perform the record split so it is invisible to users.
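
For example, a writing library could hide the split behind a single call. The sketch below uses made-up names and an assumed per-record limit of 65535 samples (not an existing API), just to show that the caller hands over one long array and never sees the individual records.

import numpy as np

MAX_SAMPLES_PER_RECORD = 2 ** 16 - 1   # assumed per-record limit

def write_records(data, write_record):
    """Split one long array into record-sized pieces behind the scenes.

    write_record stands in for whatever actually encodes and writes a single
    record; the caller only ever makes one call for the full array.
    """
    for start in range(0, len(data), MAX_SAMPLES_PER_RECORD):
        write_record(data[start:start + MAX_SAMPLES_PER_RECORD])

# The user hands over one big array; the split is invisible.
data = np.arange(200_000, dtype=np.float32)
written = []
write_records(data, written.append)
print(len(written), [len(r) for r in written])  # 4 [65535, 65535, 65535, 3395]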

Another thing to keep in mind is that very large records will make it a lot harder to split up MiniSEED files, and they will also make the checksum at the end more expensive to compute. Additionally, the checksum will be less meaningful: a single check is now performed over a potentially very large record, and a single bit flip will invalidate the whole record without any chance of figuring out where it went wrong.

Moreover, it becomes technically more challenging because the checksum calculation requires access to all of the data after it has been encoded. Since libraries cannot simply demand up to 4 GB of additional memory while writing, they would end up with an awkward flip-flop of writing everything to disc -> reading it back to calculate the checksum -> writing the checksum. This is IMHO a fairly realistic concern.
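
To make the mechanics concrete, here is a sketch of the reserve-and-patch variant of that dance for a seekable output; CRC32, the little-endian packing, and the checksum position are stand-ins rather than anything a draft specifies. For a non-seekable stream even this is not possible, and the write-to-disc / read-back approach described above is what remains.

import io
import struct
import zlib

def write_record_streaming(fh, encoded_chunks):
    """Write payload chunks, then go back and fill in the checksum.

    The checksum covers data that has already been written, so without
    buffering everything in memory the writer must reserve space, stream the
    payload while updating a running CRC, then seek back to patch the value.
    """
    crc_pos = fh.tell()
    fh.write(struct.pack("<I", 0))        # placeholder for the checksum
    crc = 0
    for chunk in encoded_chunks:          # stream without holding it all
        crc = zlib.crc32(chunk, crc)
        fh.write(chunk)
    end_pos = fh.tell()
    fh.seek(crc_pos)
    fh.write(struct.pack("<I", crc))      # patch the placeholder
    fh.seek(end_pos)

buf = io.BytesIO()
write_record_streaming(buf, (b"\x01\x02" * 1000 for _ in range(3)))
print(len(buf.getvalue()))                # 4 + 3 * 2000 = 6004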

A while back @krischer wrote a nice paragraph describing the downsides of huge records. I can't find it at the moment, perhaps he can dig it up, it was a good summary of reasons why not to do huge records. Dave Ketchum mentioned the problem of sample time drift over long arrays, but Lion had other points.

I also cannot find it right now, but a 32-bit floating point sampling rate is not accurate enough to correctly determine the times of later samples - this indeed only becomes important for fairly large sample counts, so I'm not sure it represents a problem in practice.

I found this script on my machine which demonstrates the problem - I guess I initially wrote it for some related discussion. While it is indeed a bit contrived, it is the equivalent of a 124 day recording at 200 Hz. The data format should IMHO not allow something that is wrong, and if MiniSEED allows something, people will definitely do it.

Three possible ways around this:

So while I can understand @crotwell's arguments, there are a lot of downsides to large records and I feel like they are not worth the trade-off. Also, MiniSEED 2 is already used as a processing format, so I don't see why that should no longer be the case with MiniSEED 3.

I'm reopening this for further discussion.

from decimal import Decimal as D
import numpy as np

starttime_in_ns = 124734934578
# Awkward sampling rate to force floating point errors.
sampling_rate_in_sec = 201.12345678
# Max number of samples for 4 bytes record length field. Assumes 2 bytes per
# sample which is very achievable with compression.
samples = 4294967295 // 2

endtime_in_ns_d = \
    D(starttime_in_ns) + D("1000000000") / D(sampling_rate_in_sec) * D(samples)
# Use a single precision sampling rate.
endtime_in_ns = \
    starttime_in_ns + 1000000000 / np.float32(sampling_rate_in_sec) * samples

print("Endtime in ns - accurate:      ", int(endtime_in_ns_d))
print("Endtime in ns - floating point:", int(endtime_in_ns))
diff = abs(int(endtime_in_ns_d) - int(endtime_in_ns))
print("Difference in ns:              ", diff)
print("Difference as a factor of dt:  ",
      (diff / 1E9) / (1.0 / sampling_rate_in_sec))

output:

Endtime in ns - accurate:       10677564757999798
Endtime in ns - floating point: 10677564647452358
Difference in ns:               110547440
Difference as a factor of dt:   22.233683270979643
crotwell commented 6 years ago

If we are going to limit the data size in a record to 16 bits (~64K), then there is no need to have a UINT32 number of samples. Basically, the arguments above about timing and large records are really about large numbers of samples, even if they compress really well.

I still like the idea of large single records, but if we are going to disallow them by limiting the record size to UINT16, then we should also limit the number of samples to UINT16. Or, flipping it around, if we allow UINT32 samples, we should allow UINT32 bytes to put them in.

chad-earthscope commented 6 years ago

The maximum number of samples in a 64-byte Steim2 frame is 105. There is a bit of overhead taking up a few more bytes depending on whether it is the first frame or not, but it's still more than 64 samples in a 64-byte frame. Since we can have more than 1 sample per byte, we need a sample count larger than the byte count.
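
The arithmetic behind those numbers, as a sketch assuming standard Steim2 framing (sixteen 32-bit words per 64-byte frame, word 0 holding the nibble codes, and the first frame of a record also spending two words on the X0 and Xn integration constants):

FRAME_BYTES = 64
WORDS_PER_FRAME = FRAME_BYTES // 4        # 16 32-bit words
MAX_SAMPLES_PER_WORD = 7                  # seven 4-bit differences, best case

# Later frames: one word of nibble codes, 15 data words.
later_frame = (WORDS_PER_FRAME - 1) * MAX_SAMPLES_PER_WORD
# First frame: nibble codes plus the X0 and Xn integration constants.
first_frame = (WORDS_PER_FRAME - 3) * MAX_SAMPLES_PER_WORD
print(later_frame, first_frame)           # 105 91

# Best case for a 65535-byte payload made of such frames:
frames = 65535 // FRAME_BYTES             # 1023 full frames
max_samples = first_frame + (frames - 1) * later_frame
print(frames, max_samples)                # 1023 107401, which exceeds 2**16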

crotwell commented 6 years ago

I still don't get it. You argued that records should not be too large due to sample drift, but now you are ok with a single record with a huge number of samples in it as long as they compress really well?

In other words, is allowing 2^17 samples to be packed into a 2^16 byte record really that much of a benefit over forcing them to be split into 2 records? Is it worth the extra 2 bytes that will be zeros >99.99% of the time? I feel this edge case of packing a maximally sized record to capacity with highly compressible data is not worth it.

I feel records should either be limited to be "small" in both senses, or should be allowed to be large in both senses.

chad-earthscope commented 6 years ago

I feel records should either be limited to be "small" in both senses, or should be allowed to be large in both senses.

My thinking was to provide a single limiter (length) instead of two (length or sample count), which is just a bit more complex and a (minor) wrinkle for record creators who try to create maximum-size records. The original Strawman had the 2-limiter issue and it was commented on. I do not feel strongly about this and would be fine with a UINT16, maximum 2^16 samples. Does anyone else have strong feelings?