read_data with mseed integrity check fails

tclements commented 4 years ago

Reading this mseed file prints integrity warnings:

RDMSEED: data integrity -- Steim-2 sequence #001194Q integrity check failed, last_data=2250.0, should be xn=2518.0

The file is written with Obspy. The first sample is the same with both readers:

SeisData with 1 channels (1 shown)
    ID: YA.SNE.00.HHZ                      
  NAME: YA.SNE.00.HHZ                      
   LOC: 0.0 N, 0.0 E, 0.0 m                
    FS: 100.0                              
  GAIN: 1.0                                
  RESP: a0 1.0, f0 1.0, 0z, 0p             
 UNITS:                                    
   SRC:                                    
  MISC: 0 entries                          
 NOTES: 0 entries                          
     T: 2009-12-28T10:56:04.968 (0 gaps)   
     X: +1.428e+04                         
        +1.420e+04                         
            ...                            
        -3.307e+03                         
        (nx = 4703504)                     
     C: 0 open, 0 total

S[1].x
4703504-element Array{Float32,1}:
 14278.0
 14199.0
 14089.0
     ⋮  
 -3663.0
 -3450.0
 -3307.0

1 Trace(s) in Stream:
YA.SNE.00.HHZ | 2009-12-28T10:56:04.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4703504 samples

st[0].data                                                                                                            
array([14278, 14426, 14453, ..., -3565, -3450, -3456], dtype=int32)

Converting the file to sac in obspy and then reading the sac file in SeisIO gives the same data as the mseed file read in obspy

S = read_data("/home/timclements/VOLC/SAC/YA.SNE.00.HHZ.2009-12-28.sac")
SeisData with 1 channels (1 shown)
    ID: YA.SNE..HHZ                        
  NAME:                                    
   LOC: 0.0 N, 0.0 E, 0.0 m                
    FS: 100.0                              
  GAIN: 1.0                                
  RESP: a0 1.0, f0 1.0, 0z, 0p             
 UNITS:                                    
   SRC: /home/timclements/VOLC/SAC/YA.SNE… 
  MISC: 0 entries                          
 NOTES: 1 entries                          
     T: 2009-12-28T10:56:04.968 (0 gaps)   
     X: +1.428e+04                         
        +1.443e+04                         
            ...                            
        -3.456e+03                         
        (nx = 4703504)                     
     C: 0 open, 0 total

julia> S[1].x
4703504-element Array{Float32,1}:
 14278.0
 14426.0
 14453.0
     ⋮  
 -3565.0
 -3450.0
 -3456.0

jpjones76 commented 4 years ago

I'll look into this, but can you tell me a little more about how the file was created? You say that it's mseed created in ObsPy; what endianness is the machine and what's the bswap situation look like?

I'm still working on changes to fix Julia's read slowdown in 1.3.0, so this might take some time before a fix goes live.

jpjones76 commented 4 years ago

Also, could you upload the converted SAC file?

tclements commented 4 years ago

I used obspy.get_waveforms to download the data into an obspy.Stream object, then write each station individually. Code as follows:

import obspy
from obspy.clients.fdsn.client import Client
import os 

client = Client("RESIF")
t1 = obspy.UTCDateTime("2009-09-16") 
t2 = obspy.UTCDateTime("2011-6-19")
out = "~/VOLC/MSEED/"

while t1 <= t2:
    try:
        st = client.get_waveforms("YA","*","00","HHZ",t1,t1+86400)
    except:
        st = []

    if len(st) != 0:
        sta = list(set([tr.stats.station for tr in st]))
        for s in sta:
            tr = st.select(station=s)
            filename = ".".join([tr[0].stats.network,
                                    tr[0].stats.station,
                                    tr[0].stats.location,
                                    tr[0].stats.channel,
                                    str(tr[0].stats.starttime.date),
                                    "mseed"]) 
            outfile = os.path.join(out,filename)
            print("Writing: ",filename)
            tr.write(outfile,format="MSEED")
    t1 += 86400

This writes Steim2 by default.

Here is lscpu on my laptop:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz

Here is the sac file

jpjones76 commented 4 years ago

Well, I found the problem. There is no good news. It's another "feature" in SEED that isn't properly documented, and unfortunately, I no longer have time in my life to hunt and peck with coding guesses until I read the SEED creators' minds correctly.

There's a "word order" flag (not properly documented) in Blockette [1000] that affects the order in which one should assign Steim-compressed differences as they're unpacked from a UInt32.

Unfortunately, when word_order == 0x00, I don't know the proper order to uncompress the integer values. The SEED manual has only this cryptic nonsense to say about it: "The byte swapping order for 16 bit and 32 bit words. A 0 indicates little endian order and a 1 indicates big endian word order. See fields 11 and 12 of blockette 50." Those fields in that blockette contain exactly the same information stated in more words. This is mystery meat, effectively an undocumented flag.

This doesn't appear to be as simple as an "if" loop that reverses the order of unpacking when word_order == 0x00. I've tried that route and the integrity check still fails.

Unfortunately, what that means for your group is that I can't support mini-SEED with Steim variables packed in littleendian "word order" until someone shows me exactly -- I mean exactly, not some cryptic hint at -- the correct unpack order. I've spent 6 hours already tonight trying to guess what this word-salad manual is getting at and barely got anywhere. If I can't figure this out in a couple more hours, someone else will have to.

Sincere apologies.

jpjones76 commented 4 years ago

For what it's worth, I suspect that word_order = 0x00 is very rare. This is the first time I've encountered it in ~20 years working with SEED format. I strongly suspect that the only way to even achieve that is to write to mini-SEED in ObsPy, because literally all other SEED that I've seen is bigendian.

tclements commented 4 years ago

Thanks for looking into this. Is this of any help - obspy has a record analyzer that shows the encoding format for blockette 1000.

python -m obspy.io.mseed.scripts.recordanalyzer ~/VOLC/MSEED/YA.SNE.00.HHZ.2009-12-28.mseed 
FILE: /home/timclements/VOLC/MSEED/YA.SNE.00.HHZ.2009-12-28.mseed
Record Number: 0
Record Offset: 0 byte
Header Endianness: Little Endian

FIXED SECTION OF DATA HEADER
        Sequence number: 1
        Data header/quality indicator: Q
        Station identifier code: SNE
        Location identifier: 00
        Channel identifier: HHZ
        Network code: YA
        Record start time: 2009-12-28T10:56:04.968300Z
        Number of samples: 3105
        Sample rate factor: 100
        Sample rate multiplier: 1
        Activity flags: 0
        I/O and clock flags: 0
        Data quality flags: 0
        Number of blockettes that follow: 1
        Time correction: 0
        Beginning of data: 64
        First blockette: 48

BLOCKETTES
        1000:   Encoding Format: 11
                Word Order: 0
                Data Record Length: 12

CALCULATED VALUES
        Corrected Starttime: 2009-12-28T10:56:04.968300Z

I'll look into how I'm creating these mseed files with obspy.

Maybe @chad-iris could be of help?

tclements commented 4 years ago

I think I found the problem. The call to st = client.get_waveforms("YA","*","00","HHZ",t1,t1+86400) returns a stream with 17 different station - channels:

st = client.get_waveforms("YA","*","00","HHZ",t1,t1+86400) 
st                                                   
17 Trace(s) in Stream:
YA.FJS.00.HHZ  | 2009-12-28T10:56:12.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4702704 samples
YA.FLR.00.HHZ  | 2009-12-28T10:56:17.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4702204 samples
YA.FOR.00.HHZ  | 2009-12-28T10:56:13.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4702604 samples
YA.RVL.00.HHZ  | 2009-12-28T10:56:16.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4702304 samples
YA.SNE.00.HHZ  | 2009-12-28T10:56:04.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4703504 samples
YA.UV01.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:25.930000Z | 100.0 Hz, 8642594 samples
YA.UV02.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:20.030000Z | 100.0 Hz, 8642004 samples
YA.UV03.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:18.870000Z | 100.0 Hz, 8641888 samples
YA.UV05.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:19.350000Z | 100.0 Hz, 8641936 samples
YA.UV07.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:36.770000Z | 100.0 Hz, 8643678 samples
YA.UV08.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:37.690000Z | 100.0 Hz, 8643770 samples
YA.UV09.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:33.290000Z | 100.0 Hz, 8643330 samples
YA.UV10.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:18.870000Z | 100.0 Hz, 8641888 samples
YA.UV11.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:28.610000Z | 100.0 Hz, 8642862 samples
YA.UV12.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:19.470000Z | 100.0 Hz, 8641948 samples
YA.UV13.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:37.410000Z | 100.0 Hz, 8643742 samples
YA.UV14.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:36.570000Z | 100.0 Hz, 8643658 samples

The endianness and encoding for the three letter stations (FJS, FLR, etc ...) is different from the UV?? stations. Here is YA.SNE.00.HHZ:

In [22]: st[4]                                                                                                  
Out[22]: YA.SNE.00.HHZ | 2009-12-28T10:56:04.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4703504 samples

In [23]: st[4].stats.mseed                                                                                      
Out[23]: AttribDict({'dataquality': 'Q', 'number_of_records': 1275, 'encoding': 'STEIM2', 'byteorder': '<', 'record_length': 4096, 'filesize': 156889088})

And here is YA.UV01.00.HHZ:

In [25]: st[5]                                                                                                  
Out[25]: YA.UV01.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:25.930000Z | 100.0 Hz, 8642594 samples

In [26]: st[5].stats.mseed                                                                                      
Out[26]: AttribDict({'dataquality': 'Q', 'number_of_records': 2503, 'encoding': 'STEIM1', 'byteorder': '>', 'record_length': 4096, 'filesize': 156889088})

Note that the byteorder for st[4] is <, while byteorder for st[5] is >.

If I write all the files with byteorder = >, SeisIO has no problem reading the new file:

In [31]: st[4].write("test.mseed",format="MSEED",byteorder=">")               

In [32]: st[4].data                                                           
Out[32]: array([14278, 14426, 14453, ..., -3565, -3450, -3456], dtype=int32)

julia> S = read_data("mseed","test.mseed")
SeisData with 1 channels (1 shown)
    ID: YA.SNE.00.HHZ                      
  NAME: YA.SNE.00.HHZ                      
   LOC: 0.0 N, 0.0 E, 0.0 m                
    FS: 100.0                              
  GAIN: 1.0                                
  RESP: a0 1.0, f0 1.0, 0z, 0p             
 UNITS:                                    
   SRC:                                    
  MISC: 0 entries                          
 NOTES: 0 entries                          
     T: 2009-12-28T10:56:04.968 (0 gaps)   
     X: +1.428e+04                         
        +1.443e+04                         
            ...                            
        -3.456e+03                         
        (nx = 4703504)                     
     C: 0 open, 0 total

julia> S[1].x
4703504-element Array{Float32,1}:
 14278.0
 14426.0
 14453.0
     ⋮  
 -3565.0
 -3450.0
 -3456.0

Seems that obspy is writing byte order incorrectly for these particular files.

chad-earthscope commented 4 years ago

The SEED manual specifically forbid anything but big-endian Steim compression. So, pedantically, those data (I presume from RESIF) are not valid SEED. One consequence is that there is no technical description of what little-endian Steim compression is. That being said, libmseed (used by ObsPy) and qlib2 both support the same flavor of little-endian Steim, and it's only defined by the software that writes it. So tools built on these libraries generally just work. This flavor of little-endian Steim existed before the FDSN declared that Steim was big-endian only. So that's the brief history of the situation, and these are the variations you find when re-creating the wheel of parsing miniSEED.

For maximum compatibility and adherence to the spec., Steim should be big-endian. It seems ObsPy is writing it out in the same endian it's read it. FWIW, the next major version of libmseed will not write little-endian miniSEED 2 at all, to avoid this confusion, it's just not worth any perceived advantage of "native" byte order.

jpjones76 commented 4 years ago

@chad-iris, do you mean that "little-endian word order" (Blockette [1000], field 4) is ambiguously defined in the SEED documentation? That's the problem I can't resolve with how these data are written. For example:

Which side of the UInt32 (in hexadecimal order) holds the unused bits in little-endian word order: left or right?
In the same word order, should integers be unpacked left-to-right or right-to-left?
Are the answers to both of the above questions always true, or are there cases where I need to change the read order (or the assumed positions of the unused bits)?

Should we open an issue with the ObsPy team to warn them about this? Writing Steim little-endian mini-SEED by default can undercut the FDSN standard, simply because ObsPy is so popular. It's certainly not just @tclements writing mini-SEED files; I'd imagine that ~10^3 other users are doing this on any given day.

chad-earthscope commented 4 years ago

@chad-iris, do you mean that "little-endian word order" (Blockette [1000], field 4) is ambiguously defined in the SEED documentation?

There is no clarity on what this flag means to each of the encodings. For the fundamental float/integer types it's rather obvious, but for encodings with quantities that are not byte-oriented, like Steim, it's undefined what that flag means.

For "little endian" Steim 1&2 the flavor used by qlib2 (and libmseed following it) is to byte swap the 16 and 32-bit quantities including the 32-bit word of nibbles and the 32-bit words containing quantities of anything but 8-bits. While it reads strange it's really just the same byte-swapping you need to do to read/write miniSEED on a machine of the opposite order, e.g. big-endian miniSEED on a little-endian machine. Put another way, you'll do the same byte-swapping to read the data on both architectures as you would do to write the data in both endians, so not so strange. A pragmatic suggestion, should you want to support/understand this, is to read the libmseed unpacking source and see where the swaping is done.

Should we open an issue with the ObsPy team to warn them about this?

Feel free. A couple things to note: a) the input data was little-endian Steim, which is quite rare, and is the source of the "problem", this is probably why ObsPy wrote it that way, and b) it is not immediately a problem or noticeable by many because a lot of software (based on the mentioned libraries) can read it.

jpjones76 commented 4 years ago

For "little endian" Steim 1&2 the flavor used by qlib2 (and libmseed following it) is to byte swap the 16 and 32-bit quantities including the 32-bit word of nibbles and the 32-bit words containing quantities of anything but 8-bits. While it reads strange it's really just the same byte-swapping you need to do to read/write miniSEED on a machine of the opposite order, e.g. big-endian miniSEED on a little-endian machine. Put another way, you'll do the same byte-swapping to read the data on both architectures as you would do to write the data in both endians, so not so strange.

I looked at this last night but was having trouble with an equivalent Julia implementation. The logic of swap operations in your code appears to be:

Always byte-swap the UInt32 that contains the "nibbles" values
Steim1:
- Don't byte-swap the packed UInt32 before unpacking Int16s or Int8s
- Byte-swap Int16 on unpack
- Byte-swap Int32
Steim2:
- Byte-swap the UInt32 if c_k > 1

Is that right? Also, is libmseed what ObsPy uses?

chad-earthscope commented 4 years ago

Always byte-swap the UInt32 that contains the "nibbles" values

Steim1: Don't byte-swap the packed UInt32 before unpacking Int16s or Int8s Byte-swap Int16 on unpack ** Byte-swap Int32

Steim2: ** Byte-swap the UInt32 if c_k > 1

Is that right?

In addition:

swap the forward and reverse integration constants, first two 32-bit values in the first frame for both encodings.
don't swap the uint32_t if it's 4x 1-byte differences

** Byte-swap the UInt32 if c_k > 1

Not sure what that means.

Also, is libmseed what ObsPy uses?

Yes.

jpjones76 commented 4 years ago

Unfortunately, after some testing, it appears that the added control loops needed to accommodate this word order would slow Steim decompression in all cases. I don't want to do that.

With apologies, I think the best course of action for now is for me to relabel this "out of scope" and close the issue.

I still very strongly encourage you to open an issue on the ObsPy GitHub page.

jpjones76 / SeisIO.jl

read_data with mseed integrity check fails #33