Closed tclements closed 4 years ago
I'll look into this, but can you tell me a little more about how the file was created? You say that it's mseed created in ObsPy; what endianness is the machine and what's the bswap situation look like?
I'm still working on changes to fix Julia's read slowdown in 1.3.0, so this might take some time before a fix goes live.
Also, could you upload the converted SAC file?
I used obspy.get_waveforms
to download the data into an obspy.Stream
object, then write each station individually. Code as follows:
import obspy
from obspy.clients.fdsn.client import Client
import os
client = Client("RESIF")
t1 = obspy.UTCDateTime("2009-09-16")
t2 = obspy.UTCDateTime("2011-6-19")
out = "~/VOLC/MSEED/"
while t1 <= t2:
try:
st = client.get_waveforms("YA","*","00","HHZ",t1,t1+86400)
except:
st = []
if len(st) != 0:
sta = list(set([tr.stats.station for tr in st]))
for s in sta:
tr = st.select(station=s)
filename = ".".join([tr[0].stats.network,
tr[0].stats.station,
tr[0].stats.location,
tr[0].stats.channel,
str(tr[0].stats.starttime.date),
"mseed"])
outfile = os.path.join(out,filename)
print("Writing: ",filename)
tr.write(outfile,format="MSEED")
t1 += 86400
This writes Steim2 by default.
Here is lscpu
on my laptop:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Here is the sac file
Well, I found the problem. There is no good news. It's another "feature" in SEED that isn't properly documented, and unfortunately, I no longer have time in my life to hunt and peck with coding guesses until I read the SEED creators' minds correctly.
There's a "word order" flag (not properly documented) in Blockette [1000] that affects the order in which one should assign Steim-compressed differences as they're unpacked from a UInt32.
Unfortunately, when word_order == 0x00, I don't know the proper order to uncompress the integer values. The SEED manual has only this cryptic nonsense to say about it: "The byte swapping order for 16 bit and 32 bit words. A 0 indicates little endian order and a 1 indicates big endian word order. See fields 11 and 12 of blockette 50." Those fields in that blockette contain exactly the same information stated in more words. This is mystery meat, effectively an undocumented flag.
This doesn't appear to be as simple as an "if" loop that reverses the order of unpacking when word_order == 0x00. I've tried that route and the integrity check still fails.
Unfortunately, what that means for your group is that I can't support mini-SEED with Steim variables packed in littleendian "word order" until someone shows me exactly -- I mean exactly, not some cryptic hint at -- the correct unpack order. I've spent 6 hours already tonight trying to guess what this word-salad manual is getting at and barely got anywhere. If I can't figure this out in a couple more hours, someone else will have to.
Sincere apologies.
For what it's worth, I suspect that word_order = 0x00 is very rare. This is the first time I've encountered it in ~20 years working with SEED format. I strongly suspect that the only way to even achieve that is to write to mini-SEED in ObsPy, because literally all other SEED that I've seen is bigendian.
Thanks for looking into this. Is this of any help - obspy has a record analyzer that shows the encoding format for blockette 1000.
python -m obspy.io.mseed.scripts.recordanalyzer ~/VOLC/MSEED/YA.SNE.00.HHZ.2009-12-28.mseed
FILE: /home/timclements/VOLC/MSEED/YA.SNE.00.HHZ.2009-12-28.mseed
Record Number: 0
Record Offset: 0 byte
Header Endianness: Little Endian
FIXED SECTION OF DATA HEADER
Sequence number: 1
Data header/quality indicator: Q
Station identifier code: SNE
Location identifier: 00
Channel identifier: HHZ
Network code: YA
Record start time: 2009-12-28T10:56:04.968300Z
Number of samples: 3105
Sample rate factor: 100
Sample rate multiplier: 1
Activity flags: 0
I/O and clock flags: 0
Data quality flags: 0
Number of blockettes that follow: 1
Time correction: 0
Beginning of data: 64
First blockette: 48
BLOCKETTES
1000: Encoding Format: 11
Word Order: 0
Data Record Length: 12
CALCULATED VALUES
Corrected Starttime: 2009-12-28T10:56:04.968300Z
I'll look into how I'm creating these mseed files with obspy.
Maybe @chad-iris could be of help?
I think I found the problem. The call to st = client.get_waveforms("YA","*","00","HHZ",t1,t1+86400)
returns a stream with 17 different station - channels:
st = client.get_waveforms("YA","*","00","HHZ",t1,t1+86400)
st
17 Trace(s) in Stream:
YA.FJS.00.HHZ | 2009-12-28T10:56:12.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4702704 samples
YA.FLR.00.HHZ | 2009-12-28T10:56:17.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4702204 samples
YA.FOR.00.HHZ | 2009-12-28T10:56:13.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4702604 samples
YA.RVL.00.HHZ | 2009-12-28T10:56:16.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4702304 samples
YA.SNE.00.HHZ | 2009-12-28T10:56:04.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4703504 samples
YA.UV01.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:25.930000Z | 100.0 Hz, 8642594 samples
YA.UV02.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:20.030000Z | 100.0 Hz, 8642004 samples
YA.UV03.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:18.870000Z | 100.0 Hz, 8641888 samples
YA.UV05.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:19.350000Z | 100.0 Hz, 8641936 samples
YA.UV07.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:36.770000Z | 100.0 Hz, 8643678 samples
YA.UV08.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:37.690000Z | 100.0 Hz, 8643770 samples
YA.UV09.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:33.290000Z | 100.0 Hz, 8643330 samples
YA.UV10.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:18.870000Z | 100.0 Hz, 8641888 samples
YA.UV11.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:28.610000Z | 100.0 Hz, 8642862 samples
YA.UV12.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:19.470000Z | 100.0 Hz, 8641948 samples
YA.UV13.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:37.410000Z | 100.0 Hz, 8643742 samples
YA.UV14.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:36.570000Z | 100.0 Hz, 8643658 samples
The endianness and encoding for the three letter stations (FJS, FLR, etc ...) is different from the UV?? stations. Here is YA.SNE.00.HHZ:
In [22]: st[4]
Out[22]: YA.SNE.00.HHZ | 2009-12-28T10:56:04.968300Z - 2009-12-28T23:59:59.998300Z | 100.0 Hz, 4703504 samples
In [23]: st[4].stats.mseed
Out[23]: AttribDict({'dataquality': 'Q', 'number_of_records': 1275, 'encoding': 'STEIM2', 'byteorder': '<', 'record_length': 4096, 'filesize': 156889088})
And here is YA.UV01.00.HHZ:
In [25]: st[5]
Out[25]: YA.UV01.00.HHZ | 2009-12-28T00:00:00.000000Z - 2009-12-29T00:00:25.930000Z | 100.0 Hz, 8642594 samples
In [26]: st[5].stats.mseed
Out[26]: AttribDict({'dataquality': 'Q', 'number_of_records': 2503, 'encoding': 'STEIM1', 'byteorder': '>', 'record_length': 4096, 'filesize': 156889088})
Note that the byteorder for st[4]
is <
, while byteorder for st[5] is >
.
If I write all the files with byteorder = >
, SeisIO has no problem reading the new file:
In [31]: st[4].write("test.mseed",format="MSEED",byteorder=">")
In [32]: st[4].data
Out[32]: array([14278, 14426, 14453, ..., -3565, -3450, -3456], dtype=int32)
julia> S = read_data("mseed","test.mseed")
SeisData with 1 channels (1 shown)
ID: YA.SNE.00.HHZ
NAME: YA.SNE.00.HHZ
LOC: 0.0 N, 0.0 E, 0.0 m
FS: 100.0
GAIN: 1.0
RESP: a0 1.0, f0 1.0, 0z, 0p
UNITS:
SRC:
MISC: 0 entries
NOTES: 0 entries
T: 2009-12-28T10:56:04.968 (0 gaps)
X: +1.428e+04
+1.443e+04
...
-3.456e+03
(nx = 4703504)
C: 0 open, 0 total
julia> S[1].x
4703504-element Array{Float32,1}:
14278.0
14426.0
14453.0
⋮
-3565.0
-3450.0
-3456.0
Seems that obspy is writing byte order incorrectly for these particular files.
The SEED manual specifically forbid anything but big-endian Steim compression. So, pedantically, those data (I presume from RESIF) are not valid SEED. One consequence is that there is no technical description of what little-endian Steim compression is. That being said, libmseed (used by ObsPy) and qlib2 both support the same flavor of little-endian Steim, and it's only defined by the software that writes it. So tools built on these libraries generally just work. This flavor of little-endian Steim existed before the FDSN declared that Steim was big-endian only. So that's the brief history of the situation, and these are the variations you find when re-creating the wheel of parsing miniSEED.
For maximum compatibility and adherence to the spec., Steim should be big-endian. It seems ObsPy is writing it out in the same endian it's read it. FWIW, the next major version of libmseed will not write little-endian miniSEED 2 at all, to avoid this confusion, it's just not worth any perceived advantage of "native" byte order.
@chad-iris, do you mean that "little-endian word order" (Blockette [1000], field 4) is ambiguously defined in the SEED documentation? That's the problem I can't resolve with how these data are written. For example:
Should we open an issue with the ObsPy team to warn them about this? Writing Steim little-endian mini-SEED by default can undercut the FDSN standard, simply because ObsPy is so popular. It's certainly not just @tclements writing mini-SEED files; I'd imagine that ~10^3 other users are doing this on any given day.
@chad-iris, do you mean that "little-endian word order" (Blockette [1000], field 4) is ambiguously defined in the SEED documentation?
There is no clarity on what this flag means to each of the encodings. For the fundamental float/integer types it's rather obvious, but for encodings with quantities that are not byte-oriented, like Steim, it's undefined what that flag means.
For "little endian" Steim 1&2 the flavor used by qlib2 (and libmseed following it) is to byte swap the 16 and 32-bit quantities including the 32-bit word of nibbles and the 32-bit words containing quantities of anything but 8-bits. While it reads strange it's really just the same byte-swapping you need to do to read/write miniSEED on a machine of the opposite order, e.g. big-endian miniSEED on a little-endian machine. Put another way, you'll do the same byte-swapping to read the data on both architectures as you would do to write the data in both endians, so not so strange. A pragmatic suggestion, should you want to support/understand this, is to read the libmseed unpacking source and see where the swap
ing is done.
Should we open an issue with the ObsPy team to warn them about this?
Feel free. A couple things to note: a) the input data was little-endian Steim, which is quite rare, and is the source of the "problem", this is probably why ObsPy wrote it that way, and b) it is not immediately a problem or noticeable by many because a lot of software (based on the mentioned libraries) can read it.
For "little endian" Steim 1&2 the flavor used by qlib2 (and libmseed following it) is to byte swap the 16 and 32-bit quantities including the 32-bit word of nibbles and the 32-bit words containing quantities of anything but 8-bits. While it reads strange it's really just the same byte-swapping you need to do to read/write miniSEED on a machine of the opposite order, e.g. big-endian miniSEED on a little-endian machine. Put another way, you'll do the same byte-swapping to read the data on both architectures as you would do to write the data in both endians, so not so strange.
I looked at this last night but was having trouble with an equivalent Julia implementation. The logic of swap operations in your code appears to be:
Is that right? Also, is libmseed what ObsPy uses?
- Always byte-swap the UInt32 that contains the "nibbles" values
- Steim1: Don't byte-swap the packed UInt32 before unpacking Int16s or Int8s Byte-swap Int16 on unpack ** Byte-swap Int32
- Steim2: ** Byte-swap the UInt32 if c_k > 1
Is that right?
In addition:
** Byte-swap the UInt32 if c_k > 1
Not sure what that means.
Also, is libmseed what ObsPy uses?
Yes.
Unfortunately, after some testing, it appears that the added control loops needed to accommodate this word order would slow Steim decompression in all cases. I don't want to do that.
With apologies, I think the best course of action for now is for me to relabel this "out of scope" and close the issue.
I still very strongly encourage you to open an issue on the ObsPy GitHub page.
Reading this mseed file prints integrity warnings:
RDMSEED: data integrity -- Steim-2 sequence #001194Q integrity check failed, last_data=2250.0, should be xn=2518.0
The file is written with Obspy. The first sample is the same with both readers:
Converting the file to sac in obspy and then reading the sac file in SeisIO gives the same data as the mseed file read in obspy