mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Handling miniseed data #101

Open pavlis opened 3 years ago

pavlis commented 3 years ago

In a phone call yesterday we decided we needed to take action to deal with a fundamental flaw in obspy's miniseed reader. For the record, the issue is that although the reader can accommodate large files with many seismograms combined, it does so with the most memory-intensive algorithm possible: it reads the entire file at once, so a large file can easily generate a memory fault. It does this because the reader contains an implicit assumption, similar to (though not as bad as) SAC's, that the file system is used to organize the data, so the expectation is many small files. As we know, that collides badly with HPC systems with large disk arrays, where performance drops badly when handling large numbers of files. In fact, systems like Lustre use a database, much as we are doing in MsPASS, to simulate a file system.

Long background, but I wanted to introduce this issue for the record. What we need is a mechanism to build an index with file offsets (foff) that can be used to read any file in pieces. A good model is the css3.0 wfdisc table, which uses foff, dfile, dir, and nsamp to index miniseed files. We don't want to duplicate wfdisc exactly, but it defines the idea of the index we need for MongoDB.
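A minimal sketch of what one such index document might look like. The field names (dir, dfile, foff, nbytes, nsamp), the helper function, and the collection name are all assumptions modeled on wfdisc, not a committed MsPASS schema:

```python
# Sketch of a wfdisc-style index document for MongoDB. The field names
# (dir, dfile, foff, nbytes, nsamp) are modeled on css3.0 wfdisc and are
# an assumption -- MsPASS may settle on different keys.

def make_index_doc(dirname, dfile, foff, nbytes, nsamp):
    """Build one index document describing a chunk of a miniseed file."""
    return {
        "dir": dirname,    # directory holding the file
        "dfile": dfile,    # file name
        "foff": foff,      # byte offset of this chunk within the file
        "nbytes": nbytes,  # size of the chunk in bytes
        "nsamp": nsamp,    # number of samples in the chunk
    }

doc = make_index_doc("/data/wf", "event1.mseed", 0, 4096, 1000)
# With pymongo one would then store it with something like:
#   db.wf_index.insert_one(doc)
```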

The first step is how to deal with miniseed data? We have two options:

  1. Patch obspy's reader for our needs.
  2. Write our own lower level reader.

Both solutions will require us to use the miniseed library supported by IRIS DMC in general and Chad Trabant in particular. The code is on github here. I am about 99% sure obspy's reader uses this library with some python bindings I haven't sorted out yet. So, my next step is to reverse engineer obspy's reader to see if we want to just hack their code or start over. I've actually used Chad's C code before in a reader I wrote for the LIGO people, so I don't think this is a particularly horrible problem. It would be insurmountable if we didn't have access to a library like this. SEED has been called (a) an abomination and (b) a four-letter word for a good reason.

pavlis commented 3 years ago

Dug into obspy's reader a bit and hit something I hope you may know about. The documentation for obspy's _read_mseed contains this statement:

mseed_object Filename or open file like object that contains the binary Mini-SEED data. Any object that provides a read() method will be considered to be a file like object.

I've run across the idea of a "file-like object" in python documentation before but do not fully understand what its limits and practical applications are. My leap to a possible implementation from this statement is that perhaps we could:

  1. Build an index of dir, dfile, foff, and nbytes defining chunks of data in a large file.
  2. We could then build a MongoDB reader on top of obspy's reader (if I read this right) by doing something like an fread of nbytes of data and calling the reader on the resulting memory buffer.

There is an existing solution for item 1 in Antelope. That is, in a couple of hours I could probably cobble together something that would translate a wfdisc into MongoDB documents that define an index. The problem is that building the wfdisc would require running Antelope's miniseed2db program, which collides with open source. In this model I would probably still have to use libmseed to build the indexer. That should, however, be far simpler than all the complexity I see built into the obspy reader for handling various oddities of seed files. Then again, once I had to dive into that, it might be just as easy to build TimeSeries objects in C++. I won't know until I dig into this, but a key question is clarifying what a 'file-like object' means.
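The fread-style read in item 2 above could be sketched like this: seek to foff, read nbytes, and wrap the result in an in-memory file-like object. The obspy.read call at the end is the untested assumption here (it relies on the "any object that provides a read() method" claim in obspy's docs); the file name and offsets are hypothetical:

```python
from io import BytesIO


def read_chunk(path, foff, nbytes):
    """Seek to foff in a file, read nbytes, and return a file-like object."""
    with open(path, "rb") as fh:
        fh.seek(foff)
        data = fh.read(nbytes)
    return BytesIO(data)


# If obspy accepts any object with a read() method, as its docs state,
# a chunk defined by one index document should then be parseable with:
#   from obspy import read
#   st = read(read_chunk("big_file.mseed", foff=8192, nbytes=4096))
```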

wangyinz commented 3 years ago

I am not familiar with "file-like object" either, but based on my reading, the steps you listed above should work. We should be able to read a chunk of a large SEED file and turn that into a file-like object (probably with the BytesIO class here).

pavlis commented 3 years ago

Dug a bit deeper into the obspy miniseed reader and libmseed. I conclude:

  1. We do NOT want to mess around with obspy's reader. It is very complex, with some fancy features light years beyond my experience in python. It is not at all clear to me how they have bound libmseed into their code, though. It is all in this set of imports:

    from obspy import Stream, Trace, UTCDateTime
    from obspy.core.compatibility import from_buffer
    from obspy.core.util import NATIVE_BYTEORDER
    from . import (util, InternalMSEEDError, ObsPyMSEEDFilesizeTooSmallError,
               ObsPyMSEEDFilesizeTooLargeError, ObsPyMSEEDError)
    from .headers import (DATATYPES, ENCODINGS, HPTERROR, HPTMODULUS, SAMPLETYPE,
                      SEED_CONTROL_HEADERS, UNSUPPORTED_ENCODINGS,
                      VALID_CONTROL_HEADERS, VALID_RECORD_LENGTHS, Selections,
                      SelectTime, Blkt100S, Blkt1001S, clibmseed)

    in the file core.py of obspy.

  2. I am pretty certain that if we could build an index and use something like the BytesIO class you noted above, we could read miniseed files one seismogram at a time using an foff value. It might also be possible to just stuff the bytes into gridfs and index them with MongoDB. Then a reader could just pull the bytes from Mongo and run obspy's reader to crack the format and convert to working data. There is a potential gotcha, however, in that miniseed records can be multiplexed, and I am not sure if obspy's reader handles those either. That is a side issue though. I think a key thing is to test reading with an index using obspy before trying to build a custom indexing program. Not sure, but I think I might build an index with a wfdisc table and play around with this a bit before I dive too deep into libmseed.

  3. libmseed is the final point. What a beast! It was trivial to build and run Chad's tests, but how to use the library is really confusing, even with his examples (of which there are a number) as a guide. When I know more about what I need to ask him, I will plan to call Chad for help.
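To play with the read-by-index idea in point 2 before committing to libmseed, one could start from a naive indexer that walks a file in fixed-length records. Miniseed records are power-of-two sized, but the real length must be read from blockette 1000, so the 4096-byte default below is purely an illustrative assumption:

```python
import os

RECLEN = 4096  # assumed fixed record length; real code must parse blockette 1000


def naive_index(path, reclen=RECLEN):
    """Yield (foff, nbytes) pairs covering a file in fixed-length records."""
    fsize = os.path.getsize(path)
    foff = 0
    while foff < fsize:
        nbytes = min(reclen, fsize - foff)
        yield (foff, nbytes)
        foff += nbytes


# Each (foff, nbytes) pair could become a MongoDB index document, or the
# corresponding bytes could be pushed into gridfs with something like:
#   import gridfs
#   fs = gridfs.GridFS(db)
#   gridfs_id = fs.put(chunk_bytes)
```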

pavlis commented 3 years ago

I figured out how obspy uses libmseed, but I have no idea how they set this up. I presume they are doing some magic with their installer to build external libraries from source code. For the record, they store a set of compiled libraries in obspy/lib. The contents of lib are the following on a mac:

__init__.py
__pycache__
libevresp_Darwin_64bit_py36.cpython-36m-darwin.so
libgse2_Darwin_64bit_py36.cpython-36m-darwin.so
libmseed_Darwin_64bit_py36.cpython-36m-darwin.so
libsegy_Darwin_64bit_py36.cpython-36m-darwin.so
libsignal_Darwin_64bit_py36.cpython-36m-darwin.so
libtau_Darwin_64bit_py36.cpython-36m-darwin.so

The python that manages these is obspy/core/util/libnames.py. Somehow they create .so files that can be loaded dynamically with the code in libnames via import. I think what this means is that the authors of obspy built their own binding code to expose these libraries to python.

wangyinz commented 3 years ago

Yes, I took a look too, and it actually uses Python's ctypes module to operate directly on the dynamic library. This is the line that does the magic. Anyway, we cannot modify ObsPy's code and include it in our repository after we changed our license...
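For the record, the ctypes mechanism looks roughly like this; here the standard C math library stands in for obspy's platform-specific libmseed .so, since loading any shared library follows the same pattern:

```python
import ctypes
import ctypes.util

# Locate and load a shared library at runtime; obspy does the equivalent
# for the compiled libmseed .so files under obspy/lib.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes marshals arguments correctly.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0
```

This is why no hand-written C wrapper code appears anywhere in obspy: the binding is done entirely at runtime from Python.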