mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

schema development #33

Open pavlis opened 4 years ago

pavlis commented 4 years ago

Since I am really close to having a working yaml parser to build the MetadataDefinitions object, we need to get serious about how to build the default schema. First, I presume we should refer to that definition with the tag "mspass".

Next is the question of how to define the default namespace. I suggest we proceed as follows:

  1. Start by cloning the names used by obspy. Since obspy is viewed as a basic building block for the current project, that would only make sense. Agreed? If so, where do we find the full set of conventions? A hole, I think, in their documentation is that they do not have a single source defining all stats keys used in the distributed code. For testing I already created a file for the required attributes in the Trace object, but I am not sure how to find the others.
  2. We need to focus not on keywords but on "concepts" to sort out the set of established concepts used in seismic processing. I suggest we proceed as follows:
     a. Go through the SAC header attributes and extract additional concepts.
     b. Review the antelope/datascope schema and extend.
     c. Review the seismic unix header values and extend.

If you agree with this, then when we do this assembly we should make sure we record the keys used by each of these three main systems for indexing that concept, e.g. the station code is "station" in obspy (I think), "sta" in antelope, and "KSTA" in sac. These are important because I think we can and should build "aliases" definitions as a convenience for the user, i.e. we should make it fairly easy for the user to internally use sac, obspy, or antelope namespace aliases if desired. Undoubtedly not the first thing to implement, but doing this now will be trivial while digging it out later would require repeating the same exercise. The current MetadataDefinitions would make setting this up easy as it requires only a series of lines like this:

aliases:
  station sta
  station KSTA
  station site.sta

Once we have all the concepts sorted out we should probably put them in some logical grouping. yaml has a nice feature that can help us here: "pages" separated by --- lines, i.e. we could group the definitions into "pages" that would make searching easier. Alternatively, we might think about an extension to MetadataDefinitions that (optionally) groups attributes by category. That would be a trivial C++ addition, as all it need be is a multimap between group keys and attribute keys (sketched below).
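A minimal sketch of that grouping idea, in Python purely for illustration (the C++ version would be nothing more than a multimap from group key to attribute keys; the names are placeholders, not a decided schema):

# Illustrative only: group key -> list of attribute keys.
from collections import defaultdict

groups = defaultdict(list)
groups["source"] += ["source.latitude", "source.longitude", "source.depth"]
groups["receiver"] += ["receiver.latitude", "receiver.longitude", "receiver.elevation"]

# fetch every attribute key registered under a group
print(groups["source"])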

That aside, what is logical may not be clear until we complete this process. A priori, some options I can think of are:

  1. passive and active source attributes + maybe general
  2. source, receiver, waveform, field/raw (i.e. more useful for processing of field data but generally useless in waveform processing - e.g. "commtype" might be used to flag a station as "satellite", "cell modem", "standalone", etc., defining how and if a station communicates. Not needed immediately, but examples might be helpful to encourage IRIS PASSCAL to use our system.), and something generic like "other".
  3. Group by original source: obspy, antelope, sac, segy, and any other. I think this would be a bad idea as the large overlap would make this very confusing.

I tend to think 2 would be the most logical, but you should think about this. It might actually be smarter to not make that decision until we've assembled all the data.

This got long, but it is an important next step to do right.

wangyinz commented 4 years ago

Need to think carefully on this one, but I do tend to choose 2, too, as that seems to be a more rational scheme.

pavlis commented 4 years ago

New thought on this after the changes in format I made developing the yaml parser. The idea I want to put forward is to have attribute groups embedded in the yaml file. The names could be encoded into the MetadataDefinitions object (with some minor work, which would increase the size of the object a bit by requiring a multimap) or just used to make the file more readable. I tend to think the latter would be better as I don't actually see a major reason to make the groupings appear significant. As noted, there are lots of ways to group names together because many concepts overlap, e.g. latitude as a concept is one thing but it needs to be used in multiple contexts in seismic processing.

In any case, if we do this the yaml file might look schematically like this:

core:

This would require some minor changes to the yaml parser that I need to make anyway based on your comments on the pull request. We already decided we need the last one (extensions). This just adds more keys.

pavlis commented 4 years ago

Just checked in revisions that implement the group tags as noted above. For now the group tags are just labels in the yaml file and do not propagate downstream. I think it should stay that way unless we find a good reason to do anything else.

I will take on building the first draft of the schema file unless you are anxious to do so. My experience probably makes it easier for me. Confirm and I'll start in on that task.

pavlis commented 4 years ago

Defining Supported Metadata Keys for mspass

Action Items: The document below evolved from work I was doing to sort out how we should set up the attribute namespace (the data for the MetadataDefinitions object). It quickly became a tar baby and is yet another lesson of many I’ve learned over the years about what an absurd mess this particular problem is in any field. We are on the edges of the ugly world of an “ontology”. I suspect that, working at TACC, you have run across this word, but in my experience an ontology is every bureaucrat’s dream and every practical scientist’s nightmare. The first action I thus recommend is that we not seek outside help on this issue until we have a working prototype that keeps any such discussion in focus. If we don’t, we’ll get into useless discussions of formats, names, and who knows what else.

This got longer than I expected when I started. Some of my ideas evolved as I wrote them down, so there may be inconsistencies. When I was finished I went back and pulled out the items we will need to settle before we can proceed efficiently. These should be considered the definitive list, with the background preserved later for the record.

  1. We need to focus the problem to limit the universe for our initial development or the namespace will get out of control rapidly. I think the answer is we are aiming to produce fast waveform processing workflows. Data management or raw data is an external problem with multiple, working, and competing solutions. All interactions with such things are imports. Do you agree?
  2. We need a flat namespace convention. I recommend we adopt a groupname.attributename convention to make names like source.latitude and receiver.latitude obvious and easily remembered.
  3. We need to consider the possible role of saving static quantities like receiver coordinates in a common place like a single “document” in mongodb. This maps into the obspy approach to data handling, but I suspect it could create some bottlenecks if done without care. It definitely could complicate the data handling model. On the other hand, it might be possible to completely isolate this detail to readers and writers.
  4. How do we define the master names? I would strongly recommend we use extended CSS3.0 names and alias everything else. The reason is that it is the only community standard namespace. All others are implicit standards that arose from popularity of usage. This makes the choice easy to justify and greatly simplifies the decision making process. It will still require some extensions for active source data as css3.0 lacks any concept of active sources except nuclear explosion “events”.

Motivation: Interactions with MongoDB for C-based algorithms require attention to the type of any simple attribute (i.e. any quantity defined by a number or character string). At the same time, to provide a framework for novel processing methods, the framework needs to support generalized “types”, which in the python world means class/object. We can and should start with simple types, but make sure the framework can support anything. It is pretty clear that is so with a combination of python dict containers and the new generation of Metadata we are using, which uses boost::any to deal with unraveling any generic type. The documentation claims that boost::any works with any class that is copy constructible, which is true of anything I can think of that is appropriate for storage as Metadata. (Examples of things that wouldn’t work are most file handles, which don’t work in a multithread environment so they cannot be copied but require “move” semantics.)

The C++ library can, in principle, support any type it knows about through the MDtype enum class. Right now that only contains simple types, but there is no reason it couldn’t be expanded to some simple composite types like some of the std containers (set, map, vector, etc.). We should give some thought sooner rather than later to what composite types might be needed so they can be wired into the code now.

We need to start with standard scalar attributes that encompass most of what seismologists do. The purpose of this document is to describe how we sorted that out and define the initial set of supported attributes.

Approach: There are a large number of software systems out there to handle seismic data of all kinds. The number gets really huge if you include seismic reflection processing, which is necessary but can probably best be addressed by working with the IRIS-PASSCAL data folks who routinely work with active source data of all kinds.

Based on what software we know people use from IRIS surveys, I elected to start with four primary packages to address this question:

  1. The Seismic Analysis Code (SAC), which the IRIS survey shows is far and away the most commonly used software in our community.
  2. Obspy, which is becoming the new SAC.
  3. Antelope’s implementation of the CSS3.0 relational database schema. Their core namespace is exactly the attribute names in the CSS3.0 standard and thus can and probably should at least be a standard alias for many attributes.
  4. Seismic Unix is centered on a modest variant of SEGY. PASSCAL has traditionally used segy attributes for active source experiments, although they are currently moving to a new format they call PH5 that is built on top of HDF5. It is clear they have spent a lot of time rethinking the limitations of segy and have come up with what is clearly a superior namespace for active source data. This URL defines the keys clearly: https://github.com/PIC-IRIS/PH5/wiki. We should use PH5 as a reference point for active source data and consider segy and seismic unix only peripherally.

A problem we face is that all of these systems have different names for the same concept. Worse is the fact that they have colliding concepts. For example, every one of them has a different way of managing the start time of a waveform. Antelope uses the simplest approach of an epoch time. Obspy uses a class (UTCDateTime) with methods to spit out date strings in various formats or epoch time. SAC is just weird and I’ll say no more. Active source systems universally use shot time as a reference with an optional lag (which can be negative) for the time to the first sample. Much of that “collision of concept” comes from different systems being optimized for different problems and being born with different technological constraints. In any case, there is a challenge in sorting out what is core and what is up to a user to add if they need to extend the framework. It is a challenging tradeoff because we want to be as inclusive as possible, but not put shackles on development in areas with which we aren’t as familiar.
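As a tiny illustration of the start time issue only (nothing mspass-specific), the same instant as an antelope-style epoch time versus an obspy UTCDateTime:

# The same start time expressed two ways.
from obspy import UTCDateTime

t = UTCDateTime("2010-01-01T00:00:00.000000Z")
print(t.timestamp)  # epoch seconds (float), the css3.0/antelope convention
print(str(t))       # formatted string, the obspy convention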

A final issue is flat versus hierarchic namespaces. I hadn’t appreciated this issue until I dug into it. SAC and SEGY are traditional flat namespaces with every field in the SAC header having a name assigned to it. None of the others above are. Obspy uses a python class structure to handle metadata. That is, they have network, station, channel, and event hierarchies that they use to provide attributes like station coordinates, event coordinates, and channel orientations. Antelope uses the relational database paradigm to group attributes in a virtual hierarchy: site table, sitechan table, etc. Consequently, they resolve potentially ambiguous words like “latitude” to site.latitude, origin.latitude, etc. obspy classes, of course, use a similar symbolic form to refer to comparable attributes (i.e. the latitude example) but the context of station.latitude in python is very different from “site.latitude” in Antelope. Finally, PH5 is built upon HDF5. Given that HDF stands for Hierarchical Data Format, it is no surprise you find key words like receiver_t.azimuth, perhaps better cast symbolically as receiver_t->azimuth, in the PH5 documentation. The HDF5 documentation is enormous so I’m not yet sure how that API actually refers to parameters like this. It really doesn’t matter, however, as PH5, like SAC and SEGY files, must be viewed as an external format data set that will need to be translated to a form mspass can handle no matter what. We do not want to make the IRIS-DMC error of building the entire framework around a fixed data format (SEED).

The unambiguous conclusion is that a flat namespace format is archaic and needs to be supported only as a convenience. Hierarchic names like those used by Antelope (e.g. wfdisc.time) are an expedient syntax for relational databases since any piece of information has to be linked to a table (relation) and attribute name. Things like HDF5 were clearly developed with relational databases in mind as the grouping concept is similar. It is more generic, however, as the hierarchy can go to more than just two levels. (I don’t see any examples in PH5, but obspy effectively does this in their class structure.) In any case, a critical design decision right now is how to handle the hierarchic structure necessary for many types of metadata. As far as I understand it, we need this for a NoSQL database because all entities are indexed with simple key:value pairs. I guess the alternative is to have separate documents to store related, static Metadata. E.g. maybe we should have “source” and “receivers” documents that would map to things like css3.0 event->origin and site->sitechan respectively? This could seriously complicate read and save operations, however, as functions would have to sort out which attributes to put where. An item for discussion, but for now I’m going to assume we want to have a simple, flat namespace that is easy to understand and manipulate.

For simplicity, I suggest we require all symbols to be in this form: groupname.attributename, where groupname is some keyword like “receiver” used to define a group of related attributes and attributename is exactly what it spells out. This would allow a clean, easily remembered namespace convention. An example where this convention would be useful is receiver.latitude versus source.latitude, which are clearly both earth coordinates but for two different things (the receiver and the source).

I’m thinking that to make this easier for people to work with we should demand that all Metadata keys have this implied hierarchy. e.g. rather than using npts directly as a key, as obspy does in its stats dict, we could insist it be referenced with a name like seismogram.npts in the database. Readers and writers could, through the aliases mechanism we designed, use shorter names by default for some symbols, e.g. seismogram.npts could internally just become npts. I am not sure that would be wise for supported algorithms, however, so I would recommend we decide that all our code will use the groupname.attributename convention. This is a design decision we need to make and document here on github and/or in other documentation we build for this project.
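To make the alias mechanism concrete, here is a minimal Python sketch of the kind of mapping a reader or writer could apply. The alias table and the helper function are purely illustrative, not the MetadataDefinitions API:

# Illustrative only: map user-facing aliases onto master database keys.
ALIASES = {
    "sta": "site.sta",          # antelope-style alias
    "KSTA": "site.sta",         # sac-style alias
    "npts": "seismogram.npts",  # obspy-style alias
}

def to_master_keys(md):
    """Return a copy of a metadata dict with aliases replaced by master keys."""
    return {ALIASES.get(k, k): v for k, v in md.items()}

print(to_master_keys({"sta": "AAK", "npts": 1000}))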

As this document evolved, I realized the syntax itself could allow extensions to the framework with no changes in algorithms using mspass. I can think of two immediately:

  1. If we decided it would be helpful to store related metadata in different documents, we could modify readers and savers (none of which are yet written) to handle a set of group names specially. E.g. a writer could check the value of receiver.name (station name) and, if it already existed, ignore it. Only if the name was not yet defined would it be saved (see the sketch after this list). (Detail - actually in that example it would need to check at least the network code and station code given the modern stock conventions.) As noted, this might be ugly baggage that would only slow processing with excessive db transactions.
  2. More than two levels of hierarchy could be supported, although I can’t come up with an example that makes any sense that would require that. The idea, though, is that there is no reason we couldn’t name an attribute foo.bar.glp and have it imply a three level hierarchy.
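A rough sketch of the idea in item 1, in pymongo; the collection name and the net/sta keys are illustrative placeholders, not a settled schema:

# Save a receiver document only if no entry with this net/sta pair exists yet.
from pymongo import MongoClient

db = MongoClient()["mspass"]

def save_receiver_if_new(doc):
    query = {"net": doc["net"], "sta": doc["sta"]}
    if db.receiver.find_one(query) is None:
        db.receiver.insert_one(doc)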
wangyinz commented 4 years ago

This is a really long thread, and I barely found a slot in my fragmentary holiday schedule to fit in the reading, thinking and replying. Anyway, here I am. I am glad that you've been making quite a bit of progress on this. Let me write down some of my thoughts below:

  1. We need to focus the problem to limit the universe for our initial development or the namespace will get out of control rapidly. I think the answer is we are aiming to produce fast waveform processing workflows. Data management or raw data is an external problem with multiple, working, and competing solutions. All interactions with such things are imports. Do you agree?

Yes, I completely agree. This is actually the planned approach in our proposal anyway - to start with and focus on a couple of workflows that we are familiar with, and I think waveform processing is exactly the one.

  1. We need a flat namespace convention. I recommend we adopt a groupname.attribute name convention to make names like source.latitude and receiver.latitude obvious and easily remembered.

This is actually a condensed statement of the exhaustive discussion later on. While I don't have a definitive answer to this one, I do think it is actually a question of how we want to design the data model. In MongoDB, there are actually two different data models: the Embedded Data Model and the Normalized Data Model (ref). The former is more like the flat namespace you are referring to, with the addition of embedded documents, and the latter is more like a relational database. You can see that we can actually implement hierarchy in either of the two. In a lot of the applications out there, I believe the majority are actually done with a mix of the two, and the decision of which is better is mostly a trade-off between read and write performance. I guess that means we might need to better understand the data access pattern of different metadata first. However, as you mentioned already, the nice thing is that the syntax of foo.bar is flexible enough for us to change the data model internally when needed.
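To make the distinction concrete, a rough sketch of the same receiver information in the two models (the field names and values here are only illustrative):

# Embedded model: the coordinates live directly in the wf document.
wf_embedded = {
    "sta": "AAK",
    "site": {"lat": 37.2249, "lon": 78.2149, "elev": 1.245},
    "starttime": 1262304000.0,
}

# Normalized model: the wf document only holds a reference into a site collection.
site_doc = {"_id": "some ObjectId", "sta": "AAK",
            "lat": 37.2249, "lon": 78.2149, "elev": 1.245}
wf_normalized = {"sta": "AAK", "site_id": site_doc["_id"], "starttime": 1262304000.0}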

  1. We need to consider the possible role of saving static quantities like receiver coordinates in a common place like a single “document” in mongodb. This maps into the obspy approach to data handling, but I suspect it could create some bottlenecks if done without care. It definitely could complicate the data handling model. On the other hand, it might be possible to completely isolate this detail to readers and writers.

I think that is a reasonable design. I don't think it would complicate the model too much - this is just implementing the normalized data model for those static quantities.

  1. How do we define the master names? I would recommend strongly we use extended CSS3.0 names and alias everything else. The reason is that it is the only community standard namespace. All other are implicit standards from popularity of usage. This makes the choice easy to justify and greatly simplifies the decision making process. It will still require some extensions for active source data as css3.0 lacks any concept of active sources except nuclear explosion “events”.

I agree. I guess we don't necessarily need to worry too much about the active source data beyond the most important ones at this stage, which are probably the ones in the SEGY (or maybe PH5) format.

pavlis commented 4 years ago

Sorry for overwhelming you over the holidays - I actually had quite a bit of time to spend on this. Happy to see you didn't drop everything to respond - that is how you should prioritize your time.

Responses to close/extend each of the 4 items I listed as action items:

  1. Building waveform processing workflows is step 1. We can do that once we finalize a prototype namespace/schema design.
  2. The name convention of a.b, where a is a group/relation/table/document and b is an attribute name, is sufficiently flexible that we will use that convention for names. How this maps into the database is an implementation detail we need to work out. I think you are right that we will likely end up with a hybrid, with static data (e.g. css site or sitechan table data) stored in a separate document, but data linked to any object in the processing flow (i.e. Trace, TimeSeries, or Seismogram objects) should be dynamic without worries about duplication. Can't seem to stay concise on this, but the a.b syntax in combination with 4 below will allow me to start filling out a prototype schema.
  3. Addressed in 2: we'll implement static documents and a form of the normalized data model after we have more experience with MongoDB interactions.
  4. I'll build the prototype names anchored on the css3.0 names.
pavlis commented 4 years ago

Messed around with this a bit this morning and have a new suggestion. It is perhaps easiest to demonstrate with a change in the yaml configuration for MetadataDefinitions, such as this example for sta:

This example adds two new components to the struct(class) for MetadataDefinitions:

  1. master - defines the name under which the attribute would always be stored. For this example that is site.sta, which could be written/read directly with that tag or, if using a separate table, as the document site and attribute sta.
  2. mutable - the initial thought is a boolean that tells a writer the attribute should be saved if it changes. If false, it is immutable and essentially read only. A reader would load the attribute either directly as site.sta or indirectly via the document site and tag sta.

The idea is that we need ways to:

  1. Guarantee consistent internal names - here "name" would be the internal tag.
  2. Guarantee attributes stored with any of the possible aliases all get mapped to a unique database name.
  3. Provide a clean mechanism to reduce duplicates in data object and simultaneously assure uniqueness in master copies of potentially ambiguous names. master and something like mutable are my first thought of how to assure this, but there may be gotchas I haven't anticipated.

Some potential changes:

  1. With this model "name" should perhaps be changed to "internal_name"
  2. "master" should probably be called "master_document"
  3. Perhaps mutable should be changed to allow a larger range of options? That might just complicate things, however. With the current model the idea is that things like sta could only be written by specialized writers that would override the rules defined in the MetadataDefinitions object.
  4. A different approach, or an additional thing that we might want to define for an immutable attribute, is a link key. sta is a good example, because in modern data sta is not sufficient to define a unique instrument. It requires a minimum of net, sta, and chan, and there is also a time range for calibrated observatory quality data. The clear need is something like a sta_id, but as the documents you point to above suggest, we could just use the ObjectId MongoDB guarantees to be unique. A "site" document then might have this kind of typical entry: { sta: "AAK", lat: 37.2249, lon: 78.2149, elev: 1.245, starttime: something, endtime: something, id: }

Data objects could save the site.id data, but if the attribute was marked immutable it would be unnecessary and would be blindly ignored. Hence, if we follow this model we probably need something like a unique_id tag in the MetadataDefinitions configuration, i.e. if we decide on this model the above yaml lines for sta would become this:
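...
master: site.sta
mutable: false
unique_id: site.id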

The 4 potential changes above probably only confuse this. The master and mutable idea is the main thing to decide upon. These mesh perfectly with the new interface routine in a branch merge that I submitted last night. i.e. it would be fairly easy to add this functionality to that core routine needed by writers and updaters. A reader will need some of the extensions if we decide to use a master document for parameters like sta. The more I think about it the more I think that will be essential and not add a huge overhead if we manage it properly.

pavlis commented 4 years ago

Ian, I've thought about this some more and am suggesting we modify the previous MetadataDefinitions class in the C++ library in two ways:

  1. Add an api to implement an immutable/readonly lock on some parameters.
  2. Add methods that would abstract the idea of a normalized data model.

For item 1 I suggest this set of four methods be added:

/*! Check if a key:value pair is mutable (writeable). Inverted logic from the similar readonly method.

  \param key is the key used to access the parameter to be tested.
  \return true if the data linked to this key is not marked readonly.
    (if the key is undefined, false is silently returned)
*/
bool writeable(const string key) const;

/*! Check if a key:value pair is marked readonly. Inverted logic of the similar writeable method.

  \param key is the key used to access the parameter to be tested.
  \return true if the data linked to this key IS marked readonly.
    (if the key is undefined, this method silently returns true)
*/
bool readonly(const string key) const;

/*! \brief Lock a parameter to assure it will not be saved.

  Parameters can be defined readonly. That is a standard feature of this class,
  but it is normally expected to be set on construction of the object. There are
  sometimes reasons to lock out a parameter to keep it from being saved in output.
  This method allows this. On the other hand, use this feature only if you fully
  understand the downstream implications or you may experience unintended consequences.

  \param key is the key for the attribute with properties to be redefined.
*/
void set_readonly(const string key);

/*! \brief Force a key:value pair to be writeable.

  Normally some parameters are marked readonly on construction to avoid corrupting
  the database with inconsistent data defined with a common key (e.g. sta). This
  method overrides such definitions for any key so marked. It does nothing except
  a pointless search if the key hasn't been marked readonly previously. This method
  should be used with caution as it could have unintended side effects.

  \param key is the key for the attribute to be redefined.
*/
void set_writeable(const string key);

We could probably drop the inverted logic methods and implement that capability only in the wrappers, but it might make more sense to just code them in C++ as each is a one line wrapper around the other. It would be easier for the user, I think.

For item 2 I think we might follow the example in the MongoDB documentation that guided the previous comment and center this on a "unique_id" method. I suggest we define this with these two methods:

/*! \brief Test if a key:value pair is set as normalized.

  In MongoDB a normalized attribute is one that has a master copy in one and only
  one place. This method returns true if an attribute is marked normalized and
  false otherwise. (It will also return false for any key that is undefined.)
*/
bool is_normalized(const string key) const;

/*! \brief Returns a unique identifier for a normalized attribute.

  In MongoDB a normalized attribute is one that has a master copy in one and only
  one place. This method returns a unique identifier, which we define as a string
  of unspecified format, that can be used to identify a unique field in the
  database. This method should normally be used only on read operations to select
  the correct entry for what could otherwise be a potentially ambiguous key.
*/
string unique_id(const string key) const;

Only readers and writers will need to care about any of these methods. As my proto-documentation says, the normalization stuff should be treated strictly as useful for readers and not used by writers, other than specialized programs to build things like a document with station coordinates.

Let me know if you think I should proceed with fleshing this idea out. I've already modified the include files to reflect this suggestion. The yaml file structure will fall out naturally once we decide if this functionality is good or not. I suggest leaving exactly what the unique_id method returns as an implementation detail to settle once we play with this a bit. It may make more sense for unique_id to return a class/struct than to require a program to parse some fragile name convention. I left it simple for now since I wasn't totally sure how that should be done.

Also, to emphasize: the schema doesn't necessarily need any of this. The embedded data model could be implemented by simply never marking any attributes as normalized or readonly.

Let me know your thoughts on this so I can proceed.

wangyinz commented 4 years ago

You are way ahead of me on this. It took me a while to think through and understand these changes. I think I get some of the points, but I am not sure if I understand the design correctly. Basically, there are three added fields in each MetadataDefinition entry:

...
master: site.sta
mutable: false
unique_id: site.id

The master key specifies the master document (or table) that stores this MetadataDefinition entry. mutable basically determines whether this MD entry will be written back to the database. unique_id will be used when using the normalized (or relational) data model to specify where to look up the ObjectID of a Metadata class.

If I understand them correctly, I think all of these are very important additions to the original design. One thing I am not sure about is the value of the master key, which in the above example is site.sta. According to my understanding above, the value should really be site, as that is the valid document (or table) name. The sta is the name of the MD entry, which is just a key in the site document. Is that right?

pavlis commented 4 years ago

I jumped into this completely because it was pretty clear I had the far better background to solve the problem here - the db api and the schema. I wrote all that stuff here because (a) I needed it to sort out my own thoughts on the problem, and (b) the issues feature of github is a good way to permanently document ideas so they don't get forgotten.

Anyway, you have this as correct as could be expected from my rambling comments. Your point about unique_id is insightful, as it made me realize my own ideas are a bit vague on this point. The point is we need a way to make sure each unique key in MetadataDefinitions is unambiguous with respect to how CRUD operations should be handled. The normalized data model makes this more complicated, particularly when we want to make it generic. My aim will be to make sure it works in the MongoDB framework and to make the design as flexible as possible.

I'm going to implement the API revisions I described yesterday and experiment with this. We'll leave that on the experimental branch in case I reach a dead end.

pavlis commented 4 years ago

I think I have a pretty good initial design in the branch I just checked in and for which I issued a pull request. It will probably need some more tweaking, but it implements the things discussed above.

Some key points:

  1. The approach here will require us to produce some management code to help us (and users) build and maintain a working database with a subset of the data antelope implementations put in their "dbmaster" directory. That means initially site, sitechan, and the union of event and origin. In this design each of those css tables (with modifications) would map to a collection with similar names. The yaml file submitted has: site, sitechan, and source (I think that is what I settled on, but might be origin). The overall concept is we would build these once and they would be static.
  2. We need to complete the design of a wf collection (table) that has all the things we want to link universally with waveform data. I have the core in the required obspy Trace object attributes, but there are clearly others - especially for 3C data objects. The big point, however, is that: (a) there will be one document in the wf collection for each datum in the data set, and (b) each wf document will normally contain a siteid and source_id to link to the unique receiver and source documents to which it is related. chanid would make sense for TimeSeries or Trace objects, but not for 3C (Seismogram) data.

Let's assume in this comment that we are aiming for a pure database read and write approach (i.e. no serialization of metadata). With this model, reading a waveform stored in the database would require these basic steps (a rough sketch follows the list):

  1. Load everything (or a prescribed list of attributes) from the wf document for each data object in a data set (RDD). This would form the core Metadata for that object.
  2. Use the ids to query for the single document matching the id stored in wf. Read all or requested attributes from that document (e.g. a siteid would normally be used to load receiver coordinates).
  3. By an unspecified process, although likely pickle and gridfs, load the sample data. Somehow that would need to interact with constructors to build a complete object to form that RDD member.
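Very roughly, and only as a sketch in pymongo with illustrative collection and key names (gridfs_id in particular is a placeholder, not a decided attribute), the read sequence would look something like this:

import pickle
import gridfs
from pymongo import MongoClient

db = MongoClient()["mspass"]
fs = gridfs.GridFS(db)

def read_data_object(wf_id):
    # step 1: the wf document becomes the core Metadata
    md = db.wf.find_one({"_id": wf_id})
    # step 2: resolve normalized attributes through the linking ids
    site = db.site.find_one({"_id": md["siteid"]})
    source = db.source.find_one({"_id": md["source_id"]})
    md.update({"site.lat": site["lat"], "site.lon": site["lon"],
               "source.lat": source["lat"], "source.lon": source["lon"]})
    # step 3: load the sample data (here assumed pickled into gridfs)
    data = pickle.loads(fs.get(md["gridfs_id"]).read())
    return md, data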

Writers are a different story and are complicated by how much update bookkeeping we want to do. For this model, assume we will just dump all posted Metadata and not worry about what will be duplicates. That makes a writer simple (see the sketch after this list). For each seismic object to be saved:

  1. Ask for a new ObjectID and create an empty wf object.
  2. Use the new function I wrote that returns "all" Metadata as a python dict - we might need to add a writeable method that returns all attributes not marked readonly.
  3. We would want to override readonly for the linking ids. Almost certainly would want to save those.
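A comparably rough sketch of that writer loop; here md stands in for the python dict returned by the "all Metadata" function, and readonly_keys for the MetadataDefinitions readonly test (both names are illustrative):

from bson.objectid import ObjectId
from pymongo import MongoClient

db = MongoClient()["mspass"]
readonly_keys = {"site.lat", "site.lon", "site.elev"}  # illustrative
linking_ids = {"siteid", "source_id"}                  # always saved

def save_wf(md):
    # keep everything writeable, plus the linking ids even if marked readonly
    doc = {k: v for k, v in md.items()
           if k not in readonly_keys or k in linking_ids}
    doc["_id"] = ObjectId()   # step 1: new id for the wf document
    db.wf.insert_one(doc)     # step 2/3: dump the metadata in one shot
    return doc["_id"]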

This is a kind of mixed normalized and embedded model. We'd need some way to tell readers of partially processed data not to do the work of loading the metadata from these static collections. I don't think that is hard - probably a required boolean in the wf document.

Reaction?

wangyinz commented 4 years ago

I think having a "dbmaster" is necessary. To provide users an easy way to generate that, we probably need to consider making it work with the FDSN web services, and think of the best way of supporting the StationXML and QuakeML formats. I don't have much experience with these formats, but I think we can borrow a lot from ObsPy or likely just build on top of it.

For the wf collection, each document should be referenced by either siteid (for 3C data) or chanid (for a trace). The source_id should be optional since things like ambient noise data won't have that.

The read and write operations look fine to me. We do need to think about how to handle the duplicates. Probably duplicates do not really exist, or at least are not really an issue, since we do have the design to save some other metadata (e.g. the algorithm and the parameters used for the algorithm) to retain some provenance information for any document in the database. We just need to figure out the best way to do the bookkeeping.

pavlis commented 3 years ago

This thread is already long, but what I want to raise fits here, so I don't want to create a new issues page. This comes up because the database api is the agenda item for our conference call this week.

I want to suggest we add a new collection to the schema with the name arrivals. This collection would be used for normalization of wf when arrivals are needed. I think we should treat arrivals as something imported, not created. The most solved problem in seismology is network catalog creation. That problem is the central purpose of the CSS3.0 schema used in antelope and of other relational systems used by the USGS directly, indirectly in Earthworm, and by the ISC, IMS, and AFTAC (i.e. global catalog producers). The relational model has proven merit for that problem, so let's not try to compete there.

Some basic attributes in arrival:

  phase - phase name
  time - epoch time that defines this arrival
  type - some word that describes how the time was produced, e.g. 'measured' or 'predicted'
  model - used only if type is 'predicted'
  method - method of travel time computation for a theoretical prediction
  ema, azimuth - measured emergence angle and azimuth (as in css3.0)
  ux, uy - slowness vector components (the css attributes use a polar form I hate)
  site_id - link to the site collection entry for the place on the earth with which the arrival is associated
  source_id - link to the source (I think that is the name) collection entry for the event linked to this arrival

It might be better to have two collections: maybe call them 'picks' and 'arrival_times'. I think that might be awkward, but I'm not sure. Because a mongo document doesn't have to have every attribute defined for its collection, I think two collections would just be baggage.

An issue is how hard it would be in MongoDB to do a join of arrival with site and source. I think it is standardized, but it has been months since I looked at MongoDB's documentation.
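For what it is worth, MongoDB's standard way to express that kind of join is the $lookup aggregation stage. A rough sketch in pymongo, using the collection and field names proposed above (not a settled schema):

from pymongo import MongoClient

db = MongoClient()["mspass"]
pipeline = [
    {"$lookup": {"from": "site", "localField": "site_id",
                 "foreignField": "_id", "as": "site"}},
    {"$lookup": {"from": "source", "localField": "source_id",
                 "foreignField": "_id", "as": "source"}},
]
for doc in db.arrival.aggregate(pipeline):
    print(doc["phase"], doc["time"], doc["site"], doc["source"])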

pavlis commented 3 years ago

I want to suggest another collection. The need for this collection became apparent in designing ways to handle data downloaded from IRIS by different mechanisms.

The issue is that I think we should limit the wf collection to use during a working dataflow, i.e. it should be used only for initial, intermediate, and final data processing results. We need a different way to handle raw data assembled from various sources. Currently, that would mainly be local data archives, data downloaded from IRIS via web services, and data acquired from IRIS by other request mechanisms. Today that automatically means the raw data would normally be miniseed, but we may not want to be that restrictive. The reason I say that is that the obspy read function now supports a very long list of standard formats.

That said, I propose we define a "rawdata" or "raw_data" collection. The core attributes this collection should contain are:

A couple of key discussion questions we should consider in our call today:

  1. Should this collection only support file reading or should we allow a url? obspy has support for a url reader, but I'm not sure what it really does. Maybe we need a mechanism to make it easy to distinguish between a file read and a url read. Maybe if a document has dfile defined we use a file reader, but if a "url" attribute is defined we use a url reader.
  2. I am not sure if obspy's reader can handle an offset read. It appears to me that if you point their (obspy) reader at a file it just eats up the whole thing. That is unfortunate if true, but the use of the filesize attribute above would solve that problem. Our reader for raw data could copy filesize bytes to a /tmp file and then call their reader on the /tmp file (see the sketch after this list). Hopefully that won't be necessary, but it may be.
  3. The mover attribute is a generic concept that requires more thought. We should discuss if we want to go there now or put it on the shelf as a future development.
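On item 2, one possible approach that avoids the /tmp copy, assuming obspy's read accepts a file-like object (I believe it does for at least miniseed); the dir, dfile, foff, and filesize attribute names here are illustrative, not a settled raw_data schema:

import io
from obspy import read

def read_raw_segment(doc):
    path = "%s/%s" % (doc["dir"], doc["dfile"])
    with open(path, "rb") as fp:
        fp.seek(doc.get("foff", 0))     # jump to the start of this segment
        buf = fp.read(doc["filesize"])  # read only this segment's bytes
    # hand obspy an in-memory file-like object instead of a /tmp copy
    return read(io.BytesIO(buf))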

This model will require an indexer program that would look through a collection of files and build the raw_data collection index. The solution I have right now that could be turned into a working program quickly is to use antelope's miniseed2db to build the wfdisc, snetsta, and schanloc tables. From these we could build this index easily for miniseed data. Other formats would be a different matter. I need to develop something like this with some level of functionality to handle the usarray data we are planning to use for benchmarks, so a key discussion point is what functionality we should plan for the initial development.