mspass-team / mspass

Massive Parallel Analysis System for Seismologists

https://mspass.org

BSD 3-Clause "New" or "Revised" License

32 stars 12 forks source link

Database api design #78

Open pavlis opened 4 years ago

pavlis commented 4 years ago

In our last call we discussed the fact that something we need to finalize soon is our database api for python. Ian, I know you said you would start this but in working on raw data handling this week I was conscious of this imminent need and thought I'd pass along some insights I have gotten as a newbie to MongoDB and implications for mspass.

The syntax of MongoDB's query language is weird and will be a barrier to many people using our package if we don't give them a crutch to walk with when they are getting started. Maybe a better analogy is one of those wheeled baby carts where a baby can get around before they learn to walk. I think good examples of what we need to build on are the waveform readers and writers. They require only the user know how to create a MongoClient and construct a Database object. That kind of abstraction of common database needs should be our primary design goal.
We can't cover everything. We need to come up with a set of initial common queries we can implement to simplify common needs.
Once someone sees how the db works, they will be more motivated to dig into the weeds of MongoDB's API. We probably won't need a lot of help on that, but we might want to design a set of functions/methods that have a "query" argument that would be the typical py::dict used by pymongo.
Here is a list of common functionality I think might be necessary/useful:
1. Return a cursor to wf with data sorted by some key (might default to source_id)
2. Return the result of grouping wf in a form that could be used to construct an RDD of ensembles. The atomic object level for some operations are ensembles so we either need a way to tell an algorithm how to group data or pass ensembles to provide a clean map operator.
3. We will need a preprocessing dbverify program that will apply generic tests of db integrity. This functionality will be essential to validate large inputs. What and how to structure this is not simple. My perspective is it should be extensible with a core set of tests we build and a mechanism to allow a user to build and apply custom tests.
4. We need some ways to print and/or plot some common metadata and/or summaries of metadata properties. Examples are: station coordinates or a map of same (may be able to use obspy's plotting for this), source coordinates (same point as stations with obspy), summaries of waveform properties (list of unique stations, list of unique sources, waveform time coverage plots, and likely others. This isn't an immediate need, but we will need to develop some of these along the way to validate test data. I've already run into many cases where it is hard to figure out what is going wrong because the data volume can't just be listed.
5. Certain subset queries are common and we might want to write a few python functions that return dict that can be used as a query on a collection. Here are a few I know of immediately: (a) soft test to match a source location (can't use equality for multiple reason), (b) soft test for a match on a site location, (c) a combination subset based on one or more of the SEED code (net,sta,chan,loc) and an (optional) time window, (d) a hard match for source document with a source_id, (e) a hard match for a site document with site_id, (f) a range of geospatial queries (box region or a circlular region) using MongoDB's archain geospatial query mechanism.

Enough for now. May add to that list as I think of other common examples.

JiaoMaWHU commented 4 years ago

Hey Gary, Ian and I are discussing the specific requirements for APIs and collection schema in the design doc. We will include the functionality you mentioned there.

pavlis commented 4 years ago

Here is another thought on this issue: Can we abstract creation of a spark rdd with a mongodb reader? In particular, suppose we have a MongoDB cursor created through this kind of common construct:

dbwf=db['wf']
curs=dbwf.find(query)   # query a passed argument or constructed above as python dict
rdd=build_rdd_from_cursor(curs)

where the name "build_rdd_from_cursor", of course, is silly and only ilustrates the idea. The point is each document would naturally define one piece of "data" that would be processd. Then readers could immediately be parallelized that took the document defined by each cursor member with a map operator (map takes rdd with mongodb docs as input and returns a TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble.

Is that feasible? Is i a "well duhh" statement?

pavlis commented 4 years ago

Had a few thoughts this morning on this topic assembling some things for our meeting in a few minutes.

I think we need to start with a common design principle in object oriented languages that a class should encapsulate a key concept and not mix concepts. With that in mind, I think we should simplify the class we have called Database. We should decide what concept Database should abstract. I would recommend it focus on waveform data interactions.
Operations to the auxiliary collections of site, source, channel and newly proposed raw_data, raw_data.ensemble, and arrival be subclasses of Database. Sensible names might be things like SiteManger, ChannelManager, etc. where each class would focus on a particular collection. If we do that the self for these objects should be at the collection level while Database is what the name implied - a top level pointer to Mongo
I think the constructor for the Database class should be simplied to use like this: db=Database('usarray2012') where usarray2012 is the database name. I say this because the existing class includes client as an argument so every script I write to use this class needs client=MongoClient(). The client arg should be defaulted to that line.
A completely different issue is that I think we need to change the names for the waveform handlers. The english works "load" and "save" have a meaning that is too close and will be confusing, I think, to readers. Recommend Database have a "readwf" and "writewf" method that automatically handle branching for all supported types. i.e. TimeSeries, Seismogram, TimeSeriesEnsemble, and SeismogramEnsemble.

pavlis commented 4 years ago

In working on the raw data handling code I ran across a few things that I think should impact our design:

We should consider the use of the period in attribute names carefully. I ran across a context where names like "source.lat" generated an exception. In other cases it does not, and I think I know why. I haven't fully tested this, but the documentation suggests if you use a set of names like source.lat, source.lon, source.depth, and source.time when that is translated into json/bson it is implicitly translated to a doc that would look like this (with a couple additions to show the structure:
```
sta AAK
station
{
lat 22.4
lon -110.4
depth 44.0
time 18989789790789790.0
}
```
That could be both confusing and I learned is error prone as I got a mysterious error I think was created by that translation. Hence, I recommend we not use periods in attribute names like that because I think it could create some mysterious errors.
In contrast I think using periods in collection names can be a good idea. It can be used to define what are called subcollections. In the database design document yesterday I suggested we have a seedwf.raw and seedwf.ensemble collection. On further thought I wonder if we shouldn't have it this way:
```
wf.seed   - collection indexing one net-sta-chan-time period per document
wf.seed.enemble - collection of seed data with data bundled into ensemble files.
wf.processing - collection used for assembled data sets in processing
wf.segy - collection that could be used in the future for segy data
wf.xxx - collection for data in format xxx
```
Here is a quote from the O'Reilly book I have by Bradshaw et al. on this topic:

One convention for organizing collections is to use namespace subcollections separated by the . character. For example, an application containing a blog might have a collection named blog.posts and a separate collection named blog.authors". This is for organizational purposes only - there is no relationship between the blog* collection and its "children" ...

I've found mongo accepts that kind of name find an you can even use this kind of construct in pymongo:

dbwf=db.wf.seed

and it resolves correctly. (You can also say dbwf=db['wf.seed'] and it works the same.) Hence, I think we ought to consider the potential organizational advantages of subcollections.

I reiterate for this record a statement I put in the design document yesterday. That is, I think we should not impose MetadataDefinitions restriction on preprocessing collections. All foreign formats will have a namespace mismatch and the natural namespace for the raw data collection is some dialect of the names for each concept. (e.g. the segy namespace has many versions of names for the same header position. e.g. I've used sierraseis, seismic unix, petrel, and promax and all have variations on the namespace for header slots) MongoDB provides a potentially powerful way to easily assemble a large dataset from original data stored in multiple formats. We might think about a general solution for namespace translation from a raw collection to MsPASS. i.e. it should be quite possible to define a generic alias structure that describes how to translate format XXX names/slots to the internal MsPASS namespace. Not something we need immediately, but our design should allow that.
I don't think we should attempt ProcessingHistory preservation in data assembly. At least we should put that on the shelf for a while. It is feasible, I suppose, but the complexity of variable formats and variations in how people assemble data would, I think, largely make that a waste of time. I know this because I tried to think about doing this with miniseed data that map fairly cleanly to TimeSeries, but it is still problematic,.

wangyinz commented 4 years ago

I am confused about the issue in the attribute name. According to our schema design, names like source.lat should never appear in the database. We only use that when reading normalized attributes into Metadata so that they appears to be belonging to a embedded document. When saving the Metadata, these attributes of dot notation will be put into the corresponding collections, and the collection name and dot will be removed then.

pavlis commented 4 years ago

I now have a very good reason why we shouldn't use names like site.lat and source.lat. A user may think this is ok, but if you try to naively save dict created from a TimeSeries or Seismogram Metadata container with a line like this:

doc=d.todict()
db.insert_one(dod)

You can get this kind of error if site.lat is defined:

InvalidDocument: key 'site.lat' must not contain '.'

I thus think we should only use names like site_lat for compound names.

Alternatively, if we want to preserve _ as a seperator for virtual ids like site_id (appropriate since the ObjectID is always referred to with the key '_id') we could use a different separator for compounds. e.g. we might use 'site:lat', 'site;lat', or 'site#lat'. Any other standard ascii characters, I think, could be trouble due to other uses as, e.g. operators. I think ":" would be the best. It is sometimes used as a separator while the others are rarely used that way.

pavlis commented 4 years ago

Ran across another practical thing we need to deal with in data handling. You guys may be working on a solution to this already, but let me put this out there anyway. A good way to see the problem is to see this document one gets if one naively dumps an obspy Trace stats array to MongoDB when the Trace object was created by their seed reader:

{'_id': ObjectId('5fbabf30cf250627eaf3039d'), '_format': 'MSEED', 'calib': 1.0, 'net': 'BK', 'sta': 'CVS', 'endtime': 1356947180.744538, 'mseed': {'dataquality': 'Q', 'number_of_records': 40, 'encoding': 'STEIM2', 'byteorder': '>', 'record_length': 4096, 'filesize': 685951488}, 'starttime': 1356943580.7695382, 'sampling_rate': 40.0, 'loc': '00', 'chan': 'BHE', 'npts': 144000, 'delta': 0.025, 'site_id': '5fb2e9d66fd117975915a927', 'source_id': '5fb2ea046fd117975915bc05'}

The issue is relics that are extraneous once the data are loaded like number_of_records and encoding. One mechanism is to use MetadataDefinitions to filter these out. I think that is what you may be thinking. It may also, however, be useful in the api to have a couple of lists with names like drop which are keys of attributes that if found are dropped on writes. The opposite would be force which would force a write even if MetadataDefinitions filtering would otherwise reject it.

I can see cases where drop would be necessary. e.g. I was just looking at the code for building a CoreSeismogram from 3 TimeSeries objects. Although I fixed it an pushed the change this morning, it would previously be easy to leave a relic vang and hang in such data. There may be similar bugs outside the window waiting to get in.

A user could use a force list to push attributes without having to mess with the schema definition we impose through MetadataDefinitions. I can definitely see a demand for this in research prototyping.

JiaoMaWHU commented 4 years ago

Hey Gary, I see what you are saying, Ian and I once discussed this, we agreed that any user-defined keys are allowed to be written into db. @wangyinz , Ian what do you think? do we change the plan?

wangyinz commented 4 years ago

My design could be totally wrong, but according to what I had thought of, by default there is no reason to drop any user defined keys when saving to database, because this will be a desired behavior for prototyping (no additional steps required to accommodate new attributes created by a new algorithm). MetadataDefinitions should only be used to filter out the aliases and ensure the basic schema that our MsPASS functions relied on is clean. Considering from this aspect, we should also provide the users an optional argument to exclude certain attributes. For example, in the attributes shown above, it will be the user's responsibility to exclude the whole 'mseed' key when saving to the database. Any thoughts?

wangyinz commented 4 years ago

p.s. I agree that we should change the . back to _. I will make that change right away.

JiaoMaWHU commented 4 years ago

Well, I think since we've already offered users the option to have their own metadata definition, then we shouldn't store any keys that are not in the definition.

pavlis commented 4 years ago

Perhaps a simpler way to deal with how rigid MetadataDefinitions is or is not is to have a boolean argument that we might call "save_undefined". When True a save would ignore the safeties and just save without testing. When false, which I think should be the default, attributes not defined in MetadataDefinitions would be dropped.

To be more flexible could have an argument that behaved like branch for a switch-case (why python doesn't have switch case is beyond me given how complete the language is but that is beside the point). i.e. we could have 'undefined_behavior' or some similar name. Could have options:

drop do not store undefined attributes (that should be the default to minimize unexpected behavior)
save_all opposite of drop (save everything but still impose alias mapping and type checking of defined attributes
exclude use an exclude list (that would require an additional argument that was a list of keys to exclude if encountered. That list should default to an empty list. If done his one would have to be handled carefully in this sense. You would NOT want to call clear on exclude key without first making a deep copy in case downstream processes need the excluded metadata. That behavior would be subtle, but it could be a useful way for processes to pass intermediate results that don't need to be stored in the database. An extreme version of that would be a large object like some 3D earth model that could, in principle, be moved around as metadata but it would be real evil if it saved many times.
force opposite of exclude where a list would be required to force a set of undefined attributes.

I think no matter what option we use we need a standard processing module clear_metadata with this signature:

def clear_metadata(d,clear_list=[],save_history=False)
    '''
    d - a MsPASS supported data object
   clear_list - list of keys to be cleared in d
   save_history - usual boolean controlling history logging
   """

That function is pretty trivial, but needed no matter what to clear debris and document that it was done.

Down the road we will also need to think about a general metadata manipulator that doesn't require python coding. A common concept many of our users are familiar with is "header math" processing. I won't dwell on this here because with a python command language such a feature seems a bit unnecessary but it might be useful to have a few basic operations like setting a constant, adding a constant, etc. The main advantage these would provide over python statements within the job script is they would need to be set up to add a history record.

wangyinz commented 4 years ago

So that actually goes back to our original design, which I don't think is bad anyway. What do you think? @JiaoMaWHU

JiaoMaWHU commented 4 years ago

Right now, the default behavior is only saving attributes that defined in the metadata definition, if Save all is asserted, we will also save attributes that are not in the definition. I think this solution actually works. If there are no problems, I will stick with this.

pavlis commented 4 years ago

Roger that - I think what you describe is a simpler implementation and clearer than the complicated thing I described earlier. Make sure you update the docstring to make the behavior is defined.

pavlis commented 4 years ago

Jinxin I like what you have done to improve db.py. I see, however, we have an obvious hole in the current implementation. A standard acronym in database jargon is CRUD - Create, Read, Update, and Delete. We have CRU defined but no D.

pavlis commented 4 years ago

I thought about putting this in the google doc on our database design, but I'm confident enough about the topic of this comment I think it appropriate to put it in this issues page.

The thing I've encountered and been pondering is how to handle an issue network operators deal with on a daily basis: multiple estimates of the location parameters for an earthquake. In CSS3.0 this is handled by using two relations called event and origin. There is one and only one event for a particular "seismic event" while there may be many origin estimates for the same "event". This has already bit me in raw data handling in trying to reconcile conflicting location estimates obtained from IRIS web services and estimates used by the ANF in their catalog construction.

With that background, I have this suggestion. I recommend we add this attributes to the source collection:

- name: alternate_ids
   type: list
   concept:  List of ObjectIDs of alternate source parameters for this event.
   read_only:  True

where I have used our yaml format for the definition. This assumes three things:

We can safely handle lists in our definition. Because this is in a auxiliary collection on would should not normally load this into Metadata anyway, but I think we could now if needed. This suggests we need another parameter in MetadataDefinitions that defines an attribute as do not load ever as metadata.
Given 1 I think the list could be actual ObjectIDs instead of their string representation, but that is an implementation detail.
The list should not contain the ObjectID of the document in which the list is contained. Self referential documents strike me as dangerous, but if we do this we need to establish a clear rule - the list either must always or never contain the ObjectID of itself.

Why this will be useful is it provides a generic way to define what origin is considered definitive for an analysis. Association function can establish whatever rules they want to look through the alternate origin list and decide which one to use.

Note also this will demand a means to define and set the alternate_id attribute. I would recommend we build a fairly generic time and space associator that would use something like the older db code for handling this issue i.e. specify a time and distance cutoff that defines points as equivalent.

pavlis commented 4 years ago

Ran across an issue in Jinxin's current implementation of save_data. I'm posting it here because there is probably more than one way to address this issue and I am not sure I know the best way as more of rookie python programmer.

The issue is that save_data is dropping some necessary metadata because they are marked readonly in the yaml schema file. The essential ones are net, sta, chan and loc. In testing my seed importer I discovered that created a problem unless I require a match for site, source, and channel to set the ids (those aren't marked readonly). In many ways that is right since once we have seed data indexed net,sta,chan and loc are baggage. The problem is in an import if I can't find the entry for a waveform I have drop it or write an ambiguous wf record.

I can think of two solutions to this problem.

Simply add a force argument to the save_data method in the Database class. force could be just a list of Metadata keys that should be bypass the read only test.
We could add one or more fields to the yaml file to support the general problem of how to handle keys that can appear in multiple collection (normalized or no). Since it seems we are likely to need a revision of MetadataDefinitions in python this seems an appropriate time to consider this. I am thinking maybe just one additional field is needed:
```
collection : site   # we already have this
used_in : wf
```
where collection is the defining table and used_in defines collections where this key's value will be automatically saved.

Right now I tend to think 1 is the better solution as it is simple and matches the model that you can put anything you want in the metadata provided you do your own housecleaning on those attributes. For that reason we need a force type functionality anyway.

wangyinz commented 4 years ago

Actually, I was thinking of something similar to 2 yesterday. Since we are re-implementing MetadataDefinitions in Python, we probably should think of a better design. There are actually two different things needed, one is the Schema for the database, and the other is the Schema for the Metadata in TimeSeries and Seismograms. For the first one, we probably should utilize MongoDB's schema validation support so that we can ensure the users won't be able to alter the important collections (i.e., wf, source, etc) with native MongoDB API. The revised MetadataDefinitions will be used for the second one only.

However, we don't want to define the two with two different yaml files, and we probably don't need two neither. Right now I'm thinking only defining the database schema in the yaml file and we can then add a section in the yaml file to help derive the metadata schema from the database ones. For example, we can have something like below:

wf:
  - name: delta
    type: double
    concept: Data sample interval in seconds
    aliases: dt
  - name: site_id
    reference: site
  - name: channel_id
    reference: channel
  - name: source_id
    reference: source
    optional: true
  ...
site:
  - name: _id
    type: ObjectId
    concept:  ObjectId used to define a particular instrument/seismic station
  - name: net
    type: string
    concept: network code (net component of SEED net:sta:chan)
    aliases: network site.net channel.net wfdisc.net
  - name: sta
    type: string
    concept: station code assigned to a spot on Earth (sta component of SEED net:sta:chan)
    aliases: station site.sta channel.sta wfdisc.sta KSTNM
  - name: loc
    reference: channel
  - name: lat
    type: double
    concept:  latitude of a seismic station/instrument in degrees
    aliases: STLA site.lat
  ...
channel:
  - name: _id
    type: ObjectId
    concept:  ObjectId used to define a particular component of data
  - name: net
    reference: site
  - name: loc
    type: string
    concept: location code assigned to an instrument (loc component of SEED net:sta:chan)
    aliases: location site.loc channel.loc
  - name: chan
    type: string
    concept:  channel name (e.g. HHZ, BHE, etc.) - normally a SEED channel code
    aliases: channel KCMPNM channel.chan wfdisc.chan
  - name: hang
    type: double
    concept:  Azimuth (in degree) of a seismometer component - horizontal angle
    aliases: hang CMPAZ
  ...
source:
  - name: _id
    type: ObjectId
    concept:  ObjectId used to define a particular seismic source
  - name: lat
    type: double
    concept: Latitude (in degrees) of the hypocenter of seismic source
    aliases:  source.lat EVLA origin.lat
  ...
Metadata:
  - name: delta
    collection: wf
    readonly: false
  - name: site_id
    collection: wf
    readonly: false
  - name: net
    collection: site
    readonly: true
  - name: site_lat
    collection: site
    readonly: true
  - name: chan
    collection: channel
    readonly: true
  - name: source_lat
    collection: source
    readonly: true

While writing above, I realize that we also need to figure out a way to handle the difference between TimeSeries and Seismogram (e.g., whether the chan should be an array or not, whether a tmatrix should be defined). I guess the question is: are we going to mix TimeSeries and Seismogram in the wf collection? I guess we could do that and we only need to make channel_id an array for Seismogram and make tmatrix an optional attribute. We probably should also think about an ensemble.

pavlis commented 4 years ago

What you say above is totally reasonable. I trust your judgement to proceed sensibly on that one.

As to the question of how to handle differences between TimeSeries and Seismogram I think there are two approaches:

Handle this issue completely in the database.py code Jinxin is writing. In that case, the schema just needs to define the attributes in a MongoDB world
We could put Seismogram and TimeSeries objects in different collections. That, in fact, might not be a bad idea as it would remove the need for an attribute to distinguish them that and a test every single time a datum is accessed. The more I think about it the smarter I think it would be to have a wf.TimeSeries and a wf.Seismogram.

pavlis commented 4 years ago

In helping Jinxin stamp out some bugs in his code I realized we have a generic state complexity issue. By that I mean I think we need to have a clear understanding of what different states a TimeSeries or Seismogram object may be in when the save is attempted. Let me try to organize that as it is a general design issue for MsPASS.

First, there is state that can be related to concepts we preserve in ProcessingHistory but which a writer may have to infer without reference to ProcessingHistory:

Data can be an an ORIGIN or RAW state. There are two cases a writer has to handle if the data are in this state:
1. If these data have never been saved before, which is the case I'm dealing with at this moment trying to import SEED data, they have to be handled specially.
2. If the data were previously saved, there is no point in replacing the waveform data if they haven't been altered, BUT it is possible other components of the data are altered (more on that momentarily).
Data are what ProcessingHistory would call VOLATILE. Saving of data currently VOLATILE by definition means save all as it means we are in a writer in the middle workflow done to save an intermediate result or a writer at the end of the workflow to save the final results.
If data are in a SAVED trying to save the data again would only produce a second copy. We need to allow for the possibility this could happen unintentionally with some algorithm that silently passes on data without doing anything to it while altering others. I can imagine that happening, particular for machine learning algorithms that might select subsets on the fly in an iterative algorithm. (I may be wrong on this as I only know enough to be dangerous on this topic, but ML may not be the only case this could happen).

Next is the issue of state caused by what elements of the object need to be saved? A complexity of our design is that both Seismogram and TimeSeries have four distinct pieces of data that can be altered independently and are based on completely different concepts:

The waveform data is the core concept we are supporting. Some but not all algorithms will alter these data.
What I've called Metadata, which here means all the stuff that is related to the waveform data that isn't waveform data. There are algorithms that alter one or more Metadata attributes, add new attributes, or delete attributes (pretty much CRUD in a container).
The ProcessingHistory object level history data. A complexity of ProcessingHistory is at present we have aimed to not demand it be enabled in the framework. That means, an additional state issue is that ProcessingHistory may be null or (worse) incomplete and we have to infer that unambiously.
The error log is a different concept we have to treat differently. A fundamental issue of error log is we normally hope that elog will usually be totally empty. As Jinxin has already implemented, however, in python it is trivial to detect the empty condition like this:
```
n = elog.size()
if n == 0:
do some thing when elog is empty (usually return or set some return code)
```

Finally, there is one other overriding state: is the data marked dead? This is the easiest one to handle as it has only two branches for a writer: (1) if the data are live continue on into the save logic, (2) if they are dead only save the error log and (if not empty) the ProcessingHistory data.

I hope this can provide a starting point to understand the issue we have to solve for handling this complexity in the writer Jinxin is working on. I hadn't fully appreciated this until I saw his current implementation and realize how difficult this problem is because of this complex state issue.

I think we first should all make sure we understand and agree on the points made above. i.e. do I have this right or did I omit something or add something unnecessary? Then I think we need to enumerate the specific states we can expect the writer needs to handle one way or another.

pavlis commented 4 years ago

I had another thought on the Database design. It maybe should be somewhere else as it is really a schema issue, but it could impact the API because of the different concept it incorporates.

The concept here goes back to something I remember we discussed in the 1980s when I was a member of the first IRIS Data Management Center standing committee. IRIS paid a bunch of money to a consulting firm to address what kind of queries seismologists would likely make for a data center. IRIS got their money's worth as they gave an answer that remains as true today as it did in the 1980s. The answer is that research data is generally directed inone of three concepts I can describe to us as these: (1) common source gathers (e.g. earthquake source studies), (2) common receiver gathers (certain kinds of structural studies) and (3) data linked by some relationship to a point on the earth not defined by a source or receiver. Item 3 is the thing we haven't been thinking about we need to support. In seismic reflection processing this concept has a whole range of names with different approaches used to define what it is. The same is true with earthquake data. e.g. what Shawn does using PP precursors uses source-receiver midpoints (well approximately - not exactly for deep sources but that is a detail). In our (Ian and I and other of my students that is) use of plane wave migration we use a source associator that links data from multiple sources to an average (centroid) position on the earth.

With that background I propose we define a geo collection. We could be more verbose and call it something like geo_points, geographic, terra (if you like latin), or something else you might think more descriptive. I suggest we use geo because is is simple and relatively easy to remember like wf. Here are is a core list of attributes I think documents in the collection could contain:

_id - the Object ID always inserted by MongoDB that could be put in wf as geo_id when used
latitude, longitude, depth - define a point on the earth of interest
point_type - some descriptive string of what this point means for selection
associations - optional list of wf_id values for waveforms linked to this point.

The associations idea is redundant if some associator puts geo_id values in waveforms found in wf. These would provide the reverse mapping from geo to wf. Not particular essential for any processing as all on would need to do to form geo_id ensembles is sort a find operation to drive processing by geo_id.

pavlis commented 4 years ago

Writing the previous comment reminded me of a related topic best broken out into a different box. That is, our API, I think, needs an ensemble reader that can be used to assemble an RDD of data grouped by one of the id keys: source_id, site_id, or the new geo_id proposed above.

wangyinz commented 4 years ago

Yes, I agree that we need to make clear about the different states of data object for the database API. I have some comments below:

First, there is state that can be related to concepts we preserve in ProcessingHistory but which a writer may have to infer without reference to ProcessingHistory:

The ProcessingHistory object level history data. A complexity of ProcessingHistory is at present we have aimed to not demand it be enabled in the framework. That means, an additional state issue is that ProcessingHistory may be null or (worse) incomplete and we have to infer that unambiously.

I think we probably shouldn't rely on the ProcessingHistory to infer the state exactly because of number 3 here - they were by design an optional component. I think what we have right now which handles the raw data differently from the processing data with different modules works fine. I guess the only issue will be handling the "SAVED" state. I tend to assume the users need to be conscious of what they are saving, and if it is a duplicate, it will be a duplicate. Nonetheless, duplicates won't break the database since we will assign a different _id to it. One way to minimize the impact on a clean data set is probably default the save to a different collection dedicated to processing like a "scratch.wf".

Finally, there is one other overriding state: is the data marked dead? This is the easiest one to handle as it has only two branches for a writer: (1) if the data are live continue on into the save logic, (2) if they are dead only save the error log and (if not empty) the ProcessingHistory data.

I am not sure how we can save the error log and ProcessingHistory without a corresponding record in wf. Currently we designed the wf_id -> history_id as a one-to-one mapping, and the wf_id -> elog_ids as a one-to-many mapping. I am not sure how the history or elog document can be queried without an associated wf_id. So, maybe we should just ignore the deads, but print some warnings on the number of dead data being dropped in a database save?

wangyinz commented 4 years ago

btw, I concur on the geo collection design.

pavlis commented 4 years ago

A couple replies to your reply:

Using ProcessingHistory to infer state. I agree we have provide a mechanism to save error logging without processing history enable. I stated it badly, but my point was the database writer has to make valid inferences of state WITHOUT ProcessingHistory. Let me emphasize I wrote that comment to enumerate all the possible states I could think of that we had to deal with. I did not give a solution because (a) we need to be as sure as possible we have all the possibilities defined so we don't finalize a design that causes problems and (b) I don't know for sure how to solve the problem.
I completely disagree about how to handle data marked dead. Data automatically or manually killed will be the most important data to log as killing represents the most serious error around short of an abort from something like a seg fault or malloc failure. I suggest we redesign the schema or what goes into elog to solve the problem you note. I have a suggested solution in the next paragraph that I think will include both to those - i.e. a schema change and a change in what is saved.

First, let's look carefully at wha the ErrorLogger object has defined in the existing API. We can use one or more of the following methods to retrieve the contents (comments stripped):

  void set_job_id(int jid){job_id=jid;};
  int get_job_id(){return job_id;};
  std::list<LogData> get_error_log()const{return allmessages;};
  std::list<LogData> worst_errors()const;

Noting a minor blemish in the current include file is it is not properly organized. I usually am more careful about listing the order of methods in a class description - ErrorLogger is a mess. Jinxin has used these correctly calling the first two to get the job info and then working through the LogData list generating one document per entry in the list of LogData.

So, let's consider what states the data may be in when we are trying to save elog this way. They are:

If we are saving live data and we a save done earlier in the same function (Jinxin's algorithm) wf_id can be saved and we have not problems with that approach.
If saving the waveform fails we have a very different problem and all we can do is abort at the python level and post a abort message. I don't think we need to do anything to handle that possibility because python can handle it.
Data was marked dead and carried forward until a save happening later in the workflow.
We are aiming to prune the dead branches (data) from the data set. This is a function we need to define in the Database class or as an independent function. That module is critical if a processing step is expected to do a lot of pruning.
There are independent states of whether or not ProcessingHistory is valid or not. All of this goes away if ProcessingHistory is complete as we could just link elog entries to saved history documents. We probably should at least stub a mechanism to handle data if history is on or just do it. I'm not sure how much effort we should put on this right now but we need to at least include this possibility in the design as an alternate workflow that preserves kills.

Have I missed anything? Is there another state the data can be in we need to consider?

I thought about maybe using alternate keys defined by a finite set of Metadata, but couldn't think of a way to do that without creating something cumbersome and/or awkward. I come down to a simple suggestion: I propose we write the Metadata AND ErrorLogger content to a collection we can properly (and cutely) call "graveyard", "boot_hill", "cemetery", or some other memorable name. These entries should NOT save the waveform data but only include Metadata and ErrorLogger data. If you want to continue halloween style theme call these the skeleton of the data. It might be sensible to put the two pieces in two subdocuments with the tags metadata or Metadata and error_log. The first would look like a wf document entry and second would look like a elog entry. What do you think of that idea?

pavlis commented 4 years ago

An issue that to me is an unknown in dealing with dead data is how spark or dask will handle a member of an rdd (forget what dask calls the equivalent) that needs to be pruned and throw away? I have never seen that discussed in documentation I've read for Spark, but that has to be a common feature for any system that uses one of these schedulers. All need some mechanism for gracefully handling errors. What is it?

wangyinz commented 4 years ago

Yeah, I realized that we never made clear what a dead object means. There are actually two distinctively different ways to make an object dead: killed due to some quality assurance process, or killed due to some unrecoverable errors. The data in the former is probably good and it does not matter much what the elog and history of it were. The latter's data is likely problematic and users will want to inspect the elog and history to figure out what error has occurred.

Right now, the two scenarios were mixed and we won't be able to tell easily whether a dead object is killed on purpose or unexpected. Now thinking more about it, I guess we probably shouldn't mark scenario one dead. Instead, any quality assurance alike algorithms should act like a filter in the MapReduce term, such that they should simply drop the unwanted data objects. Our global history mechanism should be able to capture such as filter operation by recording the list of wf_id that goes in v.s. comes out. It is not necessary to record the object level history for the dropped data.

Then, for scenario two we definitely want to record the elog and history. I was wrong on the statement I made above about the wf_id -> history_id and the wf_id -> elog_ids mappings. I neglected that we actually have a global history record (yet to be designed) to link to the history_id and elog_id of a killed object. Therefore, I think it is actually OK to save the history and elog of a dead object without saving the object to wf. This way, we don't have to have a "graveyard", which seems could complicate the database structure. The metadata of a dead object is probably not as helpful or important as the elog and history. We might optionally save the metadata to elog if really necessary since the MongoDB is flexible enough to allow that anyway.

pavlis commented 4 years ago

Progress on this point. You are right that there are two ways data can be killed - with an error condition or intentionally. However, to build a reproducible framework, which is the purpose here, both need to be logged in a way that would allow recreation of the same sequence. Manual kills, in particular, must be preserved as they represent a huge cost and are a major fact in irreproducible results.

Some followup points:

I agree the links with ids should not be required BUT that demands we always preserve enough information to identify which datum was killed and how elog (must always be saved) and history (if saved) are linked.
I think the simplest solution to the previous point is to just always save the full content of Metadata to elog for any killed datum. If it isn't there it isn't recoverable anyway.
An axiom for ALL MsPASS algorithms must be: "If you kill it log it why". Any experienced programmer and data processor knows this, but it should be stated clearly in our documentation on how to write a MsPASS function.

wangyinz commented 4 years ago

Yeah, I think that's a good design.

Now, this is a follow-up on the previous discussion of reimplementing the MetadataDefinitions. After struggling for a while, this is the design (and implementation) that I come up with: https://github.com/wangyinz/mspass/commit/4e1483eaa3b975b36137345deb10b0df3c592583. Note that I was working from @JiaoMaWHU's branch just because we didn't have the db submodule separated in the master. This implementation pretty much follows the discussion we had several threads above with some minor tweaks. The description at the top of the mspass.yaml file is pretty much the whole idea. Now we have the schema for the Database separated from that for the Metadata container. We might even put them in two separate files, but I think it unnecessary.

The major issue I have not sorted out is how we should handle the wf_Seismogram collection and the Metadata for Seismogram. Right now I am inclined to define all its channel related attributes as array type so that we have a mirrored structure with the TimeSeries. They could be defined optional so that we don't introduce unnecessary baggage. What do you think?

BTW, in order to make the Schema class clean, I have to use "_" instead of "." for the collection names. This is because Python will falsely interpret the dots when accessing (but not for defining) attributes defined like this: https://github.com/wangyinz/mspass/blob/4e1483eaa3b975b36137345deb10b0df3c592583/python/mspasspy/db/schema.py#L207 Therefore, I suggest we switch to the notation for the other collections as well. I guess it actually works better for consistency since we are already using "\" for the attribute names.

pavlis commented 4 years ago

Response:

Separate yaml files - don't do that unless we have to. I see no obvious reason that should be necessary, but it may appear when we have a complete database implementation.
I think the best solution for TimeSeries is to make the orientation information optional. There are many examples where orientation is baggage. e.g. in the old days there were lots of vertical only channel systems in earthquake recording. In active source data that remains very common. If you have a data set with only verticals forcing orientation would be very annoying and cumbersome. Seems one way to define orientation is actually to allow multiple descriptions since orientation is a generic concept. The css3.0 convention is one choice in JSON notation like this:
```
orientation_method geographic
hang 90.0
vang 90.0
```
or
```
orientation_method cartesian
normal_vector [1.0, 0.0, 0.0]
```
where the idea is to add orientation_method that allows multiple ways to store that information to add flexibility.
Your use of _ confirms my experience. I hadn't thought it would create that many problems with collections, but given what I've seen I'm not surprised. This is kind of a syntax collision between Mongo shell syntax any pymongo. e.g. in mongo shell you can say things like dbsite=db.site or `dbsite=db['site'] equally while in pymongo this gets confusing if you do something like this:
```
def dbfunction_example(db,collection='wf')
dbwf=db.collection # I am pretty certain this will create an error
dbwf=db[collection]  # this I know works because have used it many times
```

wangyinz commented 4 years ago

I assume by TimeSeries you actually mean Seismogram. Because TimeSeries is single channel, the channel information should be fine there. The Seismogram will have the issue you mentioned, and I think we should just make the association of wf_Seismogram with channel as optional. That should be fine for the orientation info because we always save the tmatrix to the wf_Seismogram collection, so no channel info is needed there. I am fine either way with the orientation_method addition, but I do think it is not that important right now since it won't matter much for the data we will be using.

pavlis commented 3 years ago

I have run across some issues I think we need to discuss here for the latest branch I pulled from Jinxin noted immediately above here. I took his branch and fixed a few bugs to produce a working piece of code that will take the raw miniseed data files I have from running obspy's downloading methods (the raw_data_handling tutorial output) and import the data as TimeSeries objects into the wf_TimeSeries collection stored in gridfs. I will not that in the process I fixed a serious bug discussed elsewhere in handling absolute and relative time standards. I also fixed the problem of automatically handling type mismatches and converting when possible. Those details are best left to comments on the pull request when I'm ready to commit this stuff.

There are two issues this exercise has revealed. The first was expected. To handle the full USArray data set I will need to parallelize the ensemble reader so the workflow can be parallelized. I currently takes several minutes just to crack one event files, do some modest preprocessing, and save the result to gridfs. There are around 10,000 event files I've downloaded so that would translate to something in the range of 100,000 minutes of wall time if done serially on this little machine I have here (that is about 70 days - feasible calculation but pretty intimidating to just get prep data). This illustrates well, however, the reason why MsPASS is needed to handle data of that scale.

The second issue is more the topic of this comment. The current Database class doesn't have docstrings explaining how this is supposed to work, but from the code I think I understand. I was thinking about writing the docstrings but before I do that I realize I need to be sure I understand the ideas correctly. Further, we may want to tweek the API a bit before this get set in concrete.

Let me start with the core data writer:

def save_data(self, mspass_object, storage_mode, update_all=True, dfile=None, dir=None,
                  exclude=None, collection=None, schema=None):

Let me first state what I understand all of these arguments are given the current method has no docstring:

mspass_object is obvioustly the data to be save
storage_mode can be 'gridfs','file', or 'url'. url is currently not implemented as I expected but I know gridfs works.
update_all controls how Metadata is or is not written to the database. Note I changed the default to True from what Jinxin's branch had. That is absolutely the correct default because otherwise users could mysteriously find they lost attributes they created in a workflow.
dfile and dir are used only in file mode and their use is obvious and needs no further discussion.
exclude is list of keys of Metadata to ignore in output. A strong need in the api and None is the appropriate default. More on this below, but right now there is an overlap with something Jinxin created called "skip_list" we need to eliminate.
collection is obviously the MongoDB collection name to write the Metadata documents into. It is a little weird that the default is None but the reason is that we made this decision to split collection into wf_TimeSeries and wf_Seismogram that adds that layer of complexity. Raises the issue of whether or not that decision was the right one or whether we should just use "wf" for processing data and use an attribute easily sorted out by a subset for a reader. No strong opinion on that point, but this is a bifurcation point where we should decide for sure if that decision is a good one or not.
schema is the new python version of MetadataDefinitions that is used for regularization of types and other issues that will come up below. The Database class of which this function is a method creates an undocumented default for this feature that really confused me when I first looked at this. The docstring must clarify that they should give schema ONLY if they wish to change the default behavior.

I have two suggestions for improving this implementation immediately:

The current implementation does not automatically identify attributes cleanly that are "normalized" and shouldn't be saved in a wf_x collection. Jinxin did this by creating skip_list, but hard wiring those that was is not a solution. Using defaults for exclude would be a better way to do the same, but it suffers the same problem with hard wired keys that will create a maintenance headache. The solution is to make more effective use of the schema class that by depending mainly on the readonly method and use exclude only as a the blunt instrument it is.
There is a hole in the current methods for handling ensembles that I have filed with a prototype I called _sync_ensemble_metadata. The previous version was dropping Metadata posted to the Ensemble's Metadata.
That is, recall that Ensemble objects have a separate, global Metadata container that is intended to be used to post attributes common to all members of the ensemble. The example data I was handling were "event gathers" so the data posted there was source information (source_id, source_lat, source_lon, etc.). Those attributes previously were dropped. That capability is obviously essential, but the topic for discussion has some minor potential side effects that could be eliminated. A key question for this discussion is if making that change is necessary? To clarify this some more, the way I did this was basically to copy the ensemble Metadata contents to each member before calling the internal method that saves the Metadata. That means, that with this algorithm the save function has a side effect we must at least document: the members are all modified by merging the ensemble's metadata to each member. That violates a lot of design guidance but I can't forsee how it would cause problems other than adding a small amount of memory bloat by the duplications to all members. On the other hand, it can be removed with some modest changes where we only work on a copy of each member's Metadata and merge the ensemble metadata with copies before calling the update_metadata method. The more I think about this the more I think we should do that as it is cleaner and the cost is tiny. We probably do need to make sync_ensemble_metadata an algorithm though along with the inverse that would be useful for a reader (i.e. something to extract and erase a list of attributes to be pushed to the global Metadata).

One different item is we may want to think about possible extensions to readonly. i.e. maybe it shouldn't be just boolean but describe a set of ways and attribute should be handled by readers and writers. ids are a case in point. Once set they are readonly in the collection where they originate but cannot be readonly when saved as a cross-reference in something like wf_TimeSeries. Hence, maybe the field in the yaml file set as "readonly: true" or "readonly: false" should be something that captures a broader range of options? Before we go down that road we should think more about what those options might be. If it is only the id problem we could just add a new boolean to make an attribute as an id AND identify the collection where it is uniquely defined.

JiaoMaWHU commented 3 years ago

Hey Gary, thanks for pointing this out. I just pushed a few huge commits to my branch which is targeted to solve the hardcoded skip_list problem. I'll look into your opinions more thoroughly once I merged the pull request.

JiaoMaWHU commented 3 years ago

Also, the doc is the immediate action I need to take once I finish the merge. Since my branch has changed, the 'schema' parameter no longer exists in the method. But I think your interpretation of other parameters is correct after a quick glance.

wangyinz commented 3 years ago

I have the branch merged. Note that we are still in need of the fix for exception handling. There is one caveat @pavlis: the latest commits in this branch was force-pushed, which means that you may not be able to rely on git to automatically merge/rebase with the local branch that you've been working on.

wangyinz commented 3 years ago

btw, a lot of the issues discussed above are already addressed after this review: https://github.com/wangyinz/mspass/pull/98#pullrequestreview-559278096

wangyinz commented 3 years ago

102#discussion_r553493929