Open pavlis opened 4 years ago
Brief followup here from our phone conversation for the record.
I looked at the MongoDB documentation and found I was wrong about BSON supporting an XML data type. However, I think the model for handling metadata in our source and site documents we discussed is still feasible and in fact prudent. Let me summarize it here for the record.
Please add to this if you have any additional insights.
Another item related to this. I just ran across this in the obspy 1.2 documentation. I think it is new in 1.2 and is the result of their Jane development. Here is the documentation page for the current features. It is not at all clear to me what is needed to make this work. This appears to be the way they are handling variants in metadata for the same station or event. Look especially at the client section.
Wanted to show you this as I think it demonstrates the idea. This will work, but it uses a dumb algorithm that simply writes the contents blindly into the database site collection without checking whether an entry already exists. I can implement that check pretty quickly and will try to do so today, in fact, but wanted to pass this along to you. The docstring says this, but basically this function creates an image of an obspy Inventory object in MongoDB, pulling what I viewed as universally necessary attributes as named key:value pairs. The obspy Inventory representation of the original xml file is stored with pickle.dumps. That should allow restoring the full inventory data faster than parsing the raw stationxml file. I've tested it with pickle.loads and it is lightning fast, as we would expect. Anyway, here is the prototype code for you to play with if you choose:
import pickle

import pymongo
from obspy import UTCDateTime


def _extract_edepths(chanlist):
    """
    Parses the list returned by the obspy channels attribute
    of a Station object and returns a dict of unique
    edepth values keyed by loc code.  This algorithm
    would be horribly inefficient for large lists with
    many duplicates, but the assumption here is the list
    will always be small.
    """
    alllocs = {}
    for chan in chanlist:
        alllocs[chan.location_code] = chan.depth
    return alllocs


def dbsave_inventory(db, inv, firstid=-1):
    """
    Saves the contents of all components of an obspy Inventory
    object as documents in the site collection.  No checking
    for duplicates is performed, so every component of the
    input inventory will generate a new document in the collection.
    The design depends on using some other mechanism to
    sort out pure duplicates from legitimate station definitions
    that handle issues like orientation changes of components,
    gain changes, etc.  We emphasize this point because this
    function should be used with care for large data sets.
    Careless application to the output of bulk downloads
    with obspy web services could generate large numbers
    of unnecessary duplicates.  Use this function only when
    the inventory object(s) to be saved do not have excessive
    numbers of duplicates.

    :param db: is a top level MongoDB database handle.  The
      algorithm immediately uses it to look up the site collection.
    :param inv: is the obspy Inventory object of station data to save.
    :param firstid: sets the initial value for the site_id internal
      integer id key.  If negative (the default) it will be
      extracted from the database.  If the existing site
      collection is huge that can be inefficient, but
      using anything but the default should rarely be needed.
    :return: integer giving the number of documents saved.
    """
    # This constant is used below to set endtime to a time
    # in the far future if it is null
    DISTANTFUTURE = UTCDateTime(2051, 1, 1, 0, 0)
    # site is a frozen name for the collection here.  Perhaps
    # it should be a variable with a default
    dbcol = db.site
    if firstid < 0:
        x = dbcol.find_one(sort=[("site_id", pymongo.DESCENDING)])
        if x is None:
            # empty collection - start the counter at 0
            site_id = 0
        else:
            site_id = x["site_id"] + 1
    else:
        site_id = firstid
    nsaved = 0
    for x in inv:
        # The Inventory object I got from a web service download
        # makes the entries here net:sta combinations.
        # We can get the net code like this
        net = x.code
        # Each x now has a stations field, BUT tests I ran
        # say for my example that field has one entry per
        # x.  Hence, we can get the sta name like this
        y = x.stations
        sta = y[0].code
        starttime = y[0].start_date
        if starttime is None:
            print('station ', sta, ' does not have starttime defined.  Set to epoch 0')
            starttime = UTCDateTime(0.0)  # epoch 0 time
        endtime = y[0].end_date
        if endtime is None:
            print('station ', sta, ' has no endtime defined.  Set to distant future')
            endtime = DISTANTFUTURE
        latitude = y[0].latitude
        longitude = y[0].longitude
        # stationxml files seem to put elevation in m.  We
        # always use km so we need to convert
        elevation = y[0].elevation / 1000.0
        # loc codes may have different edepths, which
        # obspy sets as a depth attribute.  This little
        # function returns a dict keyed by loc with edepths
        chans = y[0].channels
        edepths = _extract_edepths(chans)
        # Assume loc code of 0 is same as rest
        # loc = _extract_loc_code(chanlist[0])
        picklestr = pickle.dumps(x)
        all_locs = edepths.keys()
        for loc in all_locs:
            # build a fresh document each pass - insert_one adds an _id
            # key to the dict it is given, so reusing one dict would fail
            rec = {}
            rec['loc'] = loc
            rec['site_edepth'] = edepths[loc]
            rec['net'] = net
            rec['sta'] = sta
            rec['site_lat'] = latitude
            rec['site_lon'] = longitude
            rec['site_elev'] = elevation
            rec['site_starttime'] = starttime.timestamp
            rec['site_endtime'] = endtime.timestamp
            # needed and currently lacking - a way to get a unique
            # integer site_id - for now do it the easy way
            rec['site_id'] = site_id
            rec['serialized_inventory'] = picklestr
            dbcol.insert_one(rec)
            nsaved += 1
            site_id += 1
    return nsaved
I just noticed the docstring statement in that code that says the idea would be to filter such things out later. I've changed my mind: because this is done only on small data sets, the check can be done inline with little to no pain. As noted earlier, that is what I'll try to implement today.
Made some progress on this yesterday. I have a fully functional function to load quakexml files into the source collection and stationxml files into the site collection. I decided to make the site version avoid duplicates based on the key net:sta:loc:ondate:offdate, i.e. it tests whether a document with those 5 keys already exists and will not add a new document if one has the same values. Those are the same keys as css3.0 and seem to be what FDSN is using. The source collection is different. It seemed to me that, to provide the same functionality as css3.0, which allows multiple origins for the same "event", we should allow duplicates. There I do no checking and assume linking data to the right document in the source collection is a secondary problem. I am certain this will require some changes to normalize_source to deal with the new structure.
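The five-key duplicate test can be sketched in pure Python (a hypothetical helper; with a real MongoDB collection the same test would be a five-key query with count_documents):

```python
def site_is_duplicate(existing_docs, doc):
    """Return True if a document with the same net:sta:loc:ondate:offdate
    values already exists.  existing_docs is any iterable of dicts here;
    against MongoDB the equivalent is a five-key query on site."""
    keys = ("net", "sta", "loc", "site_starttime", "site_endtime")
    return any(all(old.get(k) == doc.get(k) for k in keys)
               for old in existing_docs)
```

With pymongo the same check would be something like `db.site.count_documents({k: doc[k] for k in keys}) > 0` before calling insert_one.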
I am close to finishing this, but have a question. It seems you wanted to rearrange the python module directories. Did you check that change into master? Also, where these belong in the tree of names for import is not so clear to me. I am almost finished testing six primary core python functions:
These provide most of the functionality we should need for source and receiver data. There are some bells and whistles for embedded queries but those will probably be used only by you and me and users who know mongodb queries.
So where should I put these? Clearly somewhere in the mongodb directory if you retain that, but what should the module they live in be called? Should we rename one of the existing files and put them together?
Note - I probably won't be able to act on this for a couple of days but would like to know your opinion to avoid excessive commits of later reorganizations.
That is a tricky question. So, I have the work in progress on a new branch named reorganize_python. The organization basically follows what I listed in #54. I think everything you listed here belongs in mspasspy.db. If we put them there, we should omit all the "db" prefixes in the function names. I think you should be able to just add these functions there, as I am done working on that module.
Ok, step one will be changing names like dbload_inventory to load_inventory. Might get this checked in today, but it depends on how smoothly debugging the last function (load_catalog) goes.
That went faster than I expected. How do you want me to check this in? Thinking I should pull the reorganize_python branch and add the new functions to the db module.
Other question, however: where should I install the test program? It has data files that go with it. I suppose I could automate the test and have the files downloaded by web services, but that could make the test really lengthy, as the data files I use for stations came from an event download of all lower-48 broadbands. I could change the scale of the web service request and still have a reasonable program for a test suite. We DO need a mechanism for a clean test suite for python modules like this. I await your guidance on where to put this and how to handle this and other test suite programs.
I am done with the reorganizing, so it is safe for you to check in those changes to the reorganize_python branch. Tests are the next thing I will be working on. For now, just check in what you have into the test directory; I will try to get all tests working under some modern framework. Not quite sure how to do that yet, but I at least know where to look for tutorials on that topic.
I placed the new functions in the db.py module and a test program in mspass/python/tests. The revisions are on the reorganize_python branch. The test program has the verbose name test_inv_cat_dbfunctions.py.
This test program is particularly problematic because it is driven by two files created using obspy's web services modules. I did not want to muddy up the repository with a set of data files that were messy and probably much larger than needed. It should be possible to reduce the test to a handful of stationxml files concatenated into a single test file and a quakexml file produced similarly. The first step, however, is to make sure you can reproduce this test in some approximate form.
To run the test you will need to do two things. First, here is the script I used to test obspy's mass downloader. It is a minor variant of the example (see here) they give in their documentation:
import obspy
from obspy.clients.fdsn.mass_downloader import RectangularDomain, \
    Restrictions, MassDownloader

domain = RectangularDomain(minlatitude=20.0, maxlatitude=54.0,
                           minlongitude=-135, maxlongitude=-50)
# This is from the obspy tutorial for the Tohoku event
origin_time = obspy.UTCDateTime(2011, 3, 11, 5, 47, 32)
restrictions = Restrictions(
    starttime=origin_time - 15 * 60,
    endtime=origin_time + 3600.0,
    reject_channels_with_gaps=True,
    minimum_length=0.95,
    minimum_interstation_distance_in_m=100.0,
    channel_priorities=["HH[ZNE12]", "BH[ZNE12]"],
    location_priorities=["", "00", "10"]
)
mdl = MassDownloader()
mdl.download(domain, restrictions, mseed_storage="wf",
             stationxml_storage="sitexml")
Note this will produce both miniseed files in wf and stationxml data in the sitexml directory. The test only uses the sitexml data.
The final thing is to create a test file of quakexml data from web services. I did that with a one-line command, but I'm having trouble finding the exact tutorial page that shows the process. I think you will need to use obspy's get_events function with the right incantation to tell it to save the results to a file in xml format.
When that is done you will need to manually edit the file names in mspass/python/tests/test_inv_cat_dbfunctions.py to point at the data you downloaded. I reiterate that for a test suite program we should limit the size of the station and source xml files and just put two much smaller files in the tests directory. You will quickly learn that there is a large overhead in cracking the xml files. The data I was using had about 1200 stations and 2700 events; both reads took tens of seconds to complete. Not awful, but not something you want to do all the time. The db load routines are lightning fast by comparison, even when reconstructing the full Inventory and Catalog objects with pickle.loads. The parsers for the xml data are clearly a nasty barrier.
I just realized you may need to tweak the test program depending on how you select events. I have queries to test retrieving a single event that use source_id values from my example.
Cool! I can see the test dataset will become a problem if we don't standardize how we do the tests. Ideally, we would want to include a small dataset that can be used to run all tests in the package.
I also noticed that it might be more appropriate to put some of the new functions in the io module, as they seem to be specific to importing ObsPy data. This might be a little confusing, but with the current design mspasspy.db should be used for generic database operations, and mspasspy.io.XXX should be used to import and export data in XXX format. That's why there is a mspasspy.io.seispp, and we should plan for a list such as mspasspy.io.obspy, mspasspy.io.su, mspasspy.io.sac, etc. The stationxml and quakexml handling may also belong there.
I think nothing there belongs in io. The test program uses the obspy io modules to read xml files; I don't remember the names. The functions I put in db.py save or restore obspy Inventory and Catalog objects. There is no file io in any of them.
Sorry that I didn't state it clearly, but I'm not saying all the new ObsPy-related functions belong in the io module. I was actually thinking those functions may need to be taken apart. For example, the save_catalog function actually does two things: pickle and parse an ObsPy Event object into a dictionary, and save that dictionary to the database. The second is a generic database operation; the first is something unique to ObsPy.
The purpose of the io module is not literally file io, but really to deal with importing, converting, and exporting different formats that are not ccore or Python built-in types. That's why I was thinking some of that functionality should be in mspasspy.io.obspy.
However, having now thought more about it, I see the most important question to answer first is to what degree mspasspy should depend on ObsPy. If we treat ObsPy as the core of the package, then it makes sense to put those functions under the db module. If we see it as a third-party package that is potentially swappable with something similar, then it should not be treated differently from seispp or sac, and we already have a mspasspy.io.seispp.
I actually think it may be reasonable to make ObsPy the core of mspasspy for now, as it really is powerful and handy for most seismic data handling problems. Actually, it would make no sense not to leverage ObsPy in mspasspy. The only concern is that if we want to generalize mspass to other domains later, the tight coupling with ObsPy may become baggage - not sure if that's really something to worry about at such an early stage.
Anyway, just want to hear your ideas on the design of the modules. I am completely fine with either way.
I am increasingly convinced obspy should be considered a core component of mspass. It has clearly become a common tool used by many seismologists, so building on it will only promote what we are doing. We could, in fact, advertise that one of the potential applications of mspass is adding database and parallel functionality to obspy. They have a whole lot of algorithms implemented that people can immediately benefit from. I really do think treating mspass as an obspy extension will make it more popular. We already have huge obspy dependencies.
I probably should state how I think the new functions I checked in here should be used. dbload_inventory should provide a way to build comprehensive, full response information for any data downloaded by web services. If you look at how it runs, it checks each entry it finds for a match on net:sta:chan:loc and time:endtime. It will not add a new document to site if there is a match. This is nice because I am not sure how well obspy handles the nasty ondate:offdate problems, which they cast as the attributes start_date and end_date. Their documentation is not clear on this point, but I think a stationxml file has ondate-offdate embedded, and the format allows multiple ondate-offdate combinations in one file. The model I'm testing (running right now and will be for many days) is downloading yearly blocks of lower-48 broadband data over the usarray deployment period. I think I can take the stationxml files produced for each year and build a clean site collection that has no duplicates. It will be a while before we know that for sure.
The way I see it, this site collection is the foundation for building clean receiver coordinate data in all wf collections. We'll have to produce a different version of normalize_size for that class of data.
Source data (source collection = obspy Catalog) will be a tougher nut to crack perfectly. The reason is that the general problem of defining when two source estimates are equivalent is not trivial at all scales. Sloppy solutions aren't hard, but the general problem is full of pitfalls.
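To make that pitfall concrete, here is a sketch of the kind of sloppy solution that is easy to write (a hypothetical helper; the thresholds dx and dt are arbitrary, and that arbitrariness is exactly the problem):

```python
def sources_equivalent(s1, s2, dx=0.5, dt=10.0):
    """Crude test: treat two source documents as the same event if their
    epicenters are within dx degrees and origin times within dt seconds.
    Easy to implement, but any fixed threshold fails at some scale -
    e.g. aftershock sequences spaced more closely than dx."""
    return (abs(s1["source_lat"] - s2["source_lat"]) <= dx
            and abs(s1["source_lon"] - s2["source_lon"]) <= dx
            and abs(s1["source_time"] - s2["source_time"]) <= dt)
```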
First, I like what you did to define a Database class for python for the mongodb interactions. This was prompted by something different though.
I wrote what I think will be a useful tutorial for Metadata, MetadataDefinitions, and AntelopePf objects. The tutorial is a jupyter notebook and the exercise has really made me a fan of that tool. What I realized, however, in describing those 3 core elements is that there is a hole that we need to fill. The hole is clear, but what I would like feedback on is how the hole should be filled.
I was writing a little section on aliases and how they are defined by MetadataDefinitions. When I looked back on this, it made me realize how powerful this mechanism could be for building cross-connecting tools with packages like SAC or various matlab implementations folks out there have developed. What became obvious, however, was that we need a fast and easy tool to do two operations:
We have a mechanism to do 2 during a load operation, but we don't have an easy way to do something similar in a workflow. Imagine, for example, someone wanting to use SAC for some sequence of steps, and assume we developed some clean interface methodology to move data to and from SAC and to drive processing. If the workflow had steps before and after that needed the aliases cleared, however, the user has a problem.
The need is clear as a core algorithm, BUT how to implement and in what language is anything but clear. My first thought had two alternatives:
Opinions and ideas?
I also have some thoughts about MetadataDefinitions from writing the tests. The issue I found is that in the MetadataDefinitions.type(string key) call, when the key is an alias, it won't return the correct type but throws an error into elog instead. This happens a lot when converting a dict from ObsPy into Metadata. So, if we have an apply_aliases function, this issue could be resolved. However, this function ought to be able to take not only Metadata as input, but also the Python dict type. If so, we will probably have to have it, or at least part of it, implemented in Python. If we ignore the dict type, then I think option 1 is the cleanest solution here.
btw, for option 2, I don't think we need separate functions for TimeSeries and Seismogram. Although I don't know the mechanism offhand, I do think Python has a similar way to handle such polymorphism.
Since MetadataDefinitions is a hybrid python-C++ thing anyway, I don't think there is any intrinsic problem having it handle a dict. I already know how to handle that, although I'm pretty sure the "devil is in the details" in this case.
We may want the Metadata and dict versions to behave differently. I can see good reasons to have a method or procedure that returns a dict with all aliases replaced by the unique key, which is what we need for database saves. The other case is the one that came to mind to create this dialogue - needing to turn aliases on and off within a workflow. In that case, the method/procedure should modify Metadata in place. I think you are right that python through pybind11 will correctly handle multiple inheritance as long as the construct is not intrinsically ambiguous (e.g. a bad design that has methods with the same name in two parents). I think handling polymorphism is the main reason pybind11 requires these confusing trampoline classes to define the wrappers. I confess to not fully understanding why trampoline classes are needed and what they actually generate for a data structure to implement RTTI (Run-Time Type Identification) in python. I just know it is essential to create a trampoline class if you have a base class with virtual methods or polymorphism will not work correctly on the python side.
Let me know your thoughts on this. I'll think about a design before I do any coding. Look for updates in this issues section.
I decided to jot down my design ideas in this issues section. I'm starting with an assumption we should implement this as a revision to MetadataDefinitions. I think we could implement this with three and maybe only two methods:
int MetadataDefinitions::apply_aliases(Metadata& d, py::list aliases);
with overload
int MetadataDefinitions::apply_aliases(Metadata& d, std::list<std::string> aliases);
Update a Metadata object d to apply the alias names defined in aliases. Note we only need a list because the is_alias and unique_id methods can test if an alias is valid and treat illegal values as an error. In this design it would return the number of entries actually changed. It might be better to return a py::list of any failures or an ErrorLogger object. I'm not sure the overload is needed, actually, as I think an std::list in C++ is mapped directly to a py::list by pybind11.
The inverse should be something like this: int MetadataDefinitions::clear_aliases(Metadata& d); This would convert all aliases in d, replacing each alias key with the unique key for the attribute, and return the number of items changed. As above, we might want something more elaborate like an ErrorLogger object or a list of any failures (it can fail if a key is undefined in the MetadataDefinitions).
Note that using the reference and modifying Metadata in place is essential. If an application for some reason needs to retain the original, it can make a deep copy easily with this construct:
mdcopy=Metadata(d)
where d could be an actual Metadata, TimeSeries, or Seismogram object.
That looks good to me. One thing to consider down the road is to have default lists of aliases for different packages, or we could just define something like int MetadataDefinitions::apply_obspy_aliases(Metadata& d); or int MetadataDefinitions::apply_sac_aliases(Metadata& d);. Not something important for now, but it will definitely be handy for a lot of users.
Good point, but I think there is a simpler solution for those two examples: we could build a pair of functions with these signatures in python:
def sac_aliases():
def obspy_aliases():
Each would return a dict that could be passed to the apply_aliases method.
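For instance, sac_aliases might be sketched like this (the entries in the mapping are illustrative guesses, not a vetted table; a real version would be generated from the yaml schema definitions):

```python
def sac_aliases():
    """Return a mapping from internal unique keys to SAC header aliases.
    Entries here are illustrative only - a production table belongs in
    the yaml file that MetadataDefinitions reads."""
    return {
        "site_lat": "stla",    # station latitude
        "site_lon": "stlo",    # station longitude
        "site_elev": "stel",   # station elevation
        "source_lat": "evla",  # event latitude
        "source_lon": "evlo",  # event longitude
    }
```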
I think I will write the methods mentioned above to return an ErrorLogger object. That would make it easy for the caller to append the messages to the elog of a Seismogram or TimeSeries object.
btw, one thing I realized when writing the tests is that the copy constructors were not implemented through the python bindings. This should be something easy to fix though.
I'll need to look into this. None of the examples I remember seeing in the pybind11 documentation show a wrapper for a copy constructor. operator= does not need to be wrapped, so I had made an assumption a copy constructor was also implicitly wrapped. What I realize after you say this, however, is that operator= by default may not create a deep copy like it does in C++, but may just pass a PyObject pointer from a constructor, i.e. we may need to have wrappers for copy constructors as the standard way to make a deep copy.
I will test this when I get to implementing changes to MetadataDefinitions. Thanks for pointing this out.
I went ahead and implemented a generic alias handling method in MetadataDefinitions. The API is a little different than mentioned above, but not hugely. A key design decision I had to make, perhaps worth a discussion here, is copy versus move semantics for applying an alias. I chose move semantics, i.e. the new apply_aliases method takes the data associated with an old key, stores the same data under the new key, and then deletes the data associated with the old key. An alternative would have been copy semantics, where the new key gets a copy of the old key's data. In my opinion move is a better model than copy here for one simple reason: if one applies a set of aliases and then sets data associated with an alias key, the old and new values will be out of sync. If the user isn't careful to clear their aliases before moving on to other processes, they could see mysterious behavior where an attribute they tried to change didn't change. The bad side of move semantics is that if the user doesn't clear the aliases before running another step that doesn't know about those aliases, the job would abort. In my opinion, however, stopping on a clear error is better than creating mysterious errors that are hard to sort out downstream. I welcome your opinion on this. I thought about allowing either, but decided that would only add an unnecessary layer of complexity. I think it better to just be clear on this point in the documentation.
This also required a change to Metadata. I added a change_key method that does the above operation. MetadataDefinitions just handles the alias namespace. Incidentally, I also made it dogmatic about alias registration: it will silently refuse to implement an alias that is not registered in MetadataDefinitions. This means we are going to have to create a more extensive set of aliases in the yaml file.
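In dict terms, the move semantics amount to something like this (a pure-Python sketch of the behavior, not the actual C++ implementation; names mirror the methods discussed above):

```python
def change_key(md, oldkey, newkey):
    """Move the value stored under oldkey to newkey.  Move semantics:
    the old entry is deleted, so the two keys can never get out of sync."""
    if oldkey in md:
        md[newkey] = md.pop(oldkey)


def apply_aliases(md, aliases):
    """aliases maps unique key -> alias key.  Modifies md in place and
    returns the number of entries actually changed."""
    nchanged = 0
    for unique, alias in aliases.items():
        if unique in md:
            change_key(md, unique, alias)
            nchanged += 1
    return nchanged
```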
I have the new apply_aliases and clear_aliases methods working for MetadataDefinitions. The only testing I've done is a set of lines in the jupyter notebook tutorial on Metadata. I just checked in those revisions. There are some residual issues discussed in this thread that we should try to resolve before we close it:
md=Metadata()
# assignment - I think this just creates a pointer to the same data and uses reference counting
# to manage memory
md2=md
# a copy constructor like this should create a deep copy
md3=Metadata(md)
I need to test that this is correct, but perhaps you can educate me and avoid some wasted time.
I confirmed that the idea I had above is correct, i.e. assignment only creates a pointer to the same data. A copy constructor is needed to create a deep copy. That is a mismatch with C++, where assignment normally creates a new copy. I did not realize that and may have created some errors in python functions I wrote due to this misunderstanding.
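The same semantics can be demonstrated with a plain dict, which is how the wrapped Metadata behaves on the python side (key names here are just examples):

```python
import copy

d = {"sta": "AAK", "site_lat": 42.6}
alias_view = d                # assignment: a second name for the SAME object
alias_view["site_lat"] = 0.0
assert d["site_lat"] == 0.0   # the change is visible through both names

deep = copy.deepcopy(d)       # an independent deep copy, like Metadata(d)
deep["site_lat"] = 42.6
assert d["site_lat"] == 0.0   # the original is unaffected
```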
I will now go through and add copy constructors to the pybind11 wrappers. They are clearly needed.
Copy constructors have been added, but none have been tested in python. Unclear if that is necessary as the C++ code has been pretty well exercised, although not systematically.
That's great! Yes, in Python, assignment is always assigning a reference. That's probably the most confusing part for C programmers. btw, do you mean that there could be such bugs in the Python code? Those could be difficult to find, but I guess that's why we need the tests.
Yes, I did mean there may be errors in the python code from my misunderstanding of that issue.
Had some time this morning, and that exercise with copy constructors led me to spend the couple of hours I had working on the Ensemble objects. Found some bugs in building the pybind11 wrappers I hadn't detected, as I hadn't really touched the Ensemble implementation before. Once I remembered a couple of tricks and read one page in the pybind11 issues, it turned out to be easy to implement. Ensemble is actually a template class, but I used a feature I hadn't fully used before in pybind11 to implement template classes. It was pretty slick, but since the
I put this on this issues page because it isn't quite clear to me yet how we should implement Ensembles with mongo. Seems we should have load and save methods for ensembles in your new Database object, but I haven't quite thought through how an ensemble should be defined with Mongo. Probably we would want the load method to pass some query to Mongo as the generic method, but perhaps it would make sense to build some common queries to simplify use for a user. e.g. a couple of obvious ones would be load_event and load_station.
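As a concrete sketch of what those convenience wrappers could look like (all names here are hypothetical; the generic method just forwards a raw Mongo query dict, and a real reader would convert the documents to TimeSeries/Seismogram objects rather than returning raw dicts):

```python
def load_ensemble(dbcol, query):
    """Generic form (sketch): return all wf documents matching query."""
    return list(dbcol.find(query))


def load_event(dbcol, source_id):
    # common case: all waveforms linked to one source document
    return load_ensemble(dbcol, {"source_id": source_id})


def load_station(dbcol, net, sta):
    # common case: all waveforms recorded at one station
    return load_ensemble(dbcol, {"net": net, "sta": sta})
```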
While waiting for the IU sysadmins to install the new Totalview program for debugging mixed C and python code, which is what we have, I've been playing with obspy and considering how we might design a reasonably generic tool to assemble a dataset with current-generation web services. We don't want to get too buried in this problem, as there are too many special cases to cover everything, but I think we should strive to make a tool generic enough to cover a reasonable share of cases. A first example of that, and my selfish interest, is a focus on events. I consider the continuous data management problem solved with antelope or variants of a script in the obspy tutorial. So, we should focus on what should be in a tool for assembling an event-based dataset. I want to bounce around some ideas before I try any serious implementation.
The first issue is to be clear about what we need to assemble to build a raw dataset. This wasn't always true, but the issues are now well sorted out into three separate problems:
So some key generalizations I think can help us make a tool like this most generic:
As the last sentence says, the next step is how we might actually implement a prototype. obspy has made some huge advances on these issues in the past few years. The core tool they now have to make this more efficient is the mass downloader. It provides some of the generic functionality, but not all. It is, however, the clear starting point. Overall, the process will need to pretty much define each of the three items above:
This strategy would reduce the algorithm to a loop over events with one web service request per event. Each request would append data as files into two directories: (1) the waveforms would be created as one file per channel, ala SAC, which by default all land in one directory (big problem), and (2) station data gets pushed as stationXML files into another directory (the default is full response info, which is probably advisable).
This strategy should work, but I know with absolute certainty it has two issues we need to resolve:
Finally: we should probably have the current version of obspy (1.2.0) in the master container and for all development. I didn't check when that version was released, but I noticed the versions I had on all my machines were older until I did an update. I think there were improvements to the web service code in this new version.