Open pavlis opened 4 years ago
Brief followup here from our phone conversation for the record.
I looked at the MongoDB documentation and found I was wrong about BSON supporting an XML data type. However, I think the model for handling metadata in our source and site documents we discussed is still feasible and in fact prudent. Let me summarize it here for the record.
Please add to this if you have any additional insights.
Another item related to this. I just ran across this in the obspy 1.2 documentation. I think it is new in 1.2 and is the result of their Jane development. Here is the documentation page for the current features. It is not at all clear to me what is needed to make this work. This appears to be the way they are handling variants in metadata for the same station or event. Look especially at the client section.
Wanted to show you this as I think it demonstrates the idea. This will work, but it uses a dumb algorithm that simply writes the contents blindly into the database site collection without checking whether an entry already exists. I can implement that check pretty quickly and will try to do so today, in fact, but wanted to pass this along to you. The docstring says this, but basically this function creates an image of an obspy Inventory object in MongoDB, pulling what I viewed as universally necessary attributes as named key:value pairs. The obspy Inventory representation of the original xml file is stored with pickle.dumps. That should allow restoring the full inventory data faster than parsing the raw stationxml file. I've tested it with pickle.loads and it is lightning fast, as we would expect. Anyway, here is the prototype code for you to play with if you choose:
import pickle

import pymongo
from obspy import UTCDateTime


def _extract_edepths(chanlist):
    """
    Parses the list returned by the obspy channels attribute
    of a Station object and returns a dict of unique
    edepth values keyed by loc code.  This algorithm
    would be horribly inefficient for large lists with
    many duplicates, but the assumption here is the list
    will always be small.
    """
    alllocs = {}
    for chan in chanlist:
        alllocs[chan.location_code] = chan.depth
    return alllocs


def dbsave_inventory(db, inv, firstid=-1):
    """
    Saves the contents of all components of an obspy Inventory
    object as documents in the site collection.  No checking
    for duplicates is performed, so every component of the
    input inventory will generate a new document in the collection.
    The design depends on using some other mechanism to
    sort out pure duplicates from legitimate station definitions
    that handle issues like orientation changes of components,
    gain changes, etc.  We emphasize this point because this
    function should be used with care for large data sets.
    Careless application to the output of bulk downloads
    with obspy web services could generate large numbers
    of unnecessary duplicates.  Use this function only when
    the inventory object(s) to be saved do not have excessive
    numbers of duplicates.

    :param db: is a top level MongoDB database handle.  The
      algorithm immediately uses it to look up the site collection.
    :param inv: is the obspy Inventory object of station data to save.
    :param firstid: sets the initial value for the site_id internal
      integer id key.  If negative (the default) it will be
      extracted from the database.  If the existing site
      collection is huge that can be inefficient, but
      using anything but the default should rarely be needed.
    :return: integer giving the number of documents saved.
    """
    # This constant is used below to set endtime to a time
    # in the far future if it is null
    DISTANTFUTURE = UTCDateTime(2051, 1, 1, 0, 0)
    # site is a frozen name for the collection here.  Perhaps
    # it should be a variable with a default
    dbcol = db.site
    if firstid < 0:
        x = dbcol.find_one(sort=[("site_id", pymongo.DESCENDING)])
        if x is None:
            # empty collection - start the counter at 0
            site_id = 0
        else:
            site_id = x["site_id"] + 1
    else:
        site_id = firstid
    nsaved = 0
    for x in inv:
        # The Inventory object I got from a web service download
        # makes the entries here net:sta combinations.
        # We can get the net code like this
        net = x.code
        # Each x now has a stations field, BUT tests I ran
        # say for my example that field has one entry per
        # x.  Hence, we can get the sta name like this
        y = x.stations
        sta = y[0].code
        starttime = y[0].start_date
        if starttime is None:
            print('station ', sta, ' does not have starttime defined.  Set to epoch 0')
            starttime = UTCDateTime(0.0)  # epoch 0 time
        endtime = y[0].end_date
        if endtime is None:
            print('station ', sta, ' has no endtime defined.  Set to distant future')
            endtime = DISTANTFUTURE
        latitude = y[0].latitude
        longitude = y[0].longitude
        # stationxml files seem to put elevation in m.  We
        # always use km so we need to convert
        elevation = y[0].elevation / 1000.0
        # loc codes may have different edepths, which
        # obspy sets as a depth attribute.  This little
        # function returns a dict keyed by loc with edepths
        chans = y[0].channels
        edepths = _extract_edepths(chans)
        # Assume loc code of 0 is same as rest
        # loc = _extract_loc_code(chanlist[0])
        picklestr = pickle.dumps(x)
        all_locs = edepths.keys()
        for loc in all_locs:
            # build a fresh document each pass - insert_one adds an _id
            # key to the dict it is given, so reusing one dict would fail
            rec = {}
            rec['loc'] = loc
            rec['site_edepth'] = edepths[loc]
            rec['net'] = net
            rec['sta'] = sta
            rec['site_lat'] = latitude
            rec['site_lon'] = longitude
            rec['site_elev'] = elevation
            rec['site_starttime'] = starttime.timestamp
            rec['site_endtime'] = endtime.timestamp
            # needed and currently lacking - a way to get a unique
            # integer site_id - for now do it the easy way
            rec['site_id'] = site_id
            rec['serialized_inventory'] = picklestr
            dbcol.insert_one(rec)
            nsaved += 1
            site_id += 1
    return nsaved
I just noticed the docstring statement in that code that says the idea would be to filter such things out later. I've changed my mind: because this is done only on small data sets, the check can be done inline with little to no pain. As noted earlier, that is what I'll try to implement today.
Made some progress on this yesterday. I have a fully functional function to load quakexml files into the source collection and stationxml files into the site collection. I decided to make the site version avoid duplicates based on the key net:sta:loc:ondate:offdate, i.e. it tests whether a document with those 5 keys already exists and will not add a new document if one has the same values. Those are the same keys as css3.0 and seem to be what FDSN is using. The source collection is different. It seemed to me that, to provide the same functionality as css3.0, which allows multiple origins for the same "event", we should allow duplicates. There I do no checking and assume linking data to the right document in the source collection is a secondary problem. I am certain this will require some changes to normalize_source to deal with the new structure.
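The five-key duplicate test can be sketched in pure Python (a hypothetical helper; with a real MongoDB collection the same test would be a five-key query with count_documents):

```python
def site_is_duplicate(existing_docs, doc):
    """Return True if a document with the same net:sta:loc:ondate:offdate
    values already exists.  existing_docs is any iterable of dicts here;
    against MongoDB the equivalent is a five-key query on site."""
    keys = ("net", "sta", "loc", "site_starttime", "site_endtime")
    return any(all(old.get(k) == doc.get(k) for k in keys)
               for old in existing_docs)
```

With pymongo the same check would be something like `db.site.count_documents({k: doc[k] for k in keys}) > 0` before calling insert_one.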
I am close to finishing this, but have a question. It seems you wanted to rearrange the python module directories. Did you check that change into master? Also, where these belong in the tree of names for import is not so clear to me. I am almost finished testing six primary core python functions:
These provide most of the functionality we should need for source and receiver data. There are some bells and whistles for embedded queries but those will probably be used only by you and me and users who know mongodb queries.
So where should I put these? Clearly somewhere in the mongodb directory if you retain that, but what should the module they live in be called? Should we rename one of the existing files and put them together?
Note - I probably won't be able to act on this for a couple of days but would like to know your opinion to avoid excessive commits of later reorganizations.
That is a tricky question. So, I have the work in progress on a new branch named reorganize_python. The organization basically follows what I listed in #54. I think everything you listed here belongs in mspasspy.db. If we put them there, we should omit all the "db" prefixes in the function names. I think you should be able to just add these functions there, as I am done working on that module.
Ok, step one will be changing names like dbload_inventory to load_inventory. Might get this checked in today, but it depends on how smoothly debugging the last function (load_catalog) goes.
That went faster than I expected. How do you want me to check this in? Thinking I should pull the reorganize_python branch and add the new functions to the db module.
Other question, however: where should I install the test program? It has data files that go with it. I suppose I could automate the test and have the files downloaded by web services, but that could make the test really lengthy, as the data files I use for stations came from an event download of all lower-48 broadbands. I could change the scale of the web service request and still have a reasonable program for a test suite. We DO need a mechanism for a clean test suite for python modules like this. I await your guidance on where to put this and how to handle this and other test suite programs.
I am done with the reorganizing, so it is safe for you to check in those changes to the reorganize_python branch. Tests are the next thing I will be working on. For now, just check in what you have into the test directory; I will try to get all tests working under some modern framework. Not quite sure how to do that yet, but I at least know where to look for tutorials on that topic.
I placed the new functions in the db.py module and a test program in mspass/python/tests. The revisions are on the reorganize_python branch. The test program has the verbose name test_inv_cat_dbfunctions.py.
This test program is particularly problematic because it is driven by two files created using obspy's web services modules. I did not want to muddy up the repository with a set of data files that were messy and probably much larger than needed. It should be possible to reduce the test to a handful of stationxml files concatenated into a single test file and a quakexml file produced similarly. The first step, however, is to make sure you can reproduce this test in some approximate form.
To run the test you will need to do two things. First, here is the script I used to test obspy's mass downloader. It is a minor variant of the example (see here) they give in their documentation:
import obspy
from obspy.clients.fdsn.mass_downloader import RectangularDomain, \
    Restrictions, MassDownloader

domain = RectangularDomain(minlatitude=20.0, maxlatitude=54.0,
                           minlongitude=-135, maxlongitude=-50)
# This is from the obspy tutorial for the Tohoku event
origin_time = obspy.UTCDateTime(2011, 3, 11, 5, 47, 32)
restrictions = Restrictions(
    starttime=origin_time - 15 * 60,
    endtime=origin_time + 3600.0,
    reject_channels_with_gaps=True,
    minimum_length=0.95,
    minimum_interstation_distance_in_m=100.0,
    channel_priorities=["HH[ZNE12]", "BH[ZNE12]"],
    location_priorities=["", "00", "10"]
)
mdl = MassDownloader()
mdl.download(domain, restrictions, mseed_storage="wf",
             stationxml_storage="sitexml")
Note this will produce both miniseed files in wf and stationxml data in the sitexml directory. The test only uses the sitexml data.
The final thing is to create a test file of quakexml data from web services. I did that with a one-line command, but I'm having trouble finding the exact tutorial page that shows the process. I think you will need to use obspy's get_events function with the right incantation to tell it to save the results to a file in xml format.
When that is done you will need to manually edit the file names in mspass/python/tests/test_inv_cat_dbfunctions.py to point at the data you downloaded. I reiterate that for a test suite program we should limit the size of the station and source xml files and just put two much smaller files in the tests directory. You will quickly learn that there is a large overhead in cracking the xml files. The data I was using had about 1200 stations and 2700 events; both reads took tens of seconds to complete. Not awful, but not something you want to do all the time. The db load routines are lightning fast by comparison, even when reconstructing the full Inventory and Catalog objects with pickle.loads. The parsers for the xml data are clearly a nasty barrier.
I just realized you may need to tweak the test program depending on how you select events. I have queries to test retrieving a single event that use source_id values from my example.
Cool! I can see the test dataset will become a problem if we don't standardize how we do the tests. Ideally, we would want to include a small dataset that can be used to run all tests in the package.
I also noticed that it might be more appropriate to put some of the new functions in the io module, as they seem to be specific to importing ObsPy data. This might be a little confusing, but with the current design mspasspy.db should be used for generic database operations, and mspasspy.io.XXX should be used to import and export data in XXX format. That's why there is a mspasspy.io.seispp, and we should plan for a list such as mspasspy.io.obspy, mspasspy.io.su, mspasspy.io.sac, etc. The stationxml and quakexml handling may also belong there.
I think nothing there belongs in io. The test program uses the obspy io modules to read xml files; I don't remember the names. The functions I put in db.py save or restore obspy Inventory and Catalog objects. There is no file io in any of them.
Sorry that I didn't state it clearly, but I'm not saying all the new ObsPy-related functions belong in the io module. I was actually thinking those functions may need to be taken apart. For example, the save_catalog function actually does two things: pickle and parse an ObsPy Event object into a dictionary, and save that dictionary to the database. The second is a generic database operation; the first is something unique to ObsPy.
The purpose of the io module is not literally file io, but really to deal with importing, converting, and exporting different formats that are not ccore or Python built-in types. That's why I was thinking some of that functionality should be in mspasspy.io.obspy.
However, having now thought more about it, I see the most important question to answer first is to what degree mspasspy should depend on ObsPy. If we treat ObsPy as the core of the package, then it makes sense to put those functions under the db module. If we see it as a third-party package that is potentially swappable with something similar, then it should not be treated differently from seispp or sac, and we already have a mspasspy.io.seispp.
I actually think it may be reasonable to make ObsPy the core of mspasspy for now, as it really is powerful and handy for most seismic data handling problems. Actually, it would make no sense not to leverage ObsPy in mspasspy. The only concern is that if we want to generalize mspass to other domains later, the tight coupling with ObsPy may become baggage - not sure if that's really something to worry about at such an early stage.
Anyway, just want to hear your ideas on the design of the modules. I am completely fine with either way.
I am increasingly convinced obspy should be considered a core component of mspass. It has clearly become a common tool used by many seismologists, so building on it will only promote what we are doing. We could, in fact, advertise that one of the potential applications of mspass is adding database and parallel functionality to obspy. They have a whole lot of algorithms implemented that people can immediately benefit from. I really do think treating mspass as an obspy extension will make it more popular. We already have huge obspy dependencies.
I probably should state how I think the new functions I checked in here should be used. dbload_inventory should provide a way to build comprehensive, full response information for any data downloaded by web services. If you look at how it runs, it checks each entry it finds for a match on net:sta:chan:loc and time:endtime. It will not add a new document to site if there is a match. This is nice because I am not sure how well obspy handles the nasty ondate:offdate problems, which they cast as the attributes start_date and end_date. Their documentation is not clear on this point, but I think a stationxml file has ondate-offdate embedded, and the format allows multiple ondate-offdate combinations in one file. The model I'm testing (running right now and will be for many days) is downloading yearly blocks of lower-48 broadband data over the usarray deployment period. I think I can take the stationxml files produced for each year and build a clean site collection that has no duplicates. It will be a while before we know that for sure.
The way I see it, this site collection is the foundation for building clean receiver coordinate data in all wf collections. We'll have to produce a different version of normalize_size for that class of data.
Source data (source collection = obspy Catalog) will be a tougher nut to crack perfectly. The reason is that the general problem of defining when two source estimates are equivalent is not trivial at all scales. Sloppy solutions aren't hard, but the general problem is full of pitfalls.
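To make that pitfall concrete, here is a sketch of the kind of sloppy solution that is easy to write (a hypothetical helper; the thresholds dx and dt are arbitrary, and that arbitrariness is exactly the problem):

```python
def sources_equivalent(s1, s2, dx=0.5, dt=10.0):
    """Crude test: treat two source documents as the same event if their
    epicenters are within dx degrees and origin times within dt seconds.
    Easy to implement, but any fixed threshold fails at some scale -
    e.g. aftershock sequences spaced more closely than dx."""
    return (abs(s1["source_lat"] - s2["source_lat"]) <= dx
            and abs(s1["source_lon"] - s2["source_lon"]) <= dx
            and abs(s1["source_time"] - s2["source_time"]) <= dt)
```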
First, I like what you did to define a Database class for python for the mongodb interactions. This was prompted by something different though.
I wrote what I think will be a useful tutorial for Metadata, MetadataDefinitions, and AntelopePf objects. The tutorial is a jupyter notebook and the exercise has really made me a fan of that tool. What I realized, however, in describing those 3 core elements is that there is a hole that we need to fill. The hole is clear, but what I would like feedback on is how the hole should be filled.
I was writing a little section on aliases and how they are defined by MetadataDefinitions. When I looked back on this, it made me realize how powerful this mechanism could be for building cross-connecting tools with packages like SAC or various matlab implementations folks out there have developed. What became obvious, however, was that we need a fast and easy tool to do two operations:
We have a mechanism to do 2 during a load operation, but we don't have an easy way to do something similar in a workflow. Imagine, for example, someone wanting to use SAC for some sequence of steps, and assume we developed some clean interface methodology to move data to and from SAC and to drive processing. If the workflow had steps before and after that needed the aliases cleared, however, the user has a problem.
The need is clear as a core algorithm, BUT how to implement and in what language is anything but clear. My first thought had two alternatives:
Opinions and ideas?
I also have some thoughts about MetadataDefinitions from writing the tests. The issue I found is that in the MetadataDefinitions.type(string key) call, when the key is an alias, it won't return the correct type but throws an error into elog instead. This happens a lot when converting a dict from ObsPy into Metadata. So, if we have an apply_aliases function, this issue could be resolved. However, this function ought to be able to take not only Metadata as input, but also the Python dict type. If so, we will probably have to have it, or at least part of it, implemented in Python. If we ignore the dict type, then I think option 1 is the cleanest solution here.
btw, for option 2, I don't think we need separate functions for TimeSeries and Seismogram. Although I don't know the mechanism offhand, I do think Python has a similar way to handle such polymorphism.
Since MetadataDefinitions is a hybrid python-C++ thing anyway, I don't think there is any intrinsic problem having it handle a dict. I already know how to handle that, although I'm pretty sure the "devil is in the details" in this case.
We may want the Metadata and dict versions to behave differently. I can see good reasons to have a method or procedure that returns a dict with all aliases replaced by the unique key, which is what we need for database saves. The other case is the one that came to mind to create this dialogue - needing to turn aliases on and off within a workflow. In that case, the method/procedure should modify Metadata in place. I think you are right that python through pybind11 will correctly handle multiple inheritance as long as the construct is not intrinsically ambiguous (e.g. a bad design that has methods with the same name in two parents). I think handling polymorphism is the main reason pybind11 requires these confusing trampoline classes to define the wrappers. I confess to not fully understanding why trampoline classes are needed and what they actually generate for a data structure to implement RTTI (Run-Time Type Identification) in python. I just know it is essential to create a trampoline class if you have a base class with virtual methods or polymorphism will not work correctly on the python side.
Let me know your thoughts on this. I'll think about a design before I do any coding. Look for updates in this issues section.
I decided to jot down my design ideas in this issues section. I'm starting with an assumption we should implement this as a revision to MetadataDefinitions. I think we could implement this with three and maybe only two methods:
int MetadataDefinitions::apply_aliases(Metadata& d, py::list aliases);
with overload
int MetadataDefinitions::apply_aliases(Metadata& d, std::list<std::string> aliases);
Update a Metadata object d to apply the alias names defined in aliases. Note we only need a list because the is_alias and unique_id methods can test if an alias is valid and treat illegal values as an error. In this design it would return the number of entries actually changed. It might be better to return a py::list of any failures or an ErrorLogger object. I'm not sure the overload is needed, actually, as I think an std::list in C++ is mapped directly to a py::list by pybind11.
The inverse should be something like this: int MetadataDefinitions::clear_aliases(Metadata& d); This would convert all aliases in d, replacing each alias key with the unique key for the attribute, and return the number of items changed. As above, we might want something more elaborate like an ErrorLogger object or a list of any failures (it can fail if a key is undefined in the MetadataDefinitions).
Note that using the reference and modifying Metadata in place is essential. If an application for some reason needs to retain the original, it can make a deep copy easily with this construct:
mdcopy=Metadata(d)
where d could be an actual Metadata, TimeSeries, or Seismogram object.
That looks good to me. One thing to consider down the road is to have default lists of aliases for different packages, or we could just define something like int MetadataDefinitions::apply_obspy_aliases(Metadata& d); or int MetadataDefinitions::apply_sac_aliases(Metadata& d);. Not something important for now, but it will definitely be handy for a lot of users.
Good point, but I think there is a simpler solution for those two examples: we could build a pair of functions with these signatures in python:
def sac_aliases():
def obspy_aliases():
Each would return a dict that could be passed to the apply_aliases method.
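For instance, sac_aliases might be sketched like this (the entries in the mapping are illustrative guesses, not a vetted table; a real version would be generated from the yaml schema definitions):

```python
def sac_aliases():
    """Return a mapping from internal unique keys to SAC header aliases.
    Entries here are illustrative only - a production table belongs in
    the yaml file that MetadataDefinitions reads."""
    return {
        "site_lat": "stla",    # station latitude
        "site_lon": "stlo",    # station longitude
        "site_elev": "stel",   # station elevation
        "source_lat": "evla",  # event latitude
        "source_lon": "evlo",  # event longitude
    }
```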
I think I will write the methods mentioned above to return an ErrorLogger object. That would make it easy for the caller to append the messages to the elog of a Seismogram or TimeSeries object.
btw, one thing I realized when writing the tests is that the copy constructors were not implemented through the python bindings. This should be something easy to fix though.
I'll need to look into this. None of the examples I remember seeing in the pybind11 documentation show a wrapper for a copy constructor. operator= does not need to be wrapped, so I had made an assumption a copy constructor was also implicitly wrapped. What I realize after you say this, however, is that operator= by default may not create a deep copy like it does in C++, but may just pass a PyObject pointer from a constructor, i.e. we may need to have wrappers for copy constructors as the standard way to make a deep copy.
I will test this when I get to implementing changes to MetadataDefinitions. Thanks for pointing this out.
I went ahead and implemented a generic alias handling method in MetadataDefinitions. The API is a little different than mentioned above, but not hugely. A key design decision I had to make, perhaps worth a discussion here, is copy versus move semantics for applying an alias. I chose move semantics, i.e. the new apply_aliases method takes the data associated with an old key, stores the same data under the new key, and then deletes the data associated with the old key. An alternative would have been copy semantics, where the new key gets a copy of the old key's data. In my opinion move is a better model than copy here for one simple reason: if one applies a set of aliases and then sets data associated with an alias key, the old and new values will be out of sync. If the user isn't careful to clear their aliases before moving on to other processes, they could see mysterious behavior where an attribute they tried to change didn't change. The bad side of move semantics is that if the user doesn't clear the aliases before running another step that doesn't know about those aliases, the job would abort. In my opinion, however, stopping on a clear error is better than creating mysterious errors that are hard to sort out downstream. I welcome your opinion on this. I thought about allowing either, but decided that would only add an unnecessary layer of complexity. I think it better to just be clear on this point in the documentation.
This also required a change to Metadata. I added a change_key method that does the above operation. MetadataDefinitions just handles the alias namespace. Incidentally, I also made it dogmatic about alias registration: it will silently refuse to implement an alias that is not registered in MetadataDefinitions. This means we are going to have to create a more extensive set of aliases in the yaml file.
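In dict terms, the move semantics amount to something like this (a pure-Python sketch of the behavior, not the actual C++ implementation; names mirror the methods discussed above):

```python
def change_key(md, oldkey, newkey):
    """Move the value stored under oldkey to newkey.  Move semantics:
    the old entry is deleted, so the two keys can never get out of sync."""
    if oldkey in md:
        md[newkey] = md.pop(oldkey)


def apply_aliases(md, aliases):
    """aliases maps unique key -> alias key.  Modifies md in place and
    returns the number of entries actually changed."""
    nchanged = 0
    for unique, alias in aliases.items():
        if unique in md:
            change_key(md, unique, alias)
            nchanged += 1
    return nchanged
```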
I have the new apply_aliases and clear_aliases methods working for MetadataDefinitions. The only testing I've done is a set of lines in the jupyter notebook tutorial on Metadata. I just checked in those revisions. There are some residual issues discussed in this thread that we should try to resolve before we close it:
md=Metadata()
# assignment - I think this just creates a pointer to the same data and uses reference counting
# to manage memory
md2=md
# a copy constructor like this should create a deep copy
md3=Metadata(md)
I need to test that this is correct, but perhaps you can educate me and avoid some wasted time.
I confirmed that the idea I had above is correct, i.e. assignment only creates a pointer to the same data. A copy constructor is needed to create a deep copy. That is a mismatch with C++, where assignment normally creates a new copy. I did not realize that and may have created some errors in python functions I wrote due to this misunderstanding.
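The same semantics can be demonstrated with a plain dict, which is how the wrapped Metadata behaves on the python side (key names here are just examples):

```python
import copy

d = {"sta": "AAK", "site_lat": 42.6}
alias_view = d                # assignment: a second name for the SAME object
alias_view["site_lat"] = 0.0
assert d["site_lat"] == 0.0   # the change is visible through both names

deep = copy.deepcopy(d)       # an independent deep copy, like Metadata(d)
deep["site_lat"] = 42.6
assert d["site_lat"] == 0.0   # the original is unaffected
```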
I will now go through and add copy constructors to the pybind11 wrappers. They are clearly needed.
Copy constructors have been added, but none have been tested in python. Unclear if that is necessary as the C++ code has been pretty well exercised, although not systematically.
That's great! Yes, in Python, assignment is always assigning a reference. That's probably the most confusing part for C programmers. btw, do you mean that there could be such bugs in the Python code? Those could be difficult to find, but I guess that's why we need the tests.
Yes, I did mean there may be errors in the python code from my misunderstanding of that issue.
Had some time this morning, and that exercise with copy constructors led me to spend the couple of hours I had working on the Ensemble objects. Found some bugs in building the pybind11 wrappers I hadn't detected, as I hadn't really touched the Ensemble implementation before. Once I remembered a couple of tricks and read one page in the pybind11 issues, it turned out to be easy to implement. Ensemble is actually a template class, but I used a feature I hadn't fully used before in pybind11 to implement template classes. It was pretty slick, but since the
I put this on this issues page because it isn't quite clear to me yet how we should implement Ensembles with mongo. Seems we should have load and save methods for ensembles in your new Database object, but I haven't quite thought through how an ensemble should be defined with Mongo. Probably we would want the load method to pass some query to Mongo as the generic method, but perhaps it would make sense to build some common queries to simplify use for a user. e.g. a couple of obvious ones would be load_event and load_station.
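As a concrete sketch of what those convenience wrappers could look like (all names here are hypothetical; the generic method just forwards a raw Mongo query dict, and a real reader would convert the documents to TimeSeries/Seismogram objects rather than returning raw dicts):

```python
def load_ensemble(dbcol, query):
    """Generic form (sketch): return all wf documents matching query."""
    return list(dbcol.find(query))


def load_event(dbcol, source_id):
    # common case: all waveforms linked to one source document
    return load_ensemble(dbcol, {"source_id": source_id})


def load_station(dbcol, net, sta):
    # common case: all waveforms recorded at one station
    return load_ensemble(dbcol, {"net": net, "sta": sta})
```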
While waiting for the IU sysadmins to install the new Totalview program for debugging mixed C and python code, which is what we have, I've been playing with obspy and considering how we might design a reasonably generic tool to assemble a dataset with current-generation web services. We don't want to get too buried in this problem, as there are too many special cases to cover everything, but I think we should strive to make a tool generic enough to cover a reasonable share of cases. A first example of that, and my selfish interest, is a focus on events. I consider the continuous data management problem solved with antelope or variants of a script in the obspy tutorial. So, we should focus on what should be in a tool for assembling an event-based dataset. I want to bounce around some ideas before I try any serious implementation.
The first issue is to be clear about what we need to assemble to build a raw dataset. This wasn't always true, but the issues are now well sorted out into three separate problems:
So some key generalizations I think can help us make a tool like this most generic:
As the last sentence says, the next step is how we might actually implement a prototype. obspy has made some huge advances on these issues in the past few years. The core tool they now have to make this more efficient is the mass downloader. It provides some of the generic functionality, but not all. It is, however, the clear starting point. Overall, the process will need to pretty much define each of the three items above:
This strategy would reduce the algorithm to a loop over events with one web service request per event. Each request would append data as files into two directories: (1) the waveforms would be created as one file per channel, ala SAC, which by default all land in one directory (big problem), and (2) station data gets pushed as stationXML files into another directory (the default is full response info, which is probably advisable).
This strategy should work, but I know with absolute certainty it has two issues we need to resolve:
Finally: we should probably have the current version of obspy (1.2.0) in the master container and for all development. I didn't check when that version was released, but I noticed the versions I had on all my machines were older until I did an update. I think there were improvements to the web service code in this new version.