mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

using geospatial features in MongoDB #39

Open pavlis opened 4 years ago

pavlis commented 4 years ago

I have been playing with importing data into MongoDB most of the day and have made A LOT of progress quickly. I can now fully load at least the Metadata from seispp data files converted with the new antelope export_to_mspass program. I have all the data from 2004 and 2005 for earthquakes with magnitude greater than 6 recorded by usarray. That is about 5000 headers.

At the same time I've been building up the master schema of names. In working with MongoDB today, however, I ran across something that quickly solves a problem I had been wondering how to handle: how to build master source and site documents. Both suffer from the issue of needing to compare numbers that are almost but not exactly equal, and for which the metric that defines "close" is data dependent; i.e., "close" is much smaller for the site document (table) than for source data. Also, the source data has to include a time element, which will add a minor but solvable complication.

I think the approach we should use for coordinate data is to load waveforms with source and receiver locations stored in the wf document. The overhead is tiny compared to waveform data, and worst case the wf document could be cleaned if needed. Anyway, the point is it should not be difficult with the geospatial features in MongoDB to scan the wf collection, pull out all comparable coordinates, and reinsert them as geospatial data. Then a second function could be written (a bit harder due to the need to deal with elevation or depth and, for sources, time) that would find all redundant coordinates and write a master site or source document (table).
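
To make the first step concrete, here is a minimal sketch, assuming flat attributes named site_lat and site_lon in the wf documents (those names, and the site_location output field, are placeholders rather than a fixed schema), of how pymongo could rewrite the coordinates as GeoJSON Points and index them:

```python
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("localhost", 27017)
db = client["mspass"]
wf = db["wf"]

# Rewrite flat lat/lon pairs as GeoJSON Points (GeoJSON order is [lon, lat])
for doc in wf.find({"site_lat": {"$exists": True}, "site_lon": {"$exists": True}}):
    point = {"type": "Point", "coordinates": [doc["site_lon"], doc["site_lat"]]}
    wf.update_one({"_id": doc["_id"]}, {"$set": {"site_location": point}})

# A 2dsphere index is what makes later $near/$geoNear proximity queries cheap
wf.create_index([("site_location", GEOSPHERE)])
```

The same loop would apply to source coordinates with a different set of keys.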

The potential negative is I'm not totally sure this will be that helpful due to the 2D limitation of the geospatial package. That is important because there are lots of examples of sources that are "close" by a reasonable metric but have different depths and occurred at different times. It happens for many GSN stations too, where there are surface and borehole sensors with elevations that are different enough to matter.

So I put this out there for discussion and consideration and as a reminder to consider this approach.

wangyinz commented 4 years ago

I just read MongoDB's geospatial documentation, and it has actually improved a lot since the last time I looked at this topic. GeoJSON objects now support a list of object types, which could be helpful for things like visualization and interpretation data.

I am aware of the 2D limitation of this geospatial feature. I think this is a very common issue, since most users of such a database would only need 2D. Just as you have said, we should make our own query for the third dimension and time after using the built-in 2D ones. I think that should work fine. It should be straightforward to do it with a single aggregation operation, sketched below.
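
A possible shape for that aggregation, assuming hypothetical source documents with a GeoJSON "epicenter" field (indexed with 2dsphere), a "depth" in km, and an epoch "time"; the field names and thresholds below are illustrative only:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["mspass"]

# Hypothetical reference hypocenter to match against the source collection
ref_lon, ref_lat, ref_depth, ref_time = -147.44, 61.02, 30.0, 1104674000.0

pipeline = [
    {
        # Built-in 2D proximity search; must be the first stage of the pipeline
        "$geoNear": {
            "near": {"type": "Point", "coordinates": [ref_lon, ref_lat]},
            "distanceField": "dist_m",   # great-circle distance in meters
            "maxDistance": 50.0e3,       # horizontal "close" threshold: 50 km
            "key": "epicenter",
        }
    },
    # Our own criteria for the third dimension and origin time
    {
        "$match": {
            "depth": {"$gte": ref_depth - 10.0, "$lte": ref_depth + 10.0},
            "time": {"$gte": ref_time - 30.0, "$lte": ref_time + 30.0},
        }
    },
]
duplicates = list(db.source.aggregate(pipeline))
```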

pavlis commented 4 years ago

After working on this some more, I quickly realized (or in some cases remembered) the subtle complexities of the issues we face here. There are two problems that are similar, but different enough that I think they need to be treated differently.

  1. Receiver coordinates might seem a simple issue, but it is far from that for modern passive array data. There are three issues: (1) observatory class stations (like GSN) that have multiple sensors at the same site (commonly defined by multiple loc codes), (2) a special - and common - case where one of the loc codes is a borehole instrument and the other is a vault, and (3) possible time dependence of locations - it is not unheard of for a station to retain the same name but move a "small" distance. Note the response and orientation issues are even uglier, but I think we should treat those as different problems. They are likely best solved by antelope and/or web services. I think all three issues can be handled by making the definition of "close" soft. The initial implementation for handling this uses a 2D distance in degrees combined with a different metric for elevation differences; i.e., I define two locations as equal when the 2D distance is less than one threshold (defined by a defaulted python argument) AND the elevation difference is less than a second threshold (see the sketch after this list). Handling stations that move a small distance will depend on how far that move is relative to that threshold. That is a pretty rare event, and perhaps is not worth obsessing over.

  2. Source locations are a different challenge for two reasons: (1) time is an important additional soft variable that provides critical constraints, and (2) there may be several location estimates, and how to resolve which is best is ambiguous (the prefor problem of CSS3.0). We could obsess over this problem, but my recommendation is we adopt this axiom: catalog preparation is not our problem but has been solved by multiple external systems. Antelope, earthworm, and IRIS-FDSN all have internal solutions to this problem. We should simply tell users this needs to be solved by one of those alternatives. At most we should supply solutions as part of the documentation and things we develop for testing. This is a thorny problem with so many picky issues that we should push it off as a user problem - assembly of your data set is your problem. The best we can do is supply some of the tools to do that and to validate the results. My first step in this direction is a thing we need immediately anyway - a way to build an events collection.
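
A minimal sketch of the soft equality test described in item 1, with purely illustrative default thresholds (each would be a defaulted python argument, as described above):

```python
from math import radians, degrees, sin, cos, asin, sqrt

def gcdistance_deg(lat1, lon1, lat2, lon2):
    """Great-circle separation of two points in degrees of arc (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return degrees(2 * asin(sqrt(a)))

def same_site(lat1, lon1, elev1, lat2, lon2, elev2,
              deg_tolerance=0.0001, elev_tolerance=10.0):
    """Treat two receiver locations as one site when the 2D separation is below
    deg_tolerance (degrees) AND the elevation difference is below elev_tolerance
    (same units as elev1/elev2). The defaults here are only placeholders."""
    return (gcdistance_deg(lat1, lon1, lat2, lon2) <= deg_tolerance
            and abs(elev1 - elev2) <= elev_tolerance)
```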

I think I know how to handle the receiver problem well enough. We can do that pretty crudely, I think, and won't face scalability issues. All open stations that currently exist number on the order of 10000, which can be handled purely in memory by any program without stressing memory use. Source locations are a different story, as there are many catalogs with millions of location estimates. I have some ideas to make this fancier that build on an EventCatalog object I created years ago in Antelope, but my initial thought is that is overkill. I'm inclined to remember that mantra I have given you several times: make it work before you make it fast.

pavlis commented 4 years ago

Pushed the wrong button. Didn't mean to close the issue.

wangyinz commented 4 years ago

I think the solution for receiver locations looks good enough with a soft criterion based on 2D distance and elevation separately. I always assumed that if a station is moved, it would be assigned a new loc code, so a waveform defined by net.sta.loc.chan should always be time-independently unique. I thought all we need to do is add the loc key to what Antelope has.

I agree that we don't need to go into the issue of source locations. Users should just go with whatever a catalog defines. We can definitely make it fancier when needed down the road, but we can worry about that later.

pavlis commented 4 years ago

From my experience station Metadata is a type example of Murphy's Law, in this case stated this way: anything that could go wrong, whether you can imagine it or not, will. I've come around to the view that we should handle coordinates in isolation. In every case I can imagine, all that really matters is a station site as a "point", but the point is not infinitely small. Equivalence of position scales with apparent wavelength, by which I mean wavelength along Earth's surface. If one is working with teleseismic P waves like you did for your dissertation, the "point" can be pretty large - hundreds of m or more. If you are working with free oscillation data it is kms in size. The other end member would be something like the Homestake data, where there was signal to 1 kHz and sub-meter precision is needed. That is why this idea of a soft position is exactly what we need.

The other side of this, and what causes CSS3.0 and, I suspect, obspy to get bogged down, is improperly linking component orientation data and response data to a physical location. Those actually belong to an instrument-sensor combination that has different space-time properties. A point with a sphere drawn around it is a fixed thing forever, while instrumentation changes all the time with real data. Hence, our design needs to keep these separated.

Source data are similar to receiver location data, but subject to two different issues: (1) time is a critical added parameter and (2) the issue CSS3.0 handles with event->origin (prefor). It seems we both agree that is a known, robust working solution to that problem and we shouldn't mess with it.

I think the next thing I'm going to do today is write a section of the documentation I already had a stub for in the html directory. That is, I am going to write a database design section that explains the logic behind what I just wrote here. I'm pretty confident now this basic design is what we will want:

wf - the collection used to contain working data sets. I've come to think the documents in this collection should have complete headers/metadata that can be readily scanned for completeness with a program we could write in a few hours. You would just give this program a list of required attributes and a query that defines the dataset. The verification would then just scan that set of documents and log a message that could be used to locate the error and fix it (a rough sketch of such a scan follows this list). Normalized data (like receiver and source coordinates) would be assumed to be elsewhere, and different utility programs would be needed to validate the normalized data. I assume we would use your idea for describing partially processed data with the linked list to parents. We would probably also need some other unique key to define a common stage for a full data set.

site - receiver coordinate data as noted above

source - source coordinate data as noted above

elog - error log data with cross-references to wf. Note there will be a housecleaning problem with elog, as it would be very easy to remove entries from wf leaving orphan log entries with no parent in wf.
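
For the wf completeness check described above, a rough sketch could look like this; the attribute names in the example query and required list are placeholders, not the final schema:

```python
from pymongo import MongoClient

def verify_wf(db, query, required_keys):
    """Scan wf documents matching query; return a dict of _id -> missing keys."""
    problems = {}
    for doc in db.wf.find(query):
        missing = [k for k in required_keys if k not in doc]
        if missing:
            problems[doc["_id"]] = missing
    return problems

client = MongoClient("localhost", 27017)
db = client["mspass"]
bad = verify_wf(db, {"sta": "AAK"},
                ["delta", "npts", "starttime", "site_lat", "site_lon"])
for wfid, missing in bad.items():
    print(f"wf document {wfid} is missing required attributes: {missing}")
```

And for the elog housecleaning problem, a simple scan could flag orphaned log entries; the cross-reference key name wf_id is an assumption about how the link would be stored:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["mspass"]

# Collect the set of wf ids that still exist, then flag elog entries whose
# cross-reference points at nothing
live_wf_ids = set(db.wf.distinct("_id"))
orphans = [doc["_id"]
           for doc in db.elog.find({}, {"wf_id": 1})
           if doc.get("wf_id") not in live_wf_ids]
print(f"found {len(orphans)} orphaned elog documents")
# db.elog.delete_many({"_id": {"$in": orphans}})  # uncomment to actually delete
```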