MongoDB index keys - Githubissues

pavlis commented 4 years ago

I just encountered our first example of a query pattern that demands an index. There may be a better algorithm to do this, but using the db is the more scalable and general solution so it is worth pursuing a bit I think.

The problem I'm working with is matching arrivals times with seismic data with which that arrival pick or theoretical time is associated. This is a problem css3.0 as implemented by Antelope sucks. A long standing headache in working with segmented data in antelope is that their wfdisc table has real problems if you have waveforms that overlap. That problem happens all the time if you do what I just did to assemble a data set. That is, I took a list of earthquakes and downloaded fixed time windows defined by a time range around the earthquakes origin time. There is similar more complicated pattern we deal with often with segments defined relative to a specific seismic phase. e.g. one might ask for data in the range of 30 s before P arrival to 5 minutes after the arrival time for each station in a list. In either case the problem at hand is this:

Imagine you have the waveform data in a big heap (a file system) scattered through a bunch of files. The general problem we need to deal with in MsPASS is it shouldn't matter how those files are organized. We want to use the database to assemble our organized data set.
For the time being let's assume the data are all in miniseed or seed format. Then they have a natural top level index in the network and station code.
Assume we go with the model I've been prototyping where the files are parsed and the metadata describing the data in the collection of files is placed in a MongoDB collection. That collection has attributes for net, sta, startime, and endtime for each piece of data along with path information on how to locate the file where that data resides.
Assume we also have an "arrival" collection. Arrival also has a net, sta, and a time field but here the time field is a phase arrival time.

For this general problem I'm about 99% sure we should construct a compound index for both the waveform and arrival collections using the following keys is this order: network code, station code, and one or more time fields. As I read documentation on MongoDB that should dramatically improve any query to link waveform and arrival data. I'm going to test this theory on an example I have running. The prototype code without an index is running incredibly slowly cross referencing a data set with the order of a few thousand waveforms with an arrival collection are out a half million documents. I'm going to run timing test before and after building the index on arrival. I'm betting the effect will be dramatic.

pavlis commented 4 years ago

Ran a test of the impact of building a net,sta,time index on lookup of docs in the arrival collection. The arrival collection here is of the order of 380,000 documents. The function involved looks up one doc in the entire collection for each seismogram processed. This table gives the number of seismograms processed and the time required with and without the index

Number_seis T_noindex T_indexed 3816 721.0 19.9 3807 665.8 50.6 3803 756.4 22.8

So, as I suspected the index made a huge difference. We should use that same index on site and channel. For the source collection we need to revisit the spatial query feature of MongoDB.

wangyinz commented 4 years ago

Cool! I think I've mentioned this earlier in an email or something, but anyway, when revising the mspass.yaml file, I found that we will need a way to specify indexes in our MetadataDefinitions. All we need is probably just another boolean under each of the attributes in mspass.yaml.

wangyinz commented 4 years ago

Well, actually, it can be more sophisticated than a boolean - we probably should also indicate the index type there.

mspass-team / mspass

MongoDB index keys #81