mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

suggested feature to add to merge function #476

Open pavlis opened 10 months ago

pavlis commented 10 months ago

Our merge function, found in mspasspy/algorithms/window, has a somewhat unnecessary restriction we can and probably should fix. The only reason I don't just do it is that there are multiple ways to fix the same problem.

The problem is that merge currently insists that it be given a DoubleVector of TimeSeries objects AND that the vector MUST be in starttime order. Currently it returns a dead datum if that condition isn't satisfied. That seems unnecessarily pedantic for an input error a couple of lines of code could correct. I think I wrote this code originally, and I can't understand why I put that restriction in there in the first place.

The problem is easily addressed. There are at least 3 ways I can think of to solve it.

  1. My suggestion is to handle it in the C++ function that does the main work behind the python wrapper merge. std::sort from the C++ <algorithm> header will do that about as fast as possible with a custom comparison function on the TimeSeries attribute t0.
  2. I am pretty certain you could do the same thing with a Python sort, for which there are many options (see the sketch after this list). It is unclear what, if any, performance hit that would carry.
  3. We could just ignore it and add a description to the docstring, including a code example, shown below, of how to handle this with database input.
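For either 1 or 2 the fix amounts to a pre-sort on t0. Here is a minimal sketch of the Python version of option 2, assuming tsvector is the input vector of TimeSeries objects the merge wrapper receives; the C++ version of option 1 is the same idea with std::sort and a comparator on t0. Note that sorted returns a plain Python list, so whether a conversion back to DoubleVector is needed depends on what the pybind11 layer accepts:

# Hypothetical pre-sort inside the merge wrapper.  Python's sorted uses
# Timsort, which costs only a linear scan when the input is already in
# starttime order, so doing this unconditionally is cheap.
tsvector = sorted(tsvector, key=lambda ts: ts.t0)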

For the record, option 3 is easy to do with the new Database implementation. Here is an example:

# sorting the cursor by starttime guarantees merge's input ordering
cursor = db.wf_miniseed.find({'sta': 'A4100', 'chan': 'LHZ'}).sort('starttime')
ensemble_read = db.read_data(cursor, collection='wf_miniseed')

I think option 3 is a terrible solution for a problem this easy to fix. I include it here more as a useful trick for anyone who wants to avoid the overhead of having the code sort the data when it detects this error. In general, if one is reading continuous data to run merge, db queries that limit the time range (not shown above for simplicity) and restrict the result to a common channel, as above, should be the recommended practice. Note also, by the way, that with the upcoming v2 of Database this would be very parallelizable, with the new read_distributed_data driven by a list of db query dicts.
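For completeness, here is the query pattern just described, with t_start and t_end as hypothetical epoch times bounding the window of interest:

# restrict to one channel and a fixed time window, sorted by starttime
query = {
    'sta': 'A4100',
    'chan': 'LHZ',
    'starttime': {'$gte': t_start, '$lt': t_end},
}
cursor = db.wf_miniseed.find(query).sort('starttime')
ensemble_read = db.read_data(cursor, collection='wf_miniseed')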