mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Handling data marked dead #69

Open · pavlis opened this issue 4 years ago

pavlis commented 4 years ago

A long drive from IN to PA gave me time to think about issues we will face. One I came up with is that we should give some thought to how to uniformly handle killed data, i.e. data marked dead. In seismic reflection processing, standard practice is to carry such data along as baggage. The reasons are twofold: (1) multichannel recording and the concept of a section make the matrix the standard conceptual model for the data, and a matrix with a dead row/column cannot simply have that row/column deleted; and (2) an implicit assumption in all reflection data acquisition is that killed data are a small fraction of the data set. Neither of these properties typically holds for earthquake data. Hence, carrying along data flagged as dead is a waste of resources, particularly when the dead data make up a significant fraction of the data set.

I think a way to handle this is to define a (python) function that handles data marked dead: remove them from the data stream, but record the reason they were killed. I have the perfect name for this function given the origins of the name python: bring_out_your_dead. If you don't get the joke, watch Monty Python and the Holy Grail and you will; it is a perfect match to the origins of the name python. It could be one function whose name few users will forget. Some issues:

Thoughts?
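For concreteness, here is a minimal sketch of what such a function might look like. Everything in it is hypothetical: I'm assuming each data object carries a boolean `live` flag and an error log explaining any kill, and that `db` is some handle that can persist those log messages (the `save_elog` call is a placeholder, not an existing API).

```python
def bring_out_your_dead(d, db=None):
    """Separate live and dead members of an ensemble (list) of data objects.

    Assumes each member has a boolean `live` attribute and an error log
    describing why it was killed.  If a database handle is given, the log
    of each dead member is saved before the member is discarded.
    Returns a list containing only the live members.
    """
    living = []
    for x in d:
        if x.live:
            living.append(x)
        elif db is not None:
            # hypothetical call: persist the reason(s) this datum was killed
            db.save_elog(x)
    return living
```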

wangyinz commented 4 years ago

lol, I had to search for that film, but I can totally see that it is quite a "pythonic" name.

I think the way to do it is to have a filter operation that selects all the dead traces, and then a database call that saves their elogs. The former could be a function in our parallel API, and the latter needs to be implemented in our database API as a generic function like save_elogs.
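Something like this is what I have in mind for the database side. This is only a sketch, assuming elog entries go into a MongoDB collection; the collection name, document layout, and the attribute names on the data objects (`elog`, `algorithm`, `message`, `badness`, `live`) are placeholders for whatever our ErrorLogger and schema end up defining.

```python
from pymongo import MongoClient

def save_elogs(db, dlist):
    """Generic sketch: write one document per error-log entry of each datum
    in dlist to an `elog` collection.  Field names are placeholders."""
    docs = []
    for d in dlist:
        for entry in d.elog:          # placeholder: iterate log entries
            docs.append({
                "algorithm": entry.algorithm,
                "message": entry.message,
                "badness": str(entry.badness),
            })
    if docs:
        db.elog.insert_many(docs)

# Usage sketch.  `ensemble` is assumed to be an existing list of data
# objects with `live` and `elog` attributes.
client = MongoClient()
db = client.mspass
dead = [d for d in ensemble if not d.live]    # the filter operation
save_elogs(db, dead)
ensemble = [d for d in ensemble if d.live]    # keep only live data
```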

pavlis commented 4 years ago

I was thinking the same thing. I wanted to get it in your head since you guys are working on the parallel API right now, and it seems like something to include because it could reduce data volume dramatically in some workflows.
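For the parallel side, here is a rough sketch of how the filter could slot into a dask bag workflow. Again, names are placeholders: `dlist` is an existing list of data objects with a `live` flag, `db` is a database handle, `save_elogs` is the generic function above, and `detrend`/`bandpass` stand in for whatever processing functions the workflow applies.

```python
import dask.bag as dbag

# Build a bag from an existing list of data objects (assumed to exist).
data = dbag.from_sequence(dlist, npartitions=16)

live = data.filter(lambda d: d.live)
dead = data.filter(lambda d: not d.live)

# The dead fraction is presumably modest, so pull it back to the client,
# save its error logs, and run the rest of the workflow on live data only.
save_elogs(db, dead.compute())
result = live.map(detrend).map(bandpass).compute()   # placeholder processing
```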