mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Handling data marked dead #69

Open · pavlis opened this issue 4 years ago

pavlis commented 4 years ago

A long drive from IN to PA gave me time to think about issues we will face. One I came up with is that we should give some thought to how to uniformly handle killed data, i.e. data marked dead. In seismic reflection processing, standard practice is to carry such data along as baggage. The reasons are twofold: (1) multichannel recording and the concept of a section make the matrix the standard conceptual model for the data, and a matrix with a dead row/column cannot simply have that row/column deleted; and (2) an implicit assumption in all reflection data acquisition is that killed data are a small fraction of the data set. Neither of these properties typically holds for earthquake data. Hence, carrying along data flagged as dead is a waste of resources, particularly when the dead data make up a significant fraction of the data set.

I think a way to handle this is to define a (python) function that handles data marked dead: remove them from the data stream, but record the reason they were killed. I have the perfect name for this function given the origins of the name python: bring_out_your_dead. If you don't get the joke, watch Monty Python and the Holy Grail and you will; it is a perfect match to the origins of the name python. It could be one function whose name few users will forget. Some issues:

Thoughts?
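For concreteness, here is a minimal sketch of what such a function might look like. Everything in it is hypothetical: I'm assuming each data object carries a boolean `live` flag and an error log explaining any kill, and that `db` is some handle that can persist those log messages (the `save_elog` call is a placeholder, not an existing API).

```python
def bring_out_your_dead(d, db=None):
    """Separate live and dead members of an ensemble (list) of data objects.

    Assumes each member has a boolean `live` attribute and an error log
    describing why it was killed.  If a database handle is given, the log
    of each dead member is saved before the member is discarded.
    Returns a list containing only the live members.
    """
    living = []
    for x in d:
        if x.live:
            living.append(x)
        elif db is not None:
            # hypothetical call: persist the reason(s) this datum was killed
            db.save_elog(x)
    return living
```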

wangyinz commented 4 years ago

lol, I had to search for that film, but I can totally see that it is quite a "pythonic" name.

I think the way to do it is to have a filter operation that selects all the dead traces, and then a database call that saves their elogs. The former could be a function in our parallel API, and the latter needs to be implemented in our database API as a generic function like save_elogs.
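Something like this is what I have in mind for the database side. This is only a sketch, assuming elog entries go into a MongoDB collection; the collection name, document layout, and the attribute names on the data objects (`elog`, `algorithm`, `message`, `badness`, `live`) are placeholders for whatever our ErrorLogger and schema end up defining.

```python
from pymongo import MongoClient

def save_elogs(db, dlist):
    """Generic sketch: write one document per error-log entry of each datum
    in dlist to an `elog` collection.  Field names are placeholders."""
    docs = []
    for d in dlist:
        for entry in d.elog:          # placeholder: iterate log entries
            docs.append({
                "algorithm": entry.algorithm,
                "message": entry.message,
                "badness": str(entry.badness),
            })
    if docs:
        db.elog.insert_many(docs)

# Usage sketch.  `ensemble` is assumed to be an existing list of data
# objects with `live` and `elog` attributes.
client = MongoClient()
db = client.mspass
dead = [d for d in ensemble if not d.live]    # the filter operation
save_elogs(db, dead)
ensemble = [d for d in ensemble if d.live]    # keep only live data
```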

pavlis commented 4 years ago

I was thinking the same thing. I wanted to get it in your head since you guys are working on the parallel API right now, and it seems like something to include because it could reduce data volume dramatically in some workflows.
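For the parallel side, here is a rough sketch of how the filter could slot into a dask bag workflow. Again, names are placeholders: `dlist` is an existing list of data objects with a `live` flag, `db` is a database handle, `save_elogs` is the generic function above, and `detrend`/`bandpass` stand in for whatever processing functions the workflow applies.

```python
import dask.bag as dbag

# Build a bag from an existing list of data objects (assumed to exist).
data = dbag.from_sequence(dlist, npartitions=16)

live = data.filter(lambda d: d.live)
dead = data.filter(lambda d: not d.live)

# The dead fraction is presumably modest, so pull it back to the client,
# save its error logs, and run the rest of the workflow on live data only.
save_elogs(db, dead.compute())
result = live.map(detrend).map(bandpass).compute()   # placeholder processing
```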