Here are some thoughts on what MsPASS needs for global-level management, in rough order of priority for functionality:
Those are the first things that came to mind.
Driving 4000 miles, mostly on interstate, gave me some time to think about a lot of things. One thing I thought about was how the global manager would work. I think this is also related to how we define what a "processing module" or "processing function" is. The point I want to make here could have appeared in a different issue, but because it is connected to our ideas of global history I decided to put it here. I'm reducing this for now to the concepts I have in mind.
The concept we need to implement is an abstraction of what a processing module/function is and how it should behave. Wrappers for a particular algorithm can then be thought of as implementations of the base class. Here, in words, is what I see as the functionality for this concept:
This is incomplete, but hopefully it can start this discussion. Reading back, I recognize what I wrote may not be very clear. Call this the opening salvo on this topic.
In today's Zoom call we decided to visit this issue (global history manager) in next week's call. I said then, and emphasize again, that a good way to iron some of this out between now and then is to post some dialogue for the record in the issues section. This isn't as volatile as the database API design was, so I think it can be handled here reasonably well. The pieces above are sketchy, and some of the things I said earlier I now realize were based on misconceptions about Spark that I no longer have, although I may have others.
Anyway, I can think of two generic approaches for setting up global history (are there others?):
Are there other approaches? There is likely a literature on this topic in computer science, although it is likely hidden behind jargon.
A related point from above remains relevant and impacts the API for a global history manager: we already have examples of two very different approaches to defining what an algorithm is:
This matters because they create very different constructs. Function calls always go like this:
```python
d = data_creator()
...
d = functioncall(d, param1, param2, param3)
```
while *creation is initialization* always works like this:
```python
myproc = MyProcessor(p1, p2, p3)  # the constructor call initializes MyProcessor's state
...
d = data_creator()
...
myproc.apply(d)  # run the myproc algorithm on d - could also return a new object
```
The first is potentially more dynamic, as noted above, while the second is easier to set up with a simple history manager - it only needs to register itself on creation.
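Here is a minimal sketch of that second pattern. All the names here (`BaseProcessor`, `_history_registry`) are invented for illustration and are not MsPASS API; the point is only that registration happens exactly once, as a side effect of construction:

```python
import uuid

_history_registry = {}  # stand-in for a real global history manager

class BaseProcessor:
    """Abstract processing module: registers itself with the history mechanism once."""
    def __init__(self, algorithm, parameters):
        self.algid = str(uuid.uuid4())
        _history_registry[self.algid] = (algorithm, parameters)

    def apply(self, d):
        raise NotImplementedError

class MyProcessor(BaseProcessor):
    def __init__(self, p1, p2, p3):
        # registration happens here, when the object is constructed
        super().__init__("MyProcessor", (p1, p2, p3))
        self.p1, self.p2, self.p3 = p1, p2, p3

    def apply(self, d):
        # run the algorithm on d - could also return a new object
        return d
```

The function-call style, by contrast, would need some external bookkeeping or per-call registration to accomplish the same thing.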
Had another thought on this issue this morning. This particular idea builds on the init-exec model from seismic reflection processing and the approach to turning parallelism on and off in OpenMP. Anyone reading this who is not familiar with OpenMP can check the many online sources. Here is a particularly useful summary of the syntax.
The idea borrowed from OpenMP is to use a preprocessor to set up the global history parameters. The preprocessor instructions would be preceded by a comment mark (in Python the "#" character) followed by a keyword to unambiguously separate such lines from a normal comment line. It might be clearer if I just make up a small example:
```
#mspass algorithm=db.readwf instance=1 command="d=db.readwf(algorithm=$alg,algid=$algid)" historyoff="d=db.readwf()"
...
#mspass algorithm=filter instance=1 command="filter(d,'highpass',freq=1.0,algorithm=$alg,algid=$algid)" historyoff="filter(d,'highpass',freq=1.0)"
```
A preprocessor could just scan the job script and use the #mspass lines to define unique ids for each algorithm instance. The exact syntax above is not the point; the concept is using a preprocessor to handle the init-like creation of data for the global history manager. The init could be run in multiple modes:
This model would require that init be run before any job runs. There might be a less dogmatic way to do this, but I hope it illustrates the point of this approach: using a preprocessor to set up the history mechanism and turn it on and off. It also addresses a related problem by providing a mechanism for a basic sanity check on the job script.
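To make the scanning step concrete, here is a rough sketch that assumes directives look like the made-up #mspass lines above; the parsing details (shlex, uuid) are illustrative choices, not a proposal for the final syntax:

```python
import shlex
import uuid

def scan_job_script(path):
    """Collect #mspass directive lines and assign each algorithm instance a unique algid."""
    registrations = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line.startswith("#mspass"):
                continue
            # shlex honors the quoted values in the directive line
            tokens = shlex.split(line[len("#mspass"):])
            fields = dict(tok.split("=", 1) for tok in tokens)
            fields["algid"] = str(uuid.uuid4())
            registrations.append(fields)
    return registrations
```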
I also think the init-exec model works better, but instead of implementing a preprocessor, why don't we just use normal class/function calls? I think the equivalent of that example could easily be replaced with similar lines that call a global history class. The problem with preprocessing is that Python is assumed to be interactive, so unless we do some sophisticated tweaking, the preprocessor won't work, for example, in a Jupyter notebook.
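For comparison, a hypothetical sketch of what the normal-class-call alternative could look like; `GlobalHistoryManager` and its `register` method are invented names for illustration, and these same lines would work unchanged in a notebook:

```python
import uuid

class GlobalHistoryManager:
    """Invented sketch: assigns a unique id to each registered algorithm instance."""
    def __init__(self, job_name, history_on=True):
        self.job_name = job_name
        self.history_on = history_on
        self.registry = {}

    def register(self, algorithm, parameters=""):
        algid = str(uuid.uuid4())
        if self.history_on:
            self.registry[algid] = {"algorithm": algorithm, "parameters": parameters}
        return algid

# the explicit-call equivalent of the #mspass directive lines above
mgr = GlobalHistoryManager(job_name="example_job")
read_id = mgr.register("db.readwf")
filter_id = mgr.register("filter", "highpass, freq=1.0")
```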
Actually, another thing that struck me is that since we want to make the history part optional, we might want to consider turning `ProcessingHistory` into a member of a `TimeSeries` or `Seismogram`. This would also resolve the namespace conflict that Jinxin recently discovered - both `Metadata` and `ProcessingHistory` have a `clear` method, and we actually cannot access the `ProcessingHistory.clear` method in Python.
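A tiny pure-Python illustration of the clash (the real classes are C++ bindings, so this is only an analogy): with multiple inheritance the method resolution order makes one `clear` shadow the other, while composition keeps both reachable:

```python
class Metadata:
    def clear(self):
        print("Metadata.clear")

class ProcessingHistory:
    def clear(self):
        print("ProcessingHistory.clear")

# multiple inheritance: Metadata.clear shadows ProcessingHistory.clear
class TimeSeries(Metadata, ProcessingHistory):
    pass

TimeSeries().clear()  # prints "Metadata.clear"

# composition: ProcessingHistory becomes a member, so both methods stay reachable
class TimeSeries2(Metadata):
    def __init__(self):
        self.history = ProcessingHistory()

d = TimeSeries2()
d.clear()          # Metadata.clear
d.history.clear()  # ProcessingHistory.clear, no ambiguity
```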
BTW, it is true that there are a lot of papers about data provenance if you just do a Google search. A related higher-level concept is called data lineage. I have not dived into that field before, but my understanding is that we are not really trying to achieve all the fancy functionality that a lot of these papers discuss (for example, people use blockchain to ensure the provenance info is accurate). Maybe we should start with a literature review, or we can look at some of the implementations out there and see if there is an existing package we can leverage. For example, a quick search led me to this package that seems interesting.
I looked into the provenance package and found that it is pretty much a generalized version of our object-level data provenance plus a global-level provenance managed with an object store. I think our global-level manager can have a similar API to that package, where we can use decorators to extract information like function name, process id, and host. Instead of letting users define a storage backend for the provenance info, we would just implicitly push it to MongoDB. For the "creation is initialization" algorithms, I guess we can have a slightly different decorator that applies to class methods. It could then pull info from the class on how the object was initialized.
I agree that we don't even need to use the provenance package; we can write our own decorator pretty easily. I also found a useful package for logging into Mongo: https://pypi.org/project/log4mongo/. In the decorator, we just need to log the information into Mongo. But the user still needs to specify the job name or another attribute that we can use as the key.
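As a sketch of such a home-grown decorator, assuming pymongo and made-up database/collection names (`mspass.global_history`) plus a user-supplied `job_name` key:

```python
import datetime
import functools
import os
import socket

from pymongo import MongoClient

# assumed database/collection names, for illustration only
history = MongoClient("localhost", 27017).mspass.global_history

def mspass_history(job_name):
    """Decorator sketch: record one provenance document per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            history.insert_one({
                "job_name": job_name,
                "algorithm": func.__name__,
                "host": socket.gethostname(),
                "pid": os.getpid(),
                "time": datetime.datetime.now(datetime.timezone.utc),
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
            })
            return func(*args, **kwargs)
        return wrapper
    return decorator

@mspass_history(job_name="example_job")
def apply_filter(d, filter_type, freq=1.0):
    # stand-in for the filter example in this thread
    return d
```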
Now that we have a working prototype of the object-level history mechanism, we are using this thread to start designing what we have called the global-level process manager. As discussed earlier on Zoom, we will start by listing all desired features, then narrow the list down to the realistic ones, and then to the basic ones to start implementing.
Right now, there are two major functionalities anticipated in this module: job-level history management and workflow validation. The history management will be used to assign and keep track of every algorithm executed within a job by assigning IDs and algorithm definitions. The workflow validation follows a concept seen in many seismic reflection processing packages, where a sanity check is done before executing a long chain of data processing to ensure the metadata and data meet certain requirements for all steps of the workflow.
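To illustrate the validation idea, here is a toy sketch; the required-key table and the step list are entirely made up, and a real version would presumably pull requirements from each algorithm's definition rather than a hard-coded table:

```python
# hypothetical per-algorithm Metadata requirements
REQUIRED_KEYS = {
    "db.readwf": [],
    "filter": ["sampling_rate"],
    "deconvolve": ["sampling_rate", "source_id"],
}

def validate_workflow(steps, available_keys):
    """Dry-run sanity check: report problems before any data are processed.

    steps is a list of (algorithm, keys_the_step_adds) tuples.
    """
    problems = []
    keys = set(available_keys)
    for algorithm, produces in steps:
        missing = [k for k in REQUIRED_KEYS.get(algorithm, []) if k not in keys]
        if missing:
            problems.append((algorithm, missing))
        keys.update(produces)  # keys this step makes available downstream
    return problems

# reading waveforms supplies sampling_rate, but nothing supplies source_id,
# so the check flags the deconvolve step before the job runs
steps = [("db.readwf", ["sampling_rate"]), ("filter", []), ("deconvolve", [])]
print(validate_workflow(steps, available_keys=[]))  # [('deconvolve', ['source_id'])]
```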