mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Design of Global Level Manager #61

Open wangyinz opened 4 years ago

wangyinz commented 4 years ago

As we have a working prototype of the object level history mechanism now, we are using this thread to start designing what we call the global level process manager. As discussed earlier on Zoom, we will start by listing all desired features, then narrow the list down to the realistic ones, and then to a basic set to start implementing.

Right now, there are two major functionalities anticipated in this module: job level history management and workflow validation. For history management, the module will assign IDs and algorithm definitions in order to keep track of every algorithm executed within a job. For workflow validation, it will follow a concept similar to that seen in many seismic reflection processing packages, where a sanity check is done before executing a long chain of data processing to ensure the metadata and data meet the requirements of every algorithm in the chain.
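To make the validation idea concrete, here is a minimal sketch assuming waveform headers live in a MongoDB wf collection and each processing step declares the Metadata keys it requires; the collection and key names are only illustrative, not the MsPASS schema.

```python
# Illustrative only: check that every waveform document defines the Metadata
# keys each processing step needs before launching a long processing chain.
from pymongo import MongoClient

def validate_workflow(db, required_keys_per_step):
    """Return a list of (step, wf_id, missing_keys) problems found."""
    problems = []
    for step, required in required_keys_per_step.items():
        for doc in db.wf.find({}):
            missing = [k for k in required if k not in doc]
            if missing:
                problems.append((step, doc["_id"], missing))
    return problems

db = MongoClient("localhost", 27017).mspass
checks = {"filter": ["sampling_rate", "npts"], "deconvolution": ["site_lat", "site_lon"]}
for step, wf_id, missing in validate_workflow(db, checks):
    print(f"step {step}: wf {wf_id} is missing {missing}")
```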

pavlis commented 4 years ago

Here are some thoughts on what mspass needs for global level management. The order roughly reflects the priority of importance of the functionality:

  1. We need a system to verify the integrity of all input data. I think for us that means a test that all the time series can be reconstructed successfully and that all required metadata are defined for each of the seismograms to be read.
  2. We need a system for detecting any errors in the processing script. Since we are using python that should allow us to adapt common tools for checking scripts for errors.
  3. We need some level of estimation of output disk volume for any intermediate saves and any final output. That is a component needed to verify that a job will not eat up all local data storage.
  4. We need a standard mechanism to time each processing element to get statistics on the time to process each object. That is essential for testing, but also essential to a user trying to estimate how a problem will scale from a trial data set - the only sensible way to prepare for a massive processing job.
  5. Job scripts need to be stored automatically in MongoDB and assigned a unique id that should be attached to all object level history data. Our current API provides for this, but the system needs to automate it. It may just require a standard python function call at the top of each python main script (see the sketch at the end of this comment).
  6. It would be nice if we could achieve some level of functionality of the "init" concept used in all older (if not all modern) seismic reflection packages. The init program for us would be a one line command to do some of the functions above. An "exec" as used, for example, in promax would be the python script to be run.
  7. In the old days processing system software had to consider a more bewildering array of decisions about the availability of resources. A job might demand 6 magnetic tape drives, for example, and access to a raster plotting device, and systems were built to do sanity checks on availability. In an industrial setting that would be necessary when someone blew the dust off a deck of cards to reprocess data from years back. Today most of that is abstracted away, but some basic things like available memory, available processors, and available disk space will still be an issue. Much of this is what spark handles, but generic resource management is a clear global management issue. This is down the list for a research system since we don't want a cumbersome resource manager getting in the way and slowing progress.
  8. There needs to be a mechanism for a sanity check on any potential graphical output. Consider the consequences if some rookie graduate student released a job that would plot a million seismograms in a workflow putting up one new plot window for each seismogram.

Those are the first things that came to mind.
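To make point 5 concrete, here is a rough sketch of what a standard call at the top of a main script could look like; the collection name, field names, and function are all made up for illustration, not an agreed design.

```python
# Hypothetical sketch of point 5: one call near the top of a job script that
# stores the script text in MongoDB and returns a unique jobid that can be
# attached to all object level history records.
import sys
from datetime import datetime, timezone
from pymongo import MongoClient

def register_job(db, jobname):
    with open(sys.argv[0]) as f:
        script_text = f.read()
    doc = {
        "jobname": jobname,
        "script": script_text,
        "submitted": datetime.now(timezone.utc),
    }
    # the ObjectId MongoDB assigns serves as the unique jobid
    return db.history_job.insert_one(doc).inserted_id

db = MongoClient("localhost", 27017).mspass
jobid = register_job(db, "usarray_decon_test")
```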

pavlis commented 3 years ago

Driving 4000 miles mostly on interstate gave me some time to think about a lot of things. One thing I thought about was how the global manager would work. I am thinking this is also related to how we define what a "processing module" or "processing function" is. The point I want to make here could have appeared in a different issue, but because it is connected to our ideas of global history I decided to put it here. I'm reducing this for now to the concepts I have in mind.

The concept we need to implement is an abstraction of what a processing module/function is and how it should behave. Wrappers for a particular algorithm can then be thought of as implementations of the base class. Here, in words, is what I see as the functionality for this concept (a rough sketch in code appears at the end of this comment):

  1. Construction should initialize an instance of an algorithm.
  2. The class should contain a loaddata method to load data that can be defined by a single data object (e.g. Seismogram or ThreeComponentEnsemble).
  3. Derived classes can add methods for other data, e.g. some deconvolution algorithms need a loadnoise method for pre-event noise.
  4. I think to work properly with spark the class will need a process method (not apply like your decon code as that does something different). process would receive input data as a required argument and return one of the supported data objects. That would work with spark map.
  5. We need a comparable method to work with a spark reduce call - I'm ignorant at this point and realize I don't exactly know what that entails.
  6. We should also allow the load/apply model used in your decon code. Unclear if that should be in the public api. In fact, on reflection maybe load should also not be in the public api.
  7. The parameters that define an instance must be immutable, i.e. once created a processing module must not be allowed to change its state. To do so would completely break the entire processing history mechanism. If the parameters change it must somehow create a new instance. This is where the global history comes in. It may be necessary for a processing module to have a hook to the global process manager so that if it changes its state it would be automatically registered with the manager. Point of discussion.

This is incomplete, but hopefully can start this discussion. Reading back I recognize what I wrote may not be very clear. Call this the opening salvo on this topic.
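To make the list above concrete, here is a rough sketch of the abstraction I have in mind. It is not a proposed MsPASS API; the class and method names are placeholders, and the toy ScaleAmplitude class is only there to show how a wrapper would implement the base class.

```python
from abc import ABC, abstractmethod
from types import MappingProxyType

class ProcessingModule(ABC):
    """Sketch of the base class concept: construction initializes the algorithm."""

    def __init__(self, **parameters):
        # store a read-only view: once created the module cannot change its state
        self._parameters = MappingProxyType(dict(parameters))

    @property
    def parameters(self):
        return self._parameters

    def loaddata(self, d):
        """Load a single data object; derived classes may add loadnoise etc."""
        self._d = d

    @abstractmethod
    def process(self, d):
        """Run the algorithm on d and return a supported data object.
        This is the method a spark map call would use."""
        ...

class ScaleAmplitude(ProcessingModule):
    """Toy implementation: scale the samples of d by a fixed factor."""

    def process(self, d):
        d.data = [x * self.parameters["factor"] for x in d.data]
        return d

# e.g. in a spark job:  rdd.map(ScaleAmplitude(factor=2.0).process)
```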

pavlis commented 3 years ago

In today's zoom call we decided to visit this issue (Global History Manager) in next week's call. I said then and emphasize again that a good way to iron out some of this between now and then is to post some dialogue for the record in the issues section. This isn't as volatile as the database api design was, so I think it can be handled here reasonably well. The pieces above are sketchy, and some of the things I said earlier I now realize were based on misconceptions of spark that I don't have any longer, although I may have others.

Anyway, I can think of two generic approaches for setting up global history (are there others?):

  1. Above I advocated for some variant of the well established init-exec model used in seismic reflection processing. In that model, init would serve two functions: (a) validate inputs to processing functions as much as possible, and (b) construct, or at least assemble the data needed to construct, an instance of an object we could design called GlobalManager or something like that.
  2. The global history could be constructed on the fly. In this model a full-fledged MsPASS processing function/module/whatever (we probably need a consistent name for the concept) would need to interact with the manager. How that interaction should happen is an implementation detail, but the idea is that whenever a function/module is invoked it would ask the manager what it should set its jobname, jobid, and algid to (presumably it knows its own name and would ask the manager for the rest of that information). In this model the manager has to keep track of the processes it is supervising (a rough sketch of this interaction appears at the end of this comment).

Are there other approaches? There is likely a literature on this topic in computer science, although it is likely hidden behind jargon.
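To illustrate approach 2, here is a minimal sketch of the interaction I described; GlobalHistoryManager, its register method, and the bandpass wrapper are all hypothetical names for discussion, not proposed code.

```python
import uuid

class GlobalHistoryManager:
    """Toy manager that hands out ids and keeps track of what it supervises."""

    def __init__(self, jobname):
        self.jobname = jobname
        self.jobid = str(uuid.uuid4())
        self._registered = {}   # (algorithm name, parameters) -> algid

    def register(self, algname, parameters):
        key = (algname, repr(sorted(parameters.items())))
        if key not in self._registered:
            self._registered[key] = str(uuid.uuid4())
        return self.jobname, self.jobid, self._registered[key]

def bandpass_wrapper(d, manager, freqmin, freqmax):
    # the function asks the manager what its jobname, jobid, and algid should be
    jobname, jobid, algid = manager.register("bandpass", {"freqmin": freqmin, "freqmax": freqmax})
    # ... call the real filter here and record jobname/jobid/algid in d's history ...
    return d

manager = GlobalHistoryManager("test_job")
```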

pavlis commented 3 years ago

A related point from above remains relevant and impacts the api for a global history manager. That is, we already have examples of two very different approaches to defining what an algorithm is:

  1. The function/subroutine process model. For defining state this approach sets the input parameters on every single call to the function. In this model the system has to know what parameters are fixed and what are changing. For a simple function like obspy's bandpass filter one has a common set of parameters defined for all data processed. What do we do, however, if someone wrote a python script that determined the parameters for some function from some other data and then set the call parameters differently? Every call could potentially then have a different instance/algid. We have to put some rules on function calls to avoid that pitfall.
  2. Both the deconvolution and the new Butterworth filter class I just created use a creation-is-initialization model and run an apply method to process a particular data object. That is a very different construct than the function model, and is actually a lot easier to manage.

This matters because they create very different constructs. Function calls always go like this:

```python
d = data_creator()
...
d = functioncall(d, param1, param2, param3)
```

while creation-is-initialization always works like this:

```python
myproc = MyProcessor(p1, p2, p3)  # constructor for MyProcessor initializes its state
...
d = data_creator()
...
myproc.apply(d)   # run the myproc algorithm on d - could also return a new object
```

The first is potentially more dynamic as noted above while the second is easier to set up with a simple history manager - it only needs to register itself on creation.
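A short sketch of that last point, assuming a manager with a register method like the toy one sketched in my previous comment; the class and names here are hypothetical.

```python
class MyProcessor:
    """Creation-is-initialization processor that registers itself once, at construction."""

    def __init__(self, p1, p2, p3, manager=None):
        self.p1, self.p2, self.p3 = p1, p2, p3
        self.algid = None
        if manager is not None:
            # single registration; the parameters never change afterward
            _, _, self.algid = manager.register("MyProcessor", {"p1": p1, "p2": p2, "p3": p3})

    def apply(self, d):
        # run the algorithm on d, tagging its history with self.algid when set
        return d
```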

pavlis commented 3 years ago

Had another thought on this issue this morning. This particular idea would build on the init-exec model from seismic reflection processing and the approach to turning parallelism on and off in OpenMP. Anyone reading this who is not familiar with OpenMP can check many online sources. Here is a particularly useful summary of the syntax.

The idea borrowed from OpenMP is to use a preprocessor to set up the global history parameters. The preprocessor instructions would be preceded by a comment mark (in python the "#" character) followed by a keyword to unambiguously separate such lines from a normal comment line. It might be clearer if I just made up a little example:

```python
#mspass algorithm=db.readwf instance=1 command="d=db.readwf(algorithm=$alg,algid=$algid)" historyoff="d=db.readwf()"
...
#mspass algorithm=filter instance=1 command="filter(d,'highpass',freq=1.0,algorithm=$alg,algid=$algid)" historyoff="filter(d,'highpass',freq=1.0)"
```

A preprocessor could just scan the job script and use the #mspass lines to define unique ids for each algorithm instance. The exact syntax above is not the point; the concept is using a preprocessor to handle the init-like creation of data for the global history manager. The init could be run in multiple modes:

  1. dryrun mode would just check the syntax of each run line. That would require any mspass processing module to handle that parameter correctly.
  2. For the specific approach above, the command string would be substituted when running with history preservation enabled.
  3. The historyoff string would be executed when running with history preservation turned off.

This model would require that init be run before any run. There might be a less dogmatic way to do this, but I hope it illustrates the point of this approach: using a preprocessor to set up the history mechanism and turn it on and off. It also addresses a related problem by providing a mechanism for a basic sanity check of the job script.
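For illustration only, a tiny init-style scanner along these lines could look like the following; the directive syntax matches the made-up example above, not an agreed convention, and myjob.py is a placeholder file name.

```python
import shlex
import uuid

def scan_job_script(path):
    """Collect #mspass directive lines and assign each instance a unique algid."""
    registry = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if line.lstrip().startswith("#mspass"):
                fields = dict(tok.split("=", 1) for tok in shlex.split(line)[1:] if "=" in tok)
                fields["algid"] = str(uuid.uuid4())
                fields["lineno"] = lineno
                registry.append(fields)
    return registry

for entry in scan_job_script("myjob.py"):
    print(entry["algorithm"], entry["algid"])
```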

wangyinz commented 3 years ago

I also think the init-exec model works better, but instead of implementing a preprocessor, why don't we just use normal class/function calls? I think the equivalent of that example could easily be replaced with similar lines that call a global history class. The problem with preprocessing is that Python is assumed to be interactive, so unless we do some sophisticated tweaks, the preprocessor won't work, for example, in a Jupyter Notebook.

Actually, another thing that struck me is that since we want to make the history part optional, we might want to consider turning ProcessingHistory into a member of TimeSeries or Seismogram. This would also resolve the namespace conflict that Jinxin recently discovered - both Metadata and ProcessingHistory have a clear method, and we currently cannot access the ProcessingHistory.clear method in Python.
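Here is a quick sketch of the composition idea and how it avoids the clear collision; the stand-in classes below are only illustrative (a plain dict stands in for Metadata).

```python
class ProcessingHistory:
    def __init__(self):
        self.nodes = []

    def clear(self):
        self.nodes = []

class TimeSeries(dict):          # dict stands in for Metadata here
    def __init__(self):
        super().__init__()
        self.history = None      # only created if history preservation is on

    def enable_history(self):
        self.history = ProcessingHistory()

ts = TimeSeries()
ts.enable_history()
ts.clear()            # the Metadata-style clear
ts.history.clear()    # the history clear - no more name collision
```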

wangyinz commented 3 years ago

btw, it is true that there are a lot of papers about data provenance if you just do a google search. A related higher level concept is called data lineage. I have not dived into that field before, but my understanding is that we are not really trying to achieve all the fancy functionality that a lot of these papers discuss (for example, people use blockchain to ensure the provenance info is accurate). Maybe we should start with a literature review, or we can look at some of the implementations out there and see if there is an existing package that we can leverage. For example, a quick search led me to this package that seems interesting.

wangyinz commented 3 years ago

I looked into the provenance package and found that it is pretty much a generalized version of our object-level data provenance plus a global-level provenance managed with an object store. I think our global-level manager can have a similar API to that package, where we can use decorators to extract information like function name, process id, and host. Instead of letting users define a store for the provenance info, we would just implicitly push it to MongoDB. For the "creation is initialization" algorithms, I guess we can have a slightly different decorator that applies to class methods. It could then pull the info from the class on how the object was initialized.
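A rough sketch of what such a decorator could look like, loosely modeled on the provenance package but pushing records straight to MongoDB; the collection and field names are made up, and a real version would also attach the jobid/algid.

```python
import functools
import os
import socket
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("localhost", 27017).mspass

def mspass_provenance(func):
    """Record function name, process id, host, and parameters for each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {
            "function": func.__name__,
            "pid": os.getpid(),
            "host": socket.gethostname(),
            "parameters": repr(kwargs),
            "started": datetime.now(timezone.utc),
        }
        result = func(*args, **kwargs)
        db.global_history.insert_one(record)
        return result
    return wrapper

@mspass_provenance
def detrend(d, type="linear"):   # placeholder processing function
    return d
```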

JiaoMaWHU commented 3 years ago

I agree that we don't even need to use the provenance package; we can write our own decorator pretty easily. I also found a useful package for logging to Mongo: https://pypi.org/project/log4mongo/. In the decorator, we just need to log the information to Mongo. But the user still needs to specify the job name or another attribute that we can use as the key.
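Along those lines, a minimal sketch using log4mongo through the standard logging module, with a user supplied job name as the key; the handler arguments follow my reading of the log4mongo docs and the field choices are placeholders.

```python
import functools
import logging
from log4mongo.handlers import MongoHandler

logger = logging.getLogger("mspass.global_history")
logger.setLevel(logging.INFO)
logger.addHandler(MongoHandler(host="localhost", database_name="mspass", collection="global_history"))

def log_to_mongo(jobname):
    """Decorator factory: the user supplies the job name used as the key."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.info("job=%s function=%s kwargs=%s", jobname, func.__name__, kwargs)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@log_to_mongo("test_job")
def demean(d):                   # placeholder processing function
    return d
```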