mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Exception handling design for mspass #11

Open - pavlis opened this issue 5 years ago

pavlis commented 5 years ago

This emerged from ruminations on this topic by me (Pavlis), followed by the question to Ian about how Spark handles a process that calls exit. His answer was that the entire job terminates - a bad thing for a massive job. Hence, we need a way for a program to handle most errors without calling exit. Part of that will be a coding standard, but for now the topic is the API we should use to implement that approach.

Some things known to not work:

  1. Simple error logging to stderr. The iostreams library is not thread safe and as a result multiple processes writing to stderr will frequently produce jumbled output.
  2. The same happens when logging to a file, since in unix file io is no different from output to stderr.
  3. We might consider having every process write its own log file, but that could become a bookkeeping mess with a confusing problem of how to find the right log file.

I think the clear solution for mspass is to make use of MongoDB. That is, all processes should have an error log connection, and the system should automatically save all log messages of varying severity in a MongoDB collection. I think the concept maps easily to having documents in the collection that contain, at minimum, this set of attributes:

  * jobid
  * processid
  * objectid - the ObjectID of the data object being handled that created the log entry
  * algorithm - the algorithm that was running on the data
  * severity - describes how bad the problem was; Antelope's elog provides a reasonable list. They use: log, notify, debug, complain, die.
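
To make that concrete, a single log message would become one small document in the collection. With the MongoDB C++ driver, writing one entry might look roughly like this (the database, collection, and field values are only placeholders to react to, not a proposed API):

    #include <bsoncxx/builder/basic/document.hpp>
    #include <bsoncxx/builder/basic/kvp.hpp>
    #include <mongocxx/client.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/uri.hpp>

    int main() {
      using bsoncxx::builder::basic::kvp;
      using bsoncxx::builder::basic::make_document;

      mongocxx::instance inst{};   // one driver instance per process
      mongocxx::client conn{mongocxx::uri{"mongodb://localhost:27017"}};
      auto logs = conn["mspass"]["logs"];   // placeholder database/collection names

      // One log message == one document; many workers can insert concurrently.
      logs.insert_one(make_document(
          kvp("jobid", "job1234"),
          kvp("processid", "worker-07"),
          kvp("objectid", "<ObjectID of the datum being processed>"),
          kvp("algorithm", "bandpass_filter"),
          kvp("severity", "complain"),
          kvp("message", "sample rate mismatch; trace was resampled")));
      return 0;
    }

Because each message is its own document, any number of workers can write at the same time without the jumbled-output problem stderr has.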

We might consider two forms of die: zombify and abort, but that might create more problems than it solves. zombify would make a process stop running but wait for other workers to exit. Not sure that is smart. Probably cleaner to just reserve die for unrecoverable system problems like a malloc failure. Worth a serious discussion.

The secondary thing this solves is the verbose error log problem that plagues systems like Antelope. There have been many instances of issues that arose because an important program like their dbverify writes so much output that it is very difficult to separate the wheat from the chaff. Storing each error message as a separate document should solve this problem and create a thread safe log system. I think we may have mentioned this use of mongo in the proposal, but it is pretty clear to me that this is the solution. Do we agree?

Now a key question is how programs should be designed to handle exceptions in mspass. I think this has two elements:

  1. The base exception class thrown by C++ and Python algorithms should have the error severity contained in the exception object for retrieval (a rough sketch is below). Then error handlers will be able to handle different levels of errors differently.
  2. We need to either design an error logging object that any process running in the system would create, or design a procedural set of functions like Antelope's elog. The latter would make it easier for plain C programmers, but likely make the capability more limited.

The first is trivial, the second is a major task. I think, however, that if we are clever it could be abstracted to be little different from adding new data to the database. An error message is just another data object.
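
For the first point, something as simple as the following would do. This is only a sketch - the class name is a placeholder, and the severity could just as well be an enum as a string:

    #include <stdexcept>
    #include <string>

    // Sketch of a base exception that carries a severity an error handler can query.
    class MsPASSError : public std::runtime_error {
    public:
      MsPASSError(const std::string& message, const std::string& severity)
        : std::runtime_error(message), severity_(severity) {}
      const std::string& severity() const { return severity_; }
    private:
      std::string severity_;   // e.g. "log", "notify", "complain", "die"
    };

A handler wrapping an algorithm could then catch this, inspect severity(), and decide whether to simply mark the datum bad and log the message or treat the problem as unrecoverable - without ever calling exit.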

Before I try to be more concrete about an API I'd like to get some feedback on this. I realize it is a lot to digest, but it is a very important issue.

wangyinz commented 5 years ago

That is definitely a good point, and error handling is indeed a tricky issue for large scale applications. I mostly agree that using a database is a much better solution. It seems to me that you are trying to record every single line of a log as a MongoDB document. If that is the design, then I can imagine the performance will suffer, as the delay of writing to a database is orders of magnitude greater than printing to stderr. A better design is probably having each log file (that has multiple lines) as the document. Then we need to deal with the fact that there is a 16 MB limit on each BSON document, so we may need to put the log files in GridFS instead. This way, the logs will be recorded to a log file on a local drive, and after the job is done, it will be sent to MongoDB. The severity of the log will then be determined by the last line of the log file.
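
For example, shipping a finished local log file to GridFS at the end of a job could be as simple as something like this (an untested sketch with the mongocxx driver; the URI, database name, and file path are placeholders):

    #include <fstream>
    #include <mongocxx/client.hpp>
    #include <mongocxx/gridfs/bucket.hpp>
    #include <mongocxx/instance.hpp>
    #include <mongocxx/uri.hpp>

    int main() {
      mongocxx::instance inst{};
      mongocxx::client conn{mongocxx::uri{"mongodb://localhost:27017"}};
      auto bucket = conn["mspass"].gridfs_bucket();   // default GridFS bucket

      // The log was written to local disk during the run; upload it afterwards.
      std::ifstream log{"/tmp/job1234.log", std::ios::binary};
      bucket.upload_from_stream("job1234.log", &log);
      return 0;
    }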

Also, if you run any code through Spark, its stdout and stderr will be redirected to a log file created by Spark. So, even if we don't have an elegant error logging mechanism right now, there won't be any thread safety issues, and the logs are still tractable.

pavlis commented 5 years ago

Very good point about performance with a (potentially large) number of db transactions.

This meshes with a different idea I had that could be treated either as an enforced standard or just as guidance.

The idea is to define a basic error log message structure. We could start with a base class that would have these items: jobid, processid, severity, algorithm, message. We could add extra stuff as needed to the base class and let inheritance work out the extra debris if needed. The idea is to have the data objects being processed inherit an object we might call error_log. The model would then be that when errors of any kind are thrown, the results are posted to the error_log object's data. error_log (or whatever we choose to call it) would also have getters and setters to define the state of the data. A basic list would be something like: ABORT, BAD, QUESTIONABLE, OK - maybe others. ABORT is of questionable merit, but could be used to say this process cannot continue while others might be able to soldier on. A rough sketch of what I have in mind is below.
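
Here is roughly what I am picturing - just a sketch to react to, and none of the names are meant to be final:

    #include <list>
    #include <string>

    // Possible processing states a data object can be left in.
    enum class ProcessingStatus { ABORT, BAD, QUESTIONABLE, OK };

    // One message in the log; mirrors the fields of a document in the log collection.
    struct LogMessage {
      std::string jobid;
      std::string processid;
      std::string severity;
      std::string algorithm;
      std::string message;
    };

    // Base class the data objects being processed would inherit from.
    // Algorithms post problems here instead of writing to stderr or calling exit.
    class error_log {
    public:
      void post(const LogMessage& m) { entries_.push_back(m); }
      void set_status(ProcessingStatus s) { status_ = s; }
      ProcessingStatus status() const { return status_; }
      bool empty() const { return entries_.empty(); }
      const std::list<LogMessage>& entries() const { return entries_; }
    private:
      std::list<LogMessage> entries_;
      ProcessingStatus status_ = ProcessingStatus::OK;
    };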

In this model, when a user asks for data to be either saved or updated, the writer would always test the data status and handle anything but OK as needed. Further, any time the log was not empty, a write would initiate saving the errors to the "logs" collection (or whatever we choose to call it). That shouldn't create a size problem, as any log exceeding 16 MB indicates an improper logging method or an error that should abort the processing anyway - my opinion anyway. dbverify is an example where the size of the log is large, but if it were done in this framework it would be manageable. dbverify runs a series of tests, and a similar entity for mspass would presumably run a series of tests, each of which would generate its own log.
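
In code, the writer's check might look something like this (building directly on the class sketch in my previous comment; the "logs" collection name and function name are placeholders):

    #include <bsoncxx/builder/basic/document.hpp>
    #include <bsoncxx/builder/basic/kvp.hpp>
    #include <mongocxx/collection.hpp>
    #include <mongocxx/database.hpp>

    // Reuses the error_log / LogMessage / ProcessingStatus sketch from my
    // previous comment.
    void save_elog(const error_log& elog, mongocxx::database& db) {
      using bsoncxx::builder::basic::kvp;
      using bsoncxx::builder::basic::make_document;
      if (elog.status() != ProcessingStatus::OK) {
        // a real writer would mark the datum dead, skip it, or abort here
      }
      if (elog.empty()) return;              // nothing to record for this datum
      auto logs = db["logs"];
      for (const auto& m : elog.entries()) { // one document per message
        logs.insert_one(make_document(
            kvp("jobid", m.jobid), kvp("processid", m.processid),
            kvp("severity", m.severity), kvp("algorithm", m.algorithm),
            kvp("message", m.message)));
      }
    }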

Raises a point by the way - we will likely need to develop a set of applications to verify the integrity of a data set stored in mongodb to assure required attributes are available for all data to be processed. Way downstream from here though.

wangyinz commented 5 years ago

I think that is a very neat design. It actually echoes the provenance-aware feature advertised in the proposal. The logs are pretty much additional metadata attached to each procedure, so it makes sense to have everything else inherit from it. In addition to the items you listed (jobid, processid, severity, algorithm, message), we might also add a parameter field that records the input parameters to the algorithm. We should probably just call this object log instead of error_log then.

pavlis commented 5 years ago

You are right that logging input parameters is very important, but I think they would be better logged to a different collection called something like input_parameters.

I'm starting on the error object part of this. Ran across an interesting C++ construct in C++11 and higher:

    struct DataError {
      typedef enum { Fatal, Invalid, Suspect, Complaint, Debug, Informational } Severity;
    };

You can declare one of these as:

    DataError::Severity x;

Then you can use x in a switch like this:

    switch (x) {
      case DataError::Fatal:
      case DataError::Invalid:
        // handle these severities
        break;
      // etc.
    }

Useful, as otherwise an enum with readable names can create name collisions very easily; wrapping it in the struct scopes the enumerator names.