spacepy / dbprocessing

Automated processing controller for heliophysics data

Finer-grained process queue handling #110

Open jtniehof opened 2 years ago

jtniehof commented 2 years ago

Right now ProcessQueue -p removes all files from the process queue, calculates all possible things to run for them, and then executes them all. There are a few problems with this: a crash partway through loses track of all the remaining work, and there is no way to stop and restart processing cleanly.

So I'd like to have ProcessQueue -p handle the files on the queue in a "chunked" fashion.

Proposed enhancement

This is going to involve interleaving the calculation of the runMe objects and their execution, whereas before they have been pretty separate. So I'm guessing there needs to be a good round of code cleanup and testing in there. We probably also need some way to maintain serialization.

Right now I believe we pull everything off the queue at once to check for redundant processing--if we could do a quick skim for redundancy first, and then pull one item at a time off the queue, that would make life a lot better. The current flow (sketched in code after this list) is:

  1. Retrieve all file_ids from the queue, emptying it
  2. Calculate all command lines (including some sort of redundancy check, will need to look into it)
  3. While command lines remain to execute:
     a. Start commands up to some process limit
     b. Wait for commands to terminate
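
For concreteness, here is a minimal Python sketch of that all-at-once behavior. The helper names (get_all_from_queue, build_runmes, start_runme) are hypothetical stand-ins, not dbprocessing API:

```python
def batch_process(get_all_from_queue, build_runmes, start_runme, limit=4):
    """All-at-once flow: empty the queue, build everything, then run."""
    file_ids = get_all_from_queue()        # step 1: empties the DB queue
    commands = []
    for fid in file_ids:                   # step 2: calculate every command
        commands.extend(build_runmes(fid))
    running = []
    while commands or running:             # step 3: execute with a cap
        while commands and len(running) < limit:
            running.append(start_runme(commands.pop(0)))
        if running:
            running[0].wait()              # block until one child exits
            running = [p for p in running if p.poll() is None]
```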

The new proposal (with a Python sketch following the pseudocode) is:

Perform a redundancy check/sweep of the process queue
While the process queue is not empty
    Pop one item from the database process queue (potentially move to "currently processing" queue, another table)
    Calculate its possible outputs and command lines (runMe objects), add to Python queue of runMe
    While there are runMe objects
        While the process limit is not reached
            Pop one from the runMe queue and start it
        Wait for a process to terminate
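
A minimal Python sketch of that chunked loop, again with hypothetical helpers (pop_one, build_runmes, start_runme) standing in for the real dbprocessing internals:

```python
import collections

PROCESS_LIMIT = 4  # assumed cap; the real limit would be configurable

def chunked_process(pop_one, build_runmes, start_runme):
    """Drain the database process queue one entry at a time.

    pop_one() -> a file id, or None once the queue is empty.
    build_runmes(file_id) -> list of runMe-like objects for that file.
    start_runme(runme) -> a Popen-like handle with wait()/poll().
    """
    pending = collections.deque()   # runMe objects not yet started
    running = []                    # handles currently executing
    while True:
        file_id = pop_one()
        if file_id is None and not pending and not running:
            break                   # queue drained and all work finished
        if file_id is not None:
            pending.extend(build_runmes(file_id))
        # Start work up to the process limit.
        while pending and len(running) < PROCESS_LIMIT:
            running.append(start_runme(pending.popleft()))
        # Only block when we cannot usefully start anything else.
        if running and (len(running) >= PROCESS_LIMIT or file_id is None):
            running[0].wait()       # wait for (at least) one to finish
            running = [p for p in running if p.poll() is None]
```

The key difference from the batch version is that the database queue shrinks as work progresses, so a crash or deliberate stop only loses the handful of entries currently in flight.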

This does wind up in a fairly complicated producer-consumer relationship, because an entry in the process queue potentially maps to multiple runMe objects (it may be an input for several processes). The "currently processing" queue is tempting because it would help with crash recovery, but it would require recording the relationship between runMe objects and queue entries, so we know when an entry from the queue is complete (and can be removed). That bookkeeping isn't shown in the pseudocode above.
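
A hedged sketch of that bookkeeping: count the outstanding runMe objects per queue entry, and only delete the entry from the hypothetical "currently processing" table once the count reaches zero. remove_entry is an assumed callback, not existing dbprocessing API:

```python
import collections

class CurrentlyProcessing(object):
    """Track when every runMe spawned by a queue entry has finished."""
    def __init__(self, remove_entry):
        # remove_entry(file_id): delete the row from the tracking table.
        self._outstanding = collections.Counter()
        self._remove_entry = remove_entry

    def add(self, file_id, n_runmes):
        # Called when a queue entry is expanded into n_runmes commands.
        self._outstanding[file_id] += n_runmes

    def done(self, file_id):
        # Called as each runMe for this entry completes.
        self._outstanding[file_id] -= 1
        if not self._outstanding[file_id]:
            del self._outstanding[file_id]
            self._remove_entry(file_id)  # fully processed; safe to remove
```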

Alternatives

Speeding up the calculation of outputs would help in some respects, but not with the crash-recovery question or the ability to do some level of start/stop.

We could also update the process queue to contain not files to process, but all their possible outputs, moving the calculation step to when a file is created instead of when it is processed. This might cause consistency problems, since adding new files would change the equation for files that have already been added. But perhaps not: ultimately the file is only used to calculate "process plus UTC file day", and storing that seems reasonable, as well as a very easy thing to check for duplicates (pull the first item off the queue, see if the same process-plus-day combination exists later, and if so just throw the first one away; sketched below). This would remove the one-to-many problem of process queue entries to runMe objects.
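
A sketch of that duplicate check, assuming the queue holds (process_id, utc_day) pairs; keeping only the last occurrence of each pair implements "throw the first one away":

```python
def dedupe_queue(entries):
    """entries: ordered list of (process_id, utc_day) tuples."""
    seen = set()
    kept = []
    for entry in reversed(entries):   # walk back to front
        if entry not in seen:         # last occurrence wins
            seen.add(entry)
            kept.append(entry)
    kept.reverse()                    # restore original ordering
    return kept

# e.g. dedupe_queue([(1, '2012-01-01'), (2, '2012-01-01'), (1, '2012-01-01')])
# -> [(2, '2012-01-01'), (1, '2012-01-01')]
```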

It would probably help this analysis to figure out how much processing goes into calculating possible outputs (what days and products can be made from a file) vs. checking up-to-dateness (does this need to be run?) vs. calculating command lines. If calculating the command line is the slow part, then there could be a big win in moving it to the parallel portion--i.e., building the runMe very late in the game. And if the up-to-date check is fast, then having the same date/process combination on the queue multiple times isn't a big loss.
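
A rough timing harness for getting those numbers; the three stage functions in the commented usage are hypothetical names for the steps described above, not real dbprocessing calls:

```python
import time

def time_stage(label, func, *args):
    """Run one stage and print how long it took."""
    start = time.time()
    result = func(*args)
    print('%-20s %.3f s' % (label, time.time() - start))
    return result

# Assumed usage, with stand-in stage functions:
# outputs = time_stage('possible outputs', calc_possible_outputs, file_id)
# needed = time_stage('up-to-date check', check_up_to_date, outputs)
# cmds = time_stage('command lines', build_command_lines, needed)
```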

OS, Python version, and dependency version information:

Linux-4.15.0-163-generic-x86_64-with-Ubuntu-18.04-bionic
sys.version_info(major=2, minor=7, micro=17, releaselevel='final', serial=0)
sqlalchemy=1.1.11

Version of dbprocessing

Current master

Closure condition

Obviously this enhancement will be updated as I do research and design, but closure would be when we have a commit that pulls from the process queue in bite-sized pieces.