Action to destination schedule dataflow unsafe and difficult to use

hmh commented 5 years ago

The current data flow from an action (in schedule A) to destination(s) in schedule B... is:

If an action returns non-zero, its results are destroyed
If an action returns zero and has destinations, its output is moved to the "base directory" of each destination schedule, regardless of whether that schedule is already running.
No data life-cycle management is implemented for data that an action has already consumed.

This really gets in the way of implementing resilient reporting of results to a collector. The desired report flow is: render the report, optionally compress it, and if this fails (e.g. storage full), don't remove the source data. After the report is rendered successfully, remove only the source data that is present in that report, and queue the report for transmission. The report is only removed when the transmission succeeds. The render+transmit steps may be implemented atomically, and then it becomes "the source data present in the report must only be removed when the transmission succeeds".
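
The desired report flow could be sketched roughly as follows. This is only an illustration in Python, not lmapd code: the function `report_cycle`, the `transmit` callback, the `name` parameter and the gzip compression are all assumptions made for the example.

```python
import gzip
import os
from pathlib import Path
from typing import Callable, List


def report_cycle(sources: List[Path], outbox: Path, name: str,
                 transmit: Callable[[Path], bool]) -> bool:
    """Render, queue and transmit one report (hypothetical helper).

    Returns True only when the report was transmitted and removed.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    tmp = outbox / ("." + name + ".tmp")
    report = outbox / name
    try:
        # Render and compress into a temporary file first.
        with gzip.open(tmp, "wb") as out:
            for src in sources:
                out.write(src.read_bytes())
        # Publish the finished report in one step.
        os.replace(tmp, report)
    except OSError:
        # Rendering failed (e.g. storage full): keep all source data.
        tmp.unlink(missing_ok=True)
        return False

    # The report is complete: remove only the data that went into it.
    for src in sources:
        src.unlink()

    # The queued report is removed only when transmission succeeds.
    if transmit(report):
        report.unlink()
        return True
    return False
```

If `transmit` fails, the report stays in `outbox` and can be retried later, while the source data it covers is already gone. The atomic render+transmit variant would instead delay the `src.unlink()` loop until `transmit` returns success.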

Proposed action -> schedule data-flow:

Action output is sent to an "incoming" queue on each schedule (it can continue to be the base directory of the schedule).
Data can arrive at the "incoming" queue of a schedule at any time, including while that schedule is already running. Either advisory locking or "write then rename" strategies must be used to ensure no "live" pair of (data, meta) files exists. This is already true as implemented.
When the runner starts processing a schedule, it moves every pair of (meta, data) files to a "processing" queue of that schedule.
Actions must consume data only from the "processing" queue of their schedule.
If, and only if, every action of a schedule returns a zero status, the "processing" queue is emptied by the runner when the schedule finishes running. If any action returns a non-zero status, or no action exists, the "processing" queue is left alone (i.e. it accumulates data).
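
A minimal sketch of the queue handling proposed above, written in Python for brevity rather than in lmapd's own code. The function names, the `.data`/`.meta` suffixes and the use of separate directories for the two queues are assumptions made for the illustration. It uses the "write then rename" strategy and publishes the meta file last, so a pair is treated as complete only once its meta file is visible.

```python
import os
import tempfile
from pathlib import Path
from typing import Sequence


def deliver(incoming: Path, name: str, data: bytes, meta: bytes) -> None:
    """Place a (data, meta) pair in an incoming queue via write-then-rename."""
    incoming.mkdir(parents=True, exist_ok=True)
    # Data is published first and meta last: a visible meta file
    # therefore implies that the matching data file is complete.
    for suffix, payload in ((".data", data), (".meta", meta)):
        fd, tmp = tempfile.mkstemp(dir=incoming, prefix=".tmp-")
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, incoming / (name + suffix))


def start_schedule(incoming: Path, processing: Path) -> int:
    """Move every complete pair to the processing queue; return the count."""
    processing.mkdir(parents=True, exist_ok=True)
    moved = 0
    for meta in sorted(incoming.glob("*.meta")):
        data = meta.with_suffix(".data")
        if not data.exists():
            continue  # not a complete pair, leave it alone
        os.replace(data, processing / data.name)
        os.replace(meta, processing / meta.name)
        moved += 1
    return moved


def finish_schedule(processing: Path, statuses: Sequence[int]) -> bool:
    """Empty the processing queue if, and only if, every action returned zero."""
    if not statuses or any(status != 0 for status in statuses):
        return False  # failure or no actions: the queue accumulates data
    for entry in processing.iterdir():
        entry.unlink()
    return True
```

Data that arrives while the schedule is running lands in `incoming` and is picked up by the next `start_schedule` call, so actions only ever see the stable contents of `processing`.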

The SIMET team will implement the data flow proposed above, and it will (as usual) be available in our fork for merging upstream. If we find any problems with the proposed solution, we will edit this issue report accordingly.

hmh commented 5 years ago

The implementation is ready. It is undergoing testing, and we will submit it as a PR soon (it doesn't depend on any pending PRs).

hmh commented 5 years ago

One border condition: if a schedule has no actions, it currently will accumulate input in its storage (because it doesn't actually "run"). We have not changed that behavior in our soon-to-be-submitted PR; we basically considered it undefined for now.

hmh commented 5 years ago

On sequential schedules, it may be useful to allow one action to act on the output of a previous action (sort of like the still-unimplemented pipelined mode, but not really).

We have special-cased an action that lists its own schedule as its output destination, so that the output is placed directly into that schedule's processing queue and is thus immediately available to the next action.

Without the special casing, the output would be directed to the incoming queue of the schedule, and would be available only on the next execution of the whole schedule.
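
The routing rule can be illustrated with a small Python sketch. The function name `output_queue` and the `<base>/<schedule>/<queue>` directory layout are hypothetical and are not taken from the actual implementation.

```python
from pathlib import Path


def output_queue(base: Path, source_schedule: str, destination: str) -> Path:
    """Pick the queue that receives an action's output (assumed layout)."""
    # Special case: an action that lists its own schedule as destination
    # writes straight into that schedule's processing queue.
    if destination == source_schedule:
        return base / destination / "processing"
    # Otherwise the output waits in the destination's incoming queue
    # until that schedule is next executed.
    return base / destination / "incoming"
```

For example, an action in schedule `A` with destinations `A` and `B` would write to `A/processing` and `B/incoming`.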

schoenw / lmapd

Action to destination schedule dataflow unsafe and difficult to use #13