MonsieurNicolas closed this 5 years ago
I think that some of the implications of this are to ensure that advance is the place that centralizes decisions on state transitions, and that "run" would create more work if needed.
I did a bit of investigation into this, looking for patterns in how the code is trying to use Work, so here are some thoughts:
First of all, just for clarity, I'm adding Work state machine represented as a diagram. Below are allowed state transitions we currently have:
Currently, the decision of state transition is in two places: in advance and in onSuccess. It seems like onSuccess was designed to manipulate state, and implementors heavily utilize that: for example, ApplyBucketsWork goes back and forth between "RUNNING" and "PENDING" states (depending on whether it's done applying a level or not). So I don't know if it makes sense to centralize decisions in advance only.
The WorkParent-Work relationship is not very intuitive to me; specifically, its separation of responsibilities. I feel like there are two different notions here: deciding on state transitions (in onSuccess, or advance; with no dependencies, advance is trivial), and managing children (where advance has some rules).
To illustrate this point, I'll go back to what Work is trying to accomplish, from the code:
While 1 and 2 look to me like direct responsibilities of `WorkParent`, 3 and 4 are managed by `Work`. Note that some works in our code don't even need all of this. [ApplyLedgerChainWork](https://github.com/stellar/stellar-core/blob/06c5bac64f9865a707b79038141db52ad1db4f33/src/catchup/ApplyLedgerChainWork.cpp) is a good example: all it needs is a guarantee of 3 and 4.
Our codebase also has some patterns:
- Actual work is done in onStart, while onRun is overridden to be empty, e.g. VerifyBucketWork or RunCommandWork. Is there a particular reason to do it like that?
- onSuccess enforces the order of work executed. PublishWork illustrates this: it first adds a child ResolveSnapshotWork and goes into PENDING mode until that Work is done. In a similar fashion, it does WriteSnapshotWork and PutSnapshotFilesWork. This same pattern can also be found in CatchupWork and FetchRecentQsetsWork.
- addWork is used liberally pretty much everywhere. Work is added in onReset (BatchDownloadWork), in onSuccess (described above), or in both (GetAndUnzipRemoteFileWork).
Because of the resulting flexibility, it became really difficult to track each implementer. Additionally, changing anything in the Work interface becomes dangerous, because it's really hard to assess how the change affects each variation of work. So some questions that come to mind: Are we over-using the Work interface? Is the Work interface trying to achieve too much? Should we define a better "work-work parent" relationship?

Let's use ApplyBucketsWork as a way to discuss what is going on here for the RUNNING state, as it's actually a fairly simple Work that doesn't spawn children.
What this does:
- onReset resets the overall Work (this seems to be the right thing to do)
- onStart initializes a few members, preparing to apply the "current level"
- onRun is used as a way to execute a unit of work, scheduling "success" at the end
- onSuccess schedules run (if the current level is not done), start (if the current bucket is done and it's not done with all levels), or nothing (i.e. it's done)

The problem I see here is that it's actually abusing the way Work deals with state transitions: there is no reason to have the outer loop (go over levels) and the inner loop (apply buckets for a level) be implemented in different ways; those should be implementation details of how ApplyBucketsWork manages what to do while it's "running".
We need to clearly define what each state is supposed to be. Right now I don't know what WORK_PENDING really means vs WORK_RUNNING, and what an implementor is supposed to really do in the callbacks (which should allow us to have proper separation of concerns and centralize different aspects of scheduling/processing).
Per my comment for a simple work, I think that the only difference between WORK_PENDING and WORK_RUNNING is that onStart gets called (but right now it's used for other purposes in advance).
An example of how we can force people to implement things properly: for onReset, we can call it before any important transition, including when we tear down Work. If we do this right, I think we can stop calling clearChildren in all destructors (and instead assert in Work::~Work if there are children still present).
I think that we can make working with children a lot cleaner with some changes to the state definitions.
I think that we could make everything work fine by just using WORK_RUNNING even when we're waiting for all children to complete (notify would not be used), but this would be inefficient (a busy loop), as we would be constantly scheduling work to check if children are done.
So, instead, I think there should be a way for onRun to tell the executor that it wants to wait for a child to complete before being scheduled again (maybe introduce a new state WORK_WAITING for this? it would also integrate neatly with event-based work like RunCommand).
I don't know if we need anything else than that (like we don't need a "wait for all children to complete" as done in the current implementation of advance): it's very possible that it instead gets replaced by a helper function that gets called at the beginning of onRun by implementors if they really want to wait for all children (but to your point, some people want to run things in parallel, some want to sequence, etc.). There might be a need for another helper function to help fail if any child failed fatally, for example.
The pattern would be something like:
ExecutionStatus MyWork::onRun()
{
    // helper function returns:
    // DONE: all children done, no errors
    // RUNNING: some children are still running
    // FAILURE_*: some children failed
    auto status = checkAllChildrenDone();
    if (status == DONE)
    {
        // do something else, then return the resulting status
    }
    else
    {
        return status;
    }
}
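To make the aggregation concrete, here is a minimal, self-contained sketch of what a checkAllChildrenDone-style helper could do. The ExecutionStatus values and the vector-of-states representation are simplifications for illustration, not the actual stellar-core types: any failed child fails the aggregate, any running child keeps it RUNNING, otherwise everything is DONE.

```cpp
#include <cassert>
#include <vector>

// Hypothetical status values, mirroring the states discussed in the thread.
enum class ExecutionStatus
{
    DONE,
    RUNNING,
    FAILURE_RAISE
};

// Sketch of the helper from the pattern above: aggregate child statuses.
ExecutionStatus
checkAllChildrenDone(std::vector<ExecutionStatus> const& children)
{
    bool anyRunning = false;
    for (auto s : children)
    {
        if (s == ExecutionStatus::FAILURE_RAISE)
        {
            // any failed child fails the aggregate immediately
            return ExecutionStatus::FAILURE_RAISE;
        }
        if (s == ExecutionStatus::RUNNING)
        {
            anyRunning = true;
        }
    }
    return anyRunning ? ExecutionStatus::RUNNING : ExecutionStatus::DONE;
}
```

A real helper would walk the actual child Work objects rather than a vector of statuses, but the decision table is the same.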
With this, I think notify is just doing something like scheduleRun of the parent (forcing decisions to be made by run).
I agree the separation of concerns is not great on this one.
I am not sure what the original intent of WorkParent was: it doesn't perform any work itself (unlike Work), yet it seems to be the class that deals with a "hierarchy" of work.

> The problem I see here is that it's actually abusing the way Work deals with state transition: there is no reason to have the outer loop (go over levels) and the inner loop (apply buckets for a level) be implemented in different ways: those should be implementation details on how ApplyBucketsWork manages what to do while it's "running".
This makes me wonder: why do we even allow work to put itself back into the PENDING state (on the graph, that would be the running->pending transition)? Sure, we could change the current code to not abuse this transition, but it'll be hard to force future implementors to follow suit. It also seems like this state transition allowed for all kinds of different hacks in the code. Maybe Work shouldn't be allowed to re-start itself? Sounds like the rule of thumb should be "once you start work, you either stay in the running state, or transition into a completed {success or failure} state", so there's no way back.
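The proposed rule of thumb can be captured in a small table of allowed transitions. This is a hypothetical sketch with simplified state names, not the actual stellar-core enum; its only point is that RUNNING -> PENDING is banned while RUNNING -> {RUNNING, SUCCESS, FAILURE} remains allowed.

```cpp
#include <cassert>

// Simplified states; PENDING kept here only to show the transition being banned.
enum class WorkState
{
    PENDING,
    RUNNING,
    SUCCESS,
    FAILURE
};

// Sketch of the proposed rule: once RUNNING, a work may only stay RUNNING or
// move to a completed state; there is no way back to PENDING.
bool
isAllowedTransition(WorkState from, WorkState to)
{
    switch (from)
    {
    case WorkState::PENDING:
        return to == WorkState::RUNNING;
    case WorkState::RUNNING:
        return to == WorkState::RUNNING || to == WorkState::SUCCESS ||
               to == WorkState::FAILURE;
    default:
        // completed states are terminal (retries not modeled in this sketch)
        return false;
    }
}
```

Such a predicate could live in a single setState and turn every hacky transition into an assertion failure during development.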
> Per my comment for a simple work, I think that the only difference between WORK_PENDING and WORK_RUNNING is that onStart gets called (but right now it's used for other purpose in advance).
Yes, that is my understanding as well. Though onStart itself is confusing, because implementors do all kinds of things in onStart (including actual work -- I'm still not sure what the rationale behind that is).
> for onReset, we can call it before any important transition, including when we tear down Work, if we do this right, I think we can stop calling clearChildren in all destructors (and instead assert in Work::~Work if there are children still present).
Are you assuming here the guarantee we discussed before, that core won't shut down until all work is in a completed state? Can we also define the "important transitions" you are talking about? Transitioning into a completion state and retrying might be good candidates. Not sure if calling onReset would still make sense when we add new work.
Let's try to define clear guidelines on which callbacks should be used for what:
- onStart - initialize necessary variables, prepare Work for execution.
- onRun - do actual work, which might include adding children and starting to wait for them.
- onSuccess - transition into a failure or success state, or schedule the next chunk of work to run. Is onSuccess the right place to return the new state? Perhaps it could be onRun?
- onReset - reset Work to the initial state (should be the same as when we first added the work)
> So, instead, I think there should be a way for onRun to tell the executor that they want to wait for a child to complete before being scheduled again (maybe introduce a new state WORK_WAITING for this?
I don't know if I agree on this one. This sounds to me like:
- mChildren in WorkParent are of type Work and not WorkParent (tree structure)

So maybe Work should be the base class, and not know about children. Instead, it would focus on proper state transitions. Meanwhile, WorkParent manages children while also utilizing Work's functionality.
I agree that it may make sense to ditch the onStart callback.
I also think that nobody should be calling setState outside of the Work class.
> Maybe Work shouldn't be allowed to re-start itself?

That's a good point: I think the only reason one needs to "restart itself" is to wait for some outside event (and for that, a callback with, for example, a timer should be used); any other restart should be managed as a failure (which kicks off a restart).
> Not sure if calling onReset would still make sense when we add new work.

onReset would only be called when starting a new work or when tearing down work (failure/abort or shutdown).
> onSuccess - transition into failure or success state, or schedule next chunk of work to run. Is onSuccess the right place to return new state? Perhaps it could be onRun?

Mmm, yeah, onSuccess should not be called when running. I think the problem right now is that it's called when children complete. I think it's much easier to turn onRun into the central place where implementors decide what to do next during execution.
I think onSuccess, to your point, should only be called when Work transitions to SUCCESS (to unblock other work, for example) and should not be allowed to change state. The reason it's nice to have it as its own callback is that it clearly separates code that should happen when done (same with onFailure).
Also, if we do this, then onRun is the only one returning the desired new state.
> Executor becomes a centralized place that makes decisions on what is going to be run based on circumstances (e.g. waiting for child work to finish)

No, the executor does not care about what should be run: it always calls "run".
The only reason we need this WAITING state is to avoid a busy loop: when in the "RUNNING" state, the executor has to schedule "run".
The simplest way to think about this is that we add first-class support for waiting on a callback (triggered by a timer, for example). As we make callbacks first class, we can provide a wrapper that is the only way to invoke a Work instance, which would basically do: if state == WAITING, do nothing; if RUNNING, call run.
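A minimal sketch of such a wrapper, with a stand-in FakeWork type (the real Work class and executor are obviously more involved; the names here are illustrative only):

```cpp
#include <cassert>

enum class WorkState
{
    RUNNING,
    WAITING
};

// Minimal stand-in for a Work instance, just enough to show the wrapper.
struct FakeWork
{
    WorkState state{WorkState::RUNNING};
    int runs{0};
    void run() { ++runs; }
};

// Sketch of the wrapper described above: the only way to invoke a Work.
// WAITING work is skipped entirely (an external event will wake it up);
// RUNNING work gets its run() called.
void
invoke(FakeWork& w)
{
    if (w.state == WorkState::WAITING)
    {
        return; // do not busy-loop; a callback/timer will flip it back to RUNNING
    }
    w.run();
}
```

The key property is that the executor never needs to understand why a work is waiting; it only needs the state check to avoid scheduling it.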
btw, I noticed something strange in the code base: I see that we have a few places that do something like
mApp.getWorkManager().advanceChildren();
That seems fishy to me: why would a child have to "kick" the work manager? It's probably a workaround for a hack of sorts...
> So maybe Work should be the base class, and not know about children. Instead, it would focus on proper state transitions. Meanwhile, WorkParent manages children while also utilizing Work's functionality.
Yes, maybe with some points of clarification:
- children would be managed by WorkParent, adding the notion of hierarchy (ie, it's the one with addWork)

If someone needs to add child work, they can simply derive from WorkParent, or whatever we want to call it.
With all this: is there a reason to keep the notify functionality? The problem I see with notify as it is now is that it's completely separate from run (which then requires people to perform all sorts of dances to schedule things before or after children complete).
Maybe a way to fix this is that, when deriving from WorkParent, implementors get to implement an onRun callback that takes an optional Work (maybe even a vector?). They don't get to implement onRun itself, as we would mark it "final" in WorkParent.
Ok, from the offline discussion today, looks like I needed to make some changes to the design doc I've been working on that I'll post here -- it's almost done. Meanwhile, some clarifications:
With the round-robin approach to scheduling (specifically, where we schedule one work at a time), I wonder how this plays with batching. I guess it shouldn't be any different, except there might be other tasks scheduled in between batching works.
If onRun becomes the only place that sets state, it should also probably be the only place where addChild is allowed (more specifically, an implementer gets to implement a function inside onRun that does the actual work, and might add children). We probably shouldn't add children in places like onStart, onSuccess, etc.
> btw, I noticed something strange in the code base: I see that we have a few places that do something like
> mApp.getWorkManager().advanceChildren();
I think this comes from the fact that when work is added, nothing kicks it off; it's just kind of sitting there until advance is called. So if something is added to the work manager (top level), it needs a manual kick. This sort of makes sense, because it allows for construction of a work tree before scheduling anything.
I suppose we could change that, since in the new design added work goes straight into RUNNING mode. Thus it would be guaranteed to start running right away, so that no manual maintenance is needed.
doActualWork)
Disclaimer: we need better names for these 3 components: WorkFSM, Work (think WorkFSM that manages children), WorkManager
WorkFSM & Work (child of WorkFSM) state = {RUNNING, WAITING, SUCCESS, FAILURE_RETRY, FAILURE_RAISE, FAILURE_FATAL}
Note that there's no PENDING state - any added work goes directly into RUNNING mode, which means it's either waiting for its children or actually running the work.
The new WAITING state is there to indicate that work shouldn't be scheduled yet. It would integrate with RunCommandWork: currently it schedules its completion itself; with the new logic, we would provide a waiting callback that implementers could use to set work back into the RUNNING state so that the work manager can schedule it again.
I think we can deprecate the onStart method - it's no different from onReset with the new logic, so onReset could just handle variable reset/assignment.
Here are some descriptions of the methods we'd use:
virtual void onReset()
void Work::State onRun()
{
    // see explanations below for new methods
    auto nextChild = yieldNextRunningChild();
    if (nextChild)
    {
        nextChild->run();
    }
    auto nextState = doActualWork();
    if (nextState == DONE)
    {
        complete(nextState);
    }
    else
    {
        setState(nextState);
        notifyWorkManager();
    }
}
std::shared_ptr<Work> yieldNextRunningChild() - this method would be used to round-robin child scheduling (in onRun: basically, if children are not finished yet, we would get the next child and tell it to run again). For this we can use an iterator that is re-initialized in onReset.
Work::State doActualWork(); - performs a unit of actual work; implementers can add children and return the RUNNING state to wait for them. notifyWorkManager is basically scheduling a run of the top-level work, which would then decide which work to schedule next according to the RR logic.
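Assuming a cursor index in place of the iterator mentioned above, yieldNextRunningChild could be sketched as below. FakeChild and RoundRobin are illustrative stand-ins, not stellar-core types; the cursor (mNextChild) is what onReset would re-initialize to 0.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

enum class WorkState
{
    RUNNING,
    WAITING,
    SUCCESS
};

// Minimal stand-in for a child work.
struct FakeChild
{
    WorkState state;
};

// Sketch of round-robin child selection: scan forward from the cursor,
// skipping children that are not RUNNING; return nullptr when no child is
// runnable (the parent would then run doActualWork itself).
struct RoundRobin
{
    std::vector<std::shared_ptr<FakeChild>> mChildren;
    std::size_t mNextChild{0}; // re-initialized to 0 in onReset

    std::shared_ptr<FakeChild>
    yieldNextRunningChild()
    {
        for (std::size_t i = 0; i < mChildren.size(); ++i)
        {
            auto idx = (mNextChild + i) % mChildren.size();
            if (mChildren[idx]->state == WorkState::RUNNING)
            {
                mNextChild = (idx + 1) % mChildren.size();
                return mChildren[idx];
            }
        }
        return nullptr;
    }
};
```

The cursor advances past the returned child, so repeated calls cycle fairly through all RUNNING children instead of starving the later ones.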
Below are methods for scheduling the running and completion of work:
void
run()
{
    if (state != WAITING)
    {
        // schedule yourself to run by posting a callback to the IO service
        // There might be additional stuff, but the callback would set state to `RUNNING` and call `onRun`
    }
}

void
complete(State state)
{
    assert(state == DONE); // shouldn't be called when state is RUNNING/WAITING
    if (state == SUCCESS)
    {
        // similarly to `run`, post to the IO service a callback that would set state to SUCCESS then call `onSuccess`
    }
    else if (state == FAILURE)
    {
        // similarly to `run`, post to the IO service a callback that would set state to FAILURE then call `onFailure`
    }
}
virtual void onSuccess()/onFailureRaise()/onFailureFatal() - these would call notifyWorkManager.
Additionally, as mentioned above, we introduce a new state WAITING that would prevent work from being re-scheduled. Specifically, we can introduce a "wait" callback that would be triggered by a timer and actually schedule the running of work; so a work can have a timer and use async_wait with mWaitingCallback. In this case the code would look something like:
void
Work::wakeUp()
{
    assert(getState() == WAITING);
    setState(RUNNING);
    notifyWorkManager(); // let WorkManager know it's time to schedule a run again
}

std::function<void(asio::error_code ec)> mWaitingCallback
{
    mDone = true;
    mErrorCode = ec;
    wakeUp(); // wake up and schedule completion
}

Work::State doActualWork()
{
    if (!mStarted)
    {
        mStarted = true;
        mTimer.async_wait(mWaitingCallback);
        return WAITING;
    }
    else if (mDone)
    {
        return mErrorCode ? FAILURE : SUCCESS;
    }
}
WorkParent was keeping track of Work in mChildren.
WorkManager : public Work (potentially WorkScheduler?) - same onRun method as in WorkFSM, except doActualWork would be just a no-op.

In general, I would prefer that you first describe interfaces without any implementation details (ie public methods only). Then, you can use those interfaces in RunCommand (event based), ApplyBucketsWork (multi-part work) and GetHistoryArchiveStateWork (work with children).
I think they should be:
- WorkFSM -> BasicWork (or WorkBase)
- Work stays the same
- WorkManager, yes, WorkScheduler is probably more accurate
BasicWork
You should start by describing the BasicWork interface.
I would recommend exposing a crank function that internally invokes doRun (if state is RUNNING). That way, we can expand to other state transitions in the future (like doAbort).
I would expect wakeUp to be implemented at this level and not have a dependency on WorkScheduler. Also, you need to define when people are supposed to call wakeUp.
Work
You didn't actually describe the Work interface (well, it's all mixed up with implementation detail); I suspect that once you have BasicWork, it will become easier for you to do so.
Also, I imagine that there are Work-specific methods that would be declared as part of the interface, things like addWork?
Also, the void Work::State onRun() method doesn't make sense to me (presumably the state type should come from BasicWork).
> std::shared_ptr yieldNextRunningChild() This method would be used to round-robin child scheduling (in onRun, basically if children are not finished yet, we would get the next child and tell it to run again)

I would prefer the round-robin set to be { child_1, ... child_n, self }; that way we only perform one unit of work at a time (with what you wrote, it's possible that we perform work in both a child and through doActualWork).

> For this we can use an iterator that is re-initialized in onReset

This doesn't make sense to me: why would we call onReset when work is continuously running?
> Note that WAITING children would not be scheduled

Not sure I understand "scheduled" here; I think you just mean we don't call crank on work in the WAITING state.
> Should be able to remove finished work tree from its children upon its success. For this we can have a slightly modified method that checks state of children - if it's SUCCESS, that work tree can be removed.

I think we should have a clear separation of concerns between the 3 classes.
The only thing special about WorkScheduler is that it schedules crank calls: removing children sounds like a responsibility of Work (if people want to keep completed children around, they can just keep references in local member variables).
> One thing I'm realizing: looks like in this design we would have the top work constantly running (even if the work manager has no children); I guess it's an unlikely scenario that WorkManager has nothing to do, but something to keep in mind.

I am not sure I am following; this is the purpose of the WAITING state.
Okay, will update based on the feedback.

> what we discussed earlier was to never set the state directly and return the desired new state instead (this is the responsibility of BasicWork)

Ok, I think it should be clearer once I separate BasicWork and Work.

> it's unclear to me what implementors are supposed to do when they derive from Work

Like I mentioned earlier, doActualWork would provide the work logic (e.g. applying buckets, publishing snapshots) as well as hint at the next desired state (returned in doActualWork).
> This doesn't make sense to me, why would we call onReset when work is continuously running?

What I meant is that when we transition into important states like retry or failure, the child iterator would be reset.

> Not sure I understand "scheduled" here, I think you just mean we don't call crank on WAITING state.

Yes, essentially a child in the WAITING state gets skipped.

> I am not sure I am following, this is the purpose of the WAITING state

I was talking about WorkManager in particular. With no children, it would keep cranking (not the case currently; hence the manual kick as soon as we add a new work tree to the work manager). Based on the definitions we discussed, I wasn't sure if WAITING is what we want, as WorkManager isn't really blocked on anything.
public:
enum State
{
WORK_RUNNING,
WORK_SUCCESS,
WORK_WAITING,
WORK_FAILURE_RETRY,
WORK_FAILURE_RAISE,
WORK_FAILURE_FATAL
};
void reset(); // reset work to its initial state
private:
// only BasicWork deals with state transitions; it also controls IO service
void setState(State newState);
void run(); // see tentative implementation below
void complete(State state); // set appropriate completion state and notify scheduler
void scheduleRun(); // post callback to `run` to IO service
void scheduleComplete(); // post callback to `complete` to IO service
// ...Retries helper functions would also go here; they are very similar to what we have now... //
protected:
State mState{RUNNING};
// Note that these callbacks are protected
// I don't think there's a good reason for letting outside classes access them
virtual void onReset();
virtual void onFailureRetry();
virtual void onFailureRaise();
virtual void onSuccess();
// Implementers provide work logic used in `onRun`, that hints BasicWork what state to transition to.
virtual State doActualWork() = 0;
virtual State onRun() { return doActualWork(); }
std::function<void(asio::error_code ec)> mWaitingCallback {...};
// whenever implementers decide to return the WAITING state, they need to make sure
// to define an event that re-triggers running (calls `wakeUp`). For example, in the case of a timer,
// implementers can use the generic waiting callback defined above.
void wakeUp();
void crank(); // schedule proper action based on state (`run`, `retry`, `abort` etc...)
// at this level BasicWork would just crank itself, since there is no notion of work
// manager/children yet, but this kind of weird so we'll let `Work` implement it.
virtual void notifyScheduler() = 0;
BasicWork::run would be the method that decides state transitions and could look something like:
void BasicWork::run()
{
    auto nextState = onRun();
    if (nextState == DONE)
    {
        scheduleComplete(nextState); // complete will notify work manager
    }
    else
    {
        setState(nextState);
        notifyScheduler(); // still running or waiting, tell scheduler to crank again
    }
}
public:
    template <typename T, typename... Args>
    std::shared_ptr<T>
    addWork(Args&&... args) { ...same as current code... };

protected:
    std::vector<std::shared_ptr<Work>> mChildren;
    std::weak_ptr<Work> mScheduler; // main scheduler that schedules cranks

    // remove completed children; additionally, implementers can add new children if needed (e.g. BatchWork)
    virtual void manageChildren();
    virtual std::shared_ptr<Work> yieldNextChild(); // if a child is returned, call crank on it, otherwise crank self

    State onRun() override final; // do slightly more in `onRun`, as now we manage children
    State doActualWork() override; // may additionally check if children are done

    void notifyScheduler() override
    {
        // get shared_ptr for mScheduler
        mScheduler->crank();
    }
    // ..various helper methods to check state of children.. //

With this, onRun could tentatively look something like:
State onRun()
{
    manageChildren(); // Add more work if needed
    auto child = yieldNextChild();
    if (child == nullptr) // run self
    {
        return doActualWork();
    }
    else
    {
        if (child->getState() != WAITING)
        {
            child->crank();
        }
        return RUNNING;
    }
}
I think that with all the functionality described above (how WS gets notified and children get removed), there isn't much for WorkScheduler to do except ensure it's not busy-waiting for an event. My main concern is that we might end up with a lot of redundant work being performed. In a tree-like work structure it works okay, because if a leaf work is WAITING, some other work just gets scheduled instead. But if we have a linked-list structure, it gets inefficient. So perhaps WorkScheduler could return the WAITING state in doActualWork, which then gets unblocked by either addWork or mWaitingCallback. This would work both when the scheduler has no children and when all of its children are in the WAITING state.
With this, some of the example works would tentatively look like this:
- RunCommandWork: similarly to what I mentioned before, use a timer and waiting callback in doActualWork; the scheduler does not schedule this work while it's in the WAITING state.
- GetHistoryArchiveStateWork: the child work GetRemoteFileWork is checked and added in manageChildren, then in doActualWork mState gets updated given that the added child completed successfully.
- ApplyBucketsWork: manageChildren is a no-op; doActualWork is where all the fun happens; the derived class controls the progress of bucket application with some internal variables and returns the proper state in doActualWork.

So, implementers have to:
Some other questions/concerns: it's a bit awkward for BasicWork to just work on its own. It's a little weird with notifyScheduler, since BasicWork is its own scheduler. I previously had notifyScheduler just do crank, but I think it's confusing. Instead, we can have Work implement the method, and have all classes derive from Work.

BasicWork
- can you expand comments on what the various methods are?
- I would like the split between public and protected to be: public represents how consumers interact with BasicWork functionality. As BasicWork is supposed to be used to perform some work, I would expect the main crank method to fall into that bucket, for example. Probably a virtual destructor as well.
- why is reset public? Who calls it? (and why?)
- should State be public? I think it should, but right now you don't expose the current state...
- when are the onXYZ methods called? Write those down so that implementors know what they are supposed to do
- what are the default behaviors of the onXYZ methods? (as they are not pure virtual)
- why doActualWork? It doesn't seem to make sense at this layer
- why mWaitingCallback? I don't think the WAITING state requires such a callback.
- notifyScheduler seems to be the wrong construct. Consumers of work are not going to derive from it (this goes back to properly splitting public from protected members). At this layer, the API exposed should only be to perform some work (and other tasks like managing lifetimes and state); scheduling is delegated to the consumer (caller).

Work
Same thing: split public vs protected, as it's unclear what a user of Work would do vs an implementor.
More questions:
- mChildren should not be exposed (ie, should be private); I already wrote earlier that if people really want to keep track of their children, they should do so in a separate variable. In order to have deterministic behavior, mChildren here should be managed only by the Work class.
- mScheduler is an abstraction-layer break; this needs to be solved by properly defining the BasicWork interface
- manageChildren: I don't understand why we need another mechanism to perform work. Spawning children is just something that people can do as part of their custom "run" implementation.
- yieldNextChild: there is no reason to allow people to override this (the only implementation we're going to have is round robin)

public:
enum State
{
WORK_RUNNING,
WORK_SUCCESS,
WORK_WAITING,
WORK_FAILURE_RETRY,
WORK_FAILURE_RAISE,
WORK_FAILURE_FATAL
};
BasicWork(Application& app, std::string uniqueName, size_t maxRetries, std::function<void(State state)> notifyCallback);
virtual ~BasicWork();
State getState();
// to avoid confusion with VirtualClock::crank, let’s call it `crankWork`
void crankWork();
protected:
// Note that these are empty by default, I think
// implementers should not be forced to implement them
virtual void onReset() {};
virtual void onFailureRetry() {};
virtual void onFailureRaise() {};
virtual void onSuccess() {};
virtual State onRun() = 0;
// Reset is exposed at this level, as `Work` should also be able to reset when adding children
void reset()
{
// .. Work might reset some internal variables here.. //
onReset(); // whatever else implementers need to do (e.g. clean up children)
}
void wakeUp()
{
assert(mState == WAITING);
setState(RUNNING);
mNotifyCallback(mState);
}
// This is a callback that implementers provide that will do whatever needed
// internally then notify its parent
std::function<void(State stateOfChild)> mNotifyCallback;
private:
State mState{RUNNING};
// Note that there are only 2 places where setState is called: crankWork and wakeUp
void setState(State newState);
// Retries helper functions would also go here; they are very similar to what we have now,
// except I think we could incorporate WAITING state for the retry delay,
// and wakeUp when it's time to retry
// Additionally, retry mechanism should be internal to BasicWork
crankWork could look something like this:
void crankWork()
{
    // We shouldn't crank work that is finished or waiting
    assert(getState() != DONE && getState() != WAITING);
    // For now, this is only for the running state; we can add a condition for aborting
    auto state = onRun();
    setState(state);
    switch (state)
    {
    case WORK_FAILURE_RETRY:
        onFailureRetry();
        reset(); // note work is reset on any failure
        break;
    case WORK_FAILURE_RAISE:
    case WORK_FAILURE_FATAL:
        onFailureRaise();
        reset();
        break;
    case WORK_SUCCESS:
        onSuccess();
        break;
    default:
        break;
    }
    // completed a unit of work
    mNotifyCallback(state);
}
What everybody can do: start running work and check on its state.
What implementers can do: implement the onXYZ callbacks to do whatever they like.
What only BasicWork can do: transition states.
public:
    Work(Application& app, std::string uniqueName, size_t retries, std::function<void(State state)> callback)
        : BasicWork(app, uniqueName, retries, callback), ...
    virtual ~Work(); // ...ensure children are properly cleaned up...

    template <typename T, typename... Args>
    std::shared_ptr<T>
    addWork(Args&&... args)
    {
        auto child = std::make_shared<T>(...);
        addChild(child);
        mNotifyCallback(getState());
        return child;
    }
    // I imagine we could also have an `addBatchOfWork` (with a better name),
    // so that we notify the parent once.
    // ..Various methods to check the state of children..

protected:
    State onRun() override final
    {
        auto child = cleanUpAndYieldNextChild();
        if (child)
        {
            child->crankWork();
            return RUNNING;
        }
        else
        {
            return doWork();
        }
    }
    // Implementers decide what they want to do: spawn more children,
    // wait for all children to finish, or perform work
    virtual State doWork() = 0;

    void onReset() override
    {
        clearChildren();
    }

private:
    // remove completed children, round-robin to the next running child (skip `WAITING` children)
    std::shared_ptr<Work> cleanUpAndYieldNextChild();
    std::vector<std::shared_ptr<Work>> mChildren;
    // Update `mChildren`, reset child work
    void addChild(std::shared_ptr<Work> child);
What everybody can do: add children, check overall children progress.
What implementers can do: refer to the comment for doWork.
What only Work can do: manage mChildren, round-robin over children.
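A sketch of what the private cleanup-and-yield helper declared above might do, combining the pruning of successful children with the round-robin skip of WAITING children. FakeWork and FakeParent are illustrative stand-ins for the real classes, and the cursor member is an assumption about how the round robin could be tracked:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

enum class State
{
    WORK_RUNNING,
    WORK_WAITING,
    WORK_SUCCESS
};

// Minimal stand-in for a child work.
struct FakeWork
{
    State state;
};

struct FakeParent
{
    std::vector<std::shared_ptr<FakeWork>> mChildren;
    std::size_t mNextChild{0}; // round-robin cursor

    // Drop children that finished successfully, then round-robin to the next
    // child that is actually runnable (WAITING children are skipped).
    // Returns nullptr when the parent should run its own doWork instead.
    std::shared_ptr<FakeWork>
    cleanUpAndYieldNextChild()
    {
        mChildren.erase(
            std::remove_if(mChildren.begin(), mChildren.end(),
                           [](std::shared_ptr<FakeWork> const& c) {
                               return c->state == State::WORK_SUCCESS;
                           }),
            mChildren.end());
        for (std::size_t i = 0; i < mChildren.size(); ++i)
        {
            auto idx = (mNextChild + i) % mChildren.size();
            if (mChildren[idx]->state == State::WORK_RUNNING)
            {
                mNextChild = (idx + 1) % mChildren.size();
                return mChildren[idx];
            }
        }
        return nullptr;
    }
};
```

When every child is WAITING (or there are none), the helper returns nullptr and the parent falls through to its own doWork, which is exactly the hook WorkScheduler uses to return WAITING and avoid the busy loop.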
WorkScheduler : public Work
{
  public:
    WorkScheduler(Application& app);
    virtual ~WorkScheduler();

  protected:
    // When there are no children, or they are all in the WAITING state,
    // go into the WAITING state as well, until an event notification or new children arrive
    State doWork() override
    {
        return WAITING;
    }
}
A callback for the scheduler would be slightly different, something like:
auto maybeScheduleCrank = [self](State state)
{
// children could all be in WAITING state, avoid busy loop
if (self->anyChildrenRunning())
{
if (self->getState() == WAITING)
{
self->wakeUp();
}
else
{
IO_service.post([self]() { self->crankWork(); });
}
}
}
Additionally, I think we might need a static create method for WorkScheduler. We first need to instantiate the scheduler class (I guess with nullptr for mNotifyCallback), then assign a callback to schedule a crank (this is not the case for regular child Work, because at class instantiation the parent is known, hence mNotifyCallback can be supplied). I also think that once WorkScheduler is properly set up, it should start cranking right away, so that no manual trigger is required (this should work fine even when there's no work to do, since it will go straight into the WAITING state).
ApplyBucketsWork
State doWork() override
{
// customized behavior for this work
auto finishStatus = applyBuckets();
if (stillApplyingBuckets())
{
return RUNNING;
}
else
{
return finishStatus == success ? SUCCESS : FAILURE;
}
}
GetHistoryArchiveWork
State doWork()
{
    if (!mGetRemoteFile)
    {
        mGetRemoteFile = addWork<GetRemoteFileWork>...;
        return RUNNING;
    }
    else if (mGetRemoteFile->getState() == SUCCESS)
    {
        loadFile();
        return SUCCESS;
    }
    else if (mGetRemoteFile->getState() == FAILURE)
    {
        // decide how to proceed if the child failed; in this particular case, fail too
        return FAILURE;
    }
    // child is still running
    return RUNNING;
}
// A callback defined for children could be like this (passed to `GetRemoteFileWork` in this case)
std::weak_ptr<Work> weak = ...get weak ptr…;
auto parentNotify = [weak](State state)
{
    // Do whatever the derived class wants to do internally
    // to react to `state`
    auto self = weak.lock(); // get shared ptr for self
    if (self)
    {
        self->mNotifyCallback(state); // propagate up
    }
};
Overall, looks reasonable now. Some questions and objections:

- Why does `crankWork` call `mNotifyCallback`? `crankWork` is always called by the parent (except in the case of the `WorkScheduler`), so the parent must be `RUNNING` if the child is `RUNNING`. So there is no reason to notify the parent in this case, since it can and should take any relevant action in response to the state change the next time it is scheduled for `crankWork`. This question is a consequence of the fact that notification should only be used to wake up a parent.
- Should `wakeUp` assert that the state is `WAITING`? As a counter example, suppose that new child work is added to an already running parent. The child doesn't necessarily know that its parent is running (because `BasicWork` has no access to the parent) so it must attempt to wake its parent.
- I don't think `Work::onRun` captures the right semantics. `Work::onRun` should round-robin over its children and itself, since the work may have to respond to the changing state of its children.

I think it would be helpful if you added more comments to the interface which define in what situations any given function can/should be called.
Several modifications based on the feedback (pseudocode, might need to be simplified):

`mNotifyCallback` could be moved out of `crankWork` and appear at the scheduler level to schedule the next crank. With that, a callback for the scheduler could look something like:
auto maybeScheduleCrank = [self](State state)
{
// children could all be in WAITING state, avoid busy loop
if (self->anyChildrenRunning())
{
// This could happen if `addWork` is called while scheduler is waiting
if (self->getState() == WAITING)
{
self->wakeUp();
}
else
{
IO_service.post([self]()
{
    self->crankWork();
    self->mNotifyCallback();
});
}
}
}
`addWork` should rather look like this:
template <typename T, typename... Args>
std::shared_ptr<T>
addWork(Args&&... args)
{
    auto child = std::make_shared<T>(std::forward<Args>(args)...);
    addChild(child);
    wakeUp(); // ensure state is set to RUNNING
    return child;
}
`onRun` would roughly be:

// Assuming `mScheduleSelf` is initialized to "false"
State onRun() override final
{
    if (mScheduleSelf || !hasRunningChildren())
    {
        mScheduleSelf = false;
        return doWork();
    }
    else
    {
        mScheduleSelf = true;
        auto child = cleanUpAndYieldNextChild();
        child->crankWork();
        return RUNNING;
    }
}
I'm working on adding more comments as well as an initial implementation.
@MonsieurNicolas @jonjove
Additional clarifications, and some learnings from the initial implementation of the work interface:

- `mNotifyCallback` is a way for any `Work` to alert its parent of an event. It appears that in most cases so far (in fact, in all but one, which I will cover below) invoking `wakeUp` on a parent is sufficient. Now, I think we should clearly define whether we expect the parent to exist and not be destroyed prior to its children being destroyed. I recall we discussed that aborting logic should give us this guarantee. The reason I'm bringing this up is that it will influence some implementation choices (e.g. using a weak pointer to the parent in a callback passed to a child, and maybe throwing if the parent doesn't exist; right now we just silently proceed if the parent doesn't exist).
- `WorkScheduler`: for the scheduler, "waking up" means checking its status and potentially posting more work to the IO queue, then "notifying" itself. I'm assuming that the work scheduler can only be in two possible states while the application is running: running or waiting (later, aborting should be added too). The scheduler depends on the state of its children, so when we create the work scheduler via a static `create` method, we assign it a custom `mNotifyCallback` that checks conditions and takes the appropriate action as follows:

| Scheduler state | Children state | Scheduler action |
|---|---|---|
| RUNNING | `anyChildrenRunning()` | `crankWork()`, will run more children work |
| RUNNING | `!anyChildrenRunning()` | `crankWork()`, will go into WAITING |
| WAITING | `anyChildrenRunning()` | `wakeUp()` |
| WAITING | `!anyChildrenRunning()` | no-op |
Actual callback:
mNotifyCallback = [weak]() {
auto self = /* ...get shared_ptr for weak... */
if (self->getState() == BasicWork::WORK_RUNNING)
{
if (self->mScheduled)
{
return;
}
self->mScheduled = true;
self->mApp.getClock().getIOService().post([weak]() {
auto innerSelf = /* ...get shared_ptr for weak... */
innerSelf->mScheduled = false;
innerSelf->crankWork();
innerSelf->mNotifyCallback();
});
}
else if (self->getState() == BasicWork::WORK_WAITING &&
self->anyChildRunning())
{
self->wakeUp();
}
};
Since the scheduler depends on the children, WorkScheduler::doWork
would look like this:
State doWork() override
{
if (anyChildRunning())
{
return BasicWork::WORK_RUNNING;
}
return BasicWork::WORK_WAITING;
}
The above aligns with expected action in mNotifyCallback
.
Also, it might be worth noting that we use `shared_from_this()` frequently in the current Work interface with the assumption that there is always an existing shared_ptr (i.e. the parent creates and manages the child). This makes me wonder: does it ever make sense to create new `Work` NOT via `addWork`? I don't think we use anything other than `addWork` when creating work trees in the code; even in the tests all work is created with the application's `WorkManager`. So maybe it makes sense to restrict work creation to `addWork`.
Retries: I briefly mentioned retries before; I think they would work as follows: the next desired state is obtained in `crankWork`; if it's a retry, we can go into the WAITING state and employ a mechanism similar to the one described for `RunCommandWork`. Specifically, using a timer, the work waits some time, then the expiration of the timer triggers `wakeUp`, which transitions the work into the RUNNING state and propagates the notification, so the work can be retried. We should also probably have some consumer-facing states for clarity, e.g. `RETRYING` or something.
Also, a few more thoughts on how `abort` logic fits into the design. It will be added once the initial design implementation is in place.
Like I already mentioned before, there are really two abort triggers: external and internal. An example of an external trigger is system shutdown, where an abort signal is issued to the work scheduler. An internal abort would be the following scenario: parent work A has some children. Consider one of the children goes into FAILURE_RAISE mode. Next time A is scheduled, it needs to take proper action to ensure the rest of the children are aborted. In order for this to happen, a few things to consider:
- `Work` children rely on a signal they get from the parent (it could be a bool member variable), and act accordingly to abort. This should live at the `Work` level, to avoid having implementers deal with the abort mechanism (though they'd need to implement `onAbort`). I think most of the time `onAbort` would just do nothing; the work interface resets any work in a terminal failure state anyway, so as long as implementers provide a proper `onReset`, it should work fine.
- For classes like `RunCommandWork`, things would be slightly more complicated, since such work might need to issue a `kill` to whatever process it is waiting on. I also suspect that actually waiting for the `kill` to complete might take quite a bit of time, so a timeout/force kill/etc. should probably be added to `onAbort`.
It's quite unclear for implementors what the various methods (`onReset`, `onStart`, etc.) are really supposed to do, especially when combining Work (what `WorkParent` does, for example). For example, I would expect implementations to follow:

- `onReset` initializes the Work to a pristine state (does NOT create work)
- `onStart` should be the one creating the initial batch of work (also called on retry, as a retry triggers in sequence reset, advance, start)

This is illustrated by `WorkParent`, which doesn't manage its children at all: it's up to all implementors to remember to call `clearChildren()` on reset, for example (or face undefined behavior on retry).