stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org
Other
3.12k stars 970 forks source link

design cleanup: Work interface is a mess #1755

Closed MonsieurNicolas closed 5 years ago

MonsieurNicolas commented 6 years ago

It's quite unclear for implementors what the various methods (onReset, onStart, etc) are really supposed to do, especially when combining Work (what WorkParent does for example).

For example, I would expect implementations to follow:

This is illustrated by WorkParent that doesn't manage its children at all: it's up to all implementors to remember to call clearChildren() on reset for example (or face undefined behavior on retry).

MonsieurNicolas commented 6 years ago

I think that some of the implications of this are to ensure that advance is the place that centralizes decisions on state transition and "run" would create more work if needed.

marta-lokhova commented 6 years ago

I did a bit of investigation of this looking for patterns and how the code it trying to use Work, so here are some thoughts:

MonsieurNicolas commented 6 years ago

Multi-part work

Let's use ApplyBucketsWork as a way to discuss what is going on here for the RUNNING state as it's actually a fairly simple Work that doesn't spawn children.

What this does:

The problem I see here is that it's actually abusing the way Work deals with state transition: there is no reason to have the outer loop (go over levels) and the inner loop (apply buckets for a level) be implemented in different ways: those should be implementation details on how ApplyBucketsWork manages what to do while it's "running".

State definitions and callbacks

We need to clearly define what each state is supposed to be. Right now I don't know what WORK_PENDING really means vs WORK_RUNNING and what an implementor is supposed to really do in the callbacks (which should allow us to have proper separation of concern and centralize different aspects of scheduling/processing).

Per my comment for a simple work, I think that the only difference between WORK_PENDING and WORK_RUNNING is that onStart gets called (but right now it's used for other purpose in advance).

An example on how we can force people to implement things properly: for onReset, we can call it before any important transition, including when we tear down Work, if we do this right, I think we can stop calling clearChildren in all destructors (and instead assert in Work::~Work if there are children still present).

Work with children

I think that we can make working with children a lot cleaner with some changes to the state definitions.

I think that we could make everything work fine by just using WORK_RUNNING even when we're waiting for all children to complete (notify would not be used) but this would be inefficient (busy loop) as we would be constantly scheduling work to check if children are done.

So, instead, I think there should be a way for onRun to tell the executor that they want to wait for a child to complete before being scheduled again (maybe introduce a new state WORK_WAITING for this? it would also integrate neatly with event based work like RunCommand). I don't know if we need anything else than that (like we don't need a "wait for all children to complete") as done in the current implementation of advance: it's very possible that instead it gets replaced by a helper function that gets called at the beginning of onRun by implementors if they really want to wait for all children (but to your point, some people want to run things in parallel, some want to sequence, etc). There might be a need for another helper function to help fail if any children failed fatal for example. Pattern would be something like:

ExecutionStatus MyWork::onRun() {
    // helper function returns:
   // DONE: all children done, no errors
  //  RUNNING: some children are still running
  //  FAILURE_*: some children failed
    auto status = checkAllChildrenDone();
    if (status == DONE)
    {
        // do something else
    }
    else
    {
         return status;
    }
}

With this, I think notify is just doing something like scheduleRun of the parent (forcing decisions to be made by run).

Work vs WorkParent

I agree the separation of concern is not great on this one.

I am not sure what the original intent of WorkParent was:

marta-lokhova commented 6 years ago

Work FSM

The problem I see here is that it's actually abusing the way Work deals with state transition: there is no reason to have the outer loop (go over levels) and the inner loop (apply buckets for a level) be implemented in different ways: those should be implementation details on how ApplyBucketsWork manages what to do while it's "running".

This makes me wonder: why do we even allow work to put itself back into PENDING state (on the graph, that would be running->pending state transition). Sure, we could change the current code to not abuse this transition, but it'll be hard to force future implementors to follow. It also seems like this state transition allowed for all kinds of different hacks in the code. Maybe Work shouldn't be allowed to re-start itself? Sounds like the rule of thumb should be "once you start work, you either stay in running state, or transition into a completed {success or failure} state", so there's no way back.

Per my comment for a simple work, I think that the only difference between WORK_PENDING and WORK_RUNNING is that onStart gets called (but right now it's used for other purpose in advance).

yes, that is my understanding as well. Though onStart itself is confusing, because implementors do all kinds of things in onStart (including actual work -- still not sure what's the rationale behind that)

for onReset, we can call it before any important transition, including when we tear down Work, if we do this right, I think we can stop calling clearChildren in all destructors (and instead assert in Work::~Work if there are children still present).

Are you assuming here the guarantee we discussed before that core won't shutdown until all work is in completed state? Can we also define "important transitions" that you are talking about? Transitioning into a completion state and retrying might be good candidates. Not sure if calling onReset would still make sense when we add new work.

Let's try to define clear guidelines on which callbacks should be used for what: onStart - Initialize necessary variables, prepare Work for execution. onRun - do actual work, which might include adding children and start waiting for them. onSuccess - transition into failure or success state, or schedule next chunk of work to run. Is onSuccess the right place to return new state? Perhaps it could be onRun? onReset - reset Work to the initial state (should be the same as when we first added the work)

Children management & scheduling

So, instead, I think there should be a way for onRun to tell the executor that they want to wait for a child to complete before being scheduled again (maybe introduce a new state WORK_WAITING for this?

I don't know if I agree on this one. This sounds to me like:

Work & WorkParent

MonsieurNicolas commented 6 years ago

Work FSM

I agree that it may make sense to ditch the onStart callback.

I also think that nobody should be calling SetState outside of the Work class.

Maybe Work shouldn't be allowed to re-start itself?

That's a good point: I think the only reason one needs to "restart itself" is to just wait for some outside event (and for that, a callback with for example a timer should be used), any other restart should be managed as a failure (that kicks off a restart).

Not sure if calling onReset would still make sense when we add new work.

onReset would only be called when starting a new work or when tearing down work (failure/abort or shutdown)

onSuccess - transition into failure or success state, or schedule next chunk of work to run. Is onSuccess the right place to return new state? Perhaps it could be onRun?

Mmm, yeah onSuccess should not be called when running, I think the problem right now is that it's called when children complete. I think it's much easier to turn onRun into the central place where implementors decide what to do next during execution.

I think onSuccess, to your point, should only be called when Work transitions to SUCCESS (to unblock other work for example) and not allow it to change state. The reason it's nice to have it as its own callback is that it clearly separates code that should happen when done (same with onFailure).

Also, if we do this, then onRun is the only one returning the desired new state.

Children management & scheduling

Executor becomes a centralized place that makes decisions on what is going to be run based on circumstances (e.g. waiting for child work to finish)

no, the executor does not care about what should be run: it always calls "run". The only reason we need this WAITING state is to avoid a busy loop: when in "RUNNING" state, the executor has to schedule "run".

The simplest way to think about this is that we first class support for waiting for a callback (triggered by a timer for example). As we first class callbacks, we can provide a wrapper that is the only way to invoke a Work instance, that will basically do something like:

  1. enforce that state == WAITING
  2. set state to RUNNING
  3. schedule run

btw, I noticed something strange in the code base: I see that we have a few places that do something like

mApp.getWorkManager().advanceChildren();

That seems fishy to me: why would a child have to "kick" the work manager? It's probably a work around for a hack of sorts...

Work & WorkParent

So maybe Work should be the base class, and not know about children. Instead, it would focus on proper state transitions. Meanwhile, WorkParent manages children while also utilizing Work's functionality.

Yes, maybe with some points of clarification:

If someone needs to add child work, they can simply derive from the WorkParent or whatever we want to call it.

With all this: is there a reason to have the notify functionality? The problem I see with notify as it is now is that it's completely separate from run (which then requires people to perform all sorts of dance to schedule things before or after children complete). Maybe a way to fix this is when deriving WorkParent implementors get to implement a onRun callback that has an optional Work (maybe an vector even?)? They don't get to implement onRun as we would mark it "final" in WorkParent.

marta-lokhova commented 6 years ago

Ok, from the offline discussion today, looks like I needed to make some changes to the design doc I've been working on that I'll post here -- it's almost done. Meanwhile, some clarifications:

btw, I noticed something strange in the code base: I see that we have a few places that do something like

mApp.getWorkManager().advanceChildren();

I think this comes from a fact that when work is added, nothing kicks it off, it's just kind of sitting there until advance it called. So if something is added to the work manager (top level), it needs a manual kick. This sort of makes sense, because it allows for construction of a work tree, before scheduling anytning. I suppose we could change that since in the new design added work goes straight into RUNNING mode. Thus it would be guaranteed to start running right away, so that no manual maintenance is needed.

marta-lokhova commented 6 years ago

High-level

Goals

What is provided by the work interface

What is required from the implementers

Interface

Disclaimer: we need better names for these 3 components: WorkFSM, Work (think WorkFSM that manages children), WorkManager

WorkFSM & Work (child of WorkFSM) state = {RUNNING, WAITING, SUCCESS, FAILURE_RETRY, FAILURE_RAISE, FAILURE_FATAL}

Here are some descriptions of the methods we'd use:

Below are methods for scheduling running and completion of work

void 
run()
{
    if (state != WAITING)
   { 
         // schedule yourself to run by posting a callback to IO service
         // There might be additional stuff, but the callback would set state to `RUNNING` and call `onRun`
    }
}

void 
complete(State state)
{
    assert(state == DONE) // shouldn’t be called when state RUNNING/WAITING
    if (state == SUCCESS)
    { 
         // similarly to `run`, post to IO service a callback that would set state to SUCCESS then call `onSuccess`
    }
   Else if (state == FAILURE)
    {
     // similarly to `run`, post to IO service a callback that would set state to FAILURE then call `onFailure`
    }  
}

virtual void onSuccess()/onFailureRaise()/onFailureFatal()

Additionally, as mentioned above, we introduce a new state WAITING that would prevent work from being re-scheduled. Specifically we can introduce "wait" callback that would would be triggered by a timer and actually schedule running of work; So work can have a timer and use async_wait with mWaitingCallback. In this case code would look something like:

void
Work::wakeUp()
{
        assert(getState() == WAITING);
        setState(RUNNING);
    notifyWorkManager(); // let WorkManager know it’s time to schedule run again    
}

std::function<void(asio::error_code ec)> mWaitingCallback
{
    mDone = true;
    mErrorCode = ec
    wakeUp(); // wake up and schedule completion
}

void Work::State doActualWork() 
{
    if (!Started)
    {
        Started = true;
        mTimer.asyncWait(mWaitingCallback);
        return WAITING;
    }
    else if (mDone)
    {
        return mErrorCode ? FAILURE : SUCCESS);
    }
}

WorkManager : public Work (potentially WorkScheduler?)

MonsieurNicolas commented 6 years ago

In general, I would prefer that you first describe interfaces without any implementation details (ie public methods only).

Then, you can use those interfaces in

naming

I think they should be: WorkFSM -> BasicWork (or WorkBase) Work stays the same WorkManager, yes WorkScheduler is probably more accurate

interfaces

BasicWork

You should start by describing the BasicWork interface.

I would recommend exposing a crank function that internally invokes doRun (if state is RUNNING). That way, we can expand to other state transitions in the future (like doAbort).

I would expect wakeUp to be implemented at this level and not have a dependency on WorkScheduler.

Also, you need to define when people are supposed to call wakeUp.

Work

You didn't actually describe the Work interface (well it's all mixed up with implementation detail); I suspect that once you have BasicWork, it will become easier for you to do so.

Also, I imagine that there are Work specific methods that would be declared as part of the interface, things like addWork?

Also, void Work::State onRun() method doesn't make sense to me:

implementation

std::shared_ptr yieldNextRunningChild() This method would be used to round-robin child scheduling (in onRun, basically if children are not finished yet, we would get the next child and tell it to run again)

I would prefer the round robin set to be { child_1, ... child_n, self }, that way we only perform one unit of work at a time (with what you wrote it's possible that we perform work in both a child and through doActualWork

For this we can use an iterator that is re-initialized in onReset

This doesn't make sense to me, why would we call onReset when work is continuously running?

Note that WAITING children would not be scheduled

Not sure I understand "scheduled" here, I think you just mean we don't call crank on WAITING state.

Should be able to remove finished work tree from its children upon its success. For this we can have a slightly modified method that checks state of children - if it's SUCCESS, that work tree can be removed.

I think we should have a clear separation of concern between the 3 classes. The only thing special about WorkScheduler is that it schedules crank calls: removing children sounds like a responsibility of Work (if people want to keep children that completed around, they can just keep references in local member variables)

One thing I'm realizing, looks like in this design we would have top work constantly running constantly running (even if work manager has no children); I guess it's an unlikely scenario when WorkManager has nothing to do, but something to keep in mind.

I am not sure I am following, this is the purpose of the WAITING state

marta-lokhova commented 6 years ago

Okay, will update based on the feedback.

what we discussed earlier was to never set the state directly and return the desired new state instead (this is the responsibility of BasicWork)

Ok, I think it should be more clear once I separate BasicWork and Work

it's unclear to me what implementors are supposed to do when they derive from Work

Like I mentioned earlier, doActualWork would Provide work logic (e.g. applying buckets, publishing snapshots) as well as hint the next desired state (returned in doActualWork)

This doesn't make sense to me, why would we call onReset when work is continuously running?

What I meant is that when we transition into important states like retry or failure, child iterator would be reset.

Not sure I understand "scheduled" here, I think you just mean we don't call crank on WAITING state.

Yes, essentially child in WAITING state gets skipped.

I am not sure I am following, this is the purpose of the WAITING state

I was talking about WorkManager in particular. With no children, if would keep cranking (not the case currently; hence, the manual kick as soon as we add a new work tree to work manager). Based on definitions we discussed, I wasn't sure if WAITING is what we want as WorkManager isn't really blocked on anything.

marta-lokhova commented 6 years ago

BasicWork

public:
    enum State
    {
        WORK_RUNNING,
        WORK_SUCCESS,
        WORK_WAITING,
        WORK_FAILURE_RETRY,
        WORK_FAILURE_RAISE,
        WORK_FAILURE_FATAL
    };

    void reset(); // reset work to its initial state 

private:
   // only BasicWork deals with state transitions; it also controls IO service
    void setState(State newState); 

    void run(); // see tentative implementation below
    void complete(State state); // set appropriate completion state and notify scheduler

    void scheduleRun(); // post callback to `run` to IO service
    void scheduleComplete(); // post callback to `complete` to IO service

    // ...Retries helper functions would also go here; they are very similar to what we have now... // 

protected:  
    State mState{RUNNING};
    // Note that these callbacks are protected
    // I don't think there's a good reason for letting outside classes access them
    virtual void onReset();
    virtual void onFailureRetry();
    virtual void onFailureRaise();
    virtual void onSuccess();

    // Implementers provide work logic used in `onRun`, that hints BasicWork what state to transition to.
    virtual State doActualWork() = 0;
    virtual State onRun( return doActualWork(); );

    std::function<void(asio::error_code ec)> mWaitingCallback {...};
    // whenever implementers decide to return WAITING state, they need to make sure
    // to define an event that re-trigger running (call `wakeUp`). For example, in case of a timer, 
    // implementers can use a generic waiting callback defined above. 
    void wakeUp(); 

    void crank(); // schedule proper action based on state (`run`, `retry`, `abort` etc...)

    // at this level BasicWork would just crank itself, since there is no notion of work 
    // manager/children yet, but this kind of weird so we'll let `Work` implement it.
    virtual void notifyScheduler() = 0;

BasicWork::run would be the method that decides state transitions and could look something like:

void BasicWork::run()
{
    auto nextState = onRun();
    If (nextState == DONE)
    {
        scheduleComplete(nextState); // complete will notify work manager
         }
         else 
         {
                setState(nextState);
            notifyScheduler(); // still running or waiting, tell scheduler to crank again
         }
}

Work : public BasicWork

public:
    template <typename T, typename... Args>
    std::shared_ptr<T>
    addWork(Args&&... args) { ...same as current code... };

protected:
    std::vector<std::shared_ptr<Work>> mChildren;
    std::weak_ptr<Work> mScheduler; // main scheduler that schedules cranks

    // remove completed children, additionally implementers can add new children if needed (e.g. BatchWork)
    virtual void manageChildren(); 
    virtual std::shared_ptr<Work> yieldNextChild(); // if child is returned, call crank on it, otherwise crank self
    State onRun() override final; // do slightly more in `onRun`, as now we manage children
    State doActualWork() override; // may additionally check if children are done
    void notifyScheduler() override
    {
        // get shared_ptr for mScheduler 
    mScheduler->crank();
    }

    // ..various helper methods to check state of children.. //

With this, onRun could tentatively look something like:

State onRun()
{
manageChildren(); // Add more work if needed
Auto child = yieldNextChild();
If (child == nullptr) // run self
{
    return doActualWork(); 
}
else 
{
        if (child->getState != WAITING)
        {
             child->crank();
        }
        return RUNNING;
}
}

WorkScheduler : public Work

I think we all the functionality described above (how WS gets notified and children removed), there isn't much for WorkScheduler to do except for ensuring it's not busy-waiting for an event. My main concern is that we might end up with a lot of redundant work being performed. In a tree-like work structure it works okay, because if a leaf work is WAITING, some other work just gets scheduled instead. But if we have a linked list structure, it gets inefficient. So perhaps WorkScheduler could return WAITING state in doActualWork which then gets unblocked by either addWork or mWaitingCallback. This would work for both cases where scheduler has not children, or all of its children are in WAITING state.

Examples

With this, some of the example works would look tentatively like this:

So, implementers have to:

Some other questions/concerns:

MonsieurNicolas commented 6 years ago

BasicWork

Public interface vs protected

can you expand comments on what the various methods are?

I would like the split between public and protected to be:

other questions

Work

Same thing: split public vs protected as it's unclear what a user of Work would do vs an implementor.

More questions:

marta-lokhova commented 6 years ago

BasicWork

Public:
    enum State
    {
        WORK_RUNNING,
        WORK_SUCCESS,
        WORK_WAITING,
        WORK_FAILURE_RETRY,
        WORK_FAILURE_RAISE,
        WORK_FAILURE_FATAL
    };

    BasicWork(Application& app, std::string uniqueName, size_t maxRetries, std::function<void(State state)> notifyCallback);
    virtual ~BasicWork();

    State getState();
    // to avoid confusion with VirtualClock::crank, let’s call it `crankWork`
    void crankWork();

Protected:  
    // Note that these are empty by default, I think
    // implementers should not be forced to implement them 
    virtual void onReset() {};
    virtual void onFailureRetry() {};
    virtual void onFailureRaise() {};
    virtual void onSuccess() {};
    virtual State onRun() = 0;

    // Reset is exposed at this level, as `Work` should also be able to reset when adding children
    void reset()
    {
    // .. Work might reset some internal variables here.. //
    onReset(); // whatever else implementers need to do (e.g. clean up children)
    }

    void wakeUp()
    {
    assert(mState == WAITING);
    setState(RUNNING);
    mNotifyCallback(mState);
    }

    // This is a callback that implementers provide that will do whatever needed
    // internally then notify its parent
    std::function<void(State stateOfChild)> mNotifyCallback; 

Private:
    State mState{RUNNING};
    // Note that there are only 2 places where setState is called: crankWork and wakeUp
    void setState(State newState); 

    // Retries helper functions would also go here; they are very similar to what we have now,
    // except I think we could incorporate WAITING state for the retry delay,
    // and wakeUp when it's time to retry
    // Additionally, retry mechanism should be internal to BasicWork

crankWork could look something like this


void crankWork()
    {
    // We shouldn’t crank on work that is finished or waiting
    assert(getState() != DONE && getState() != WAITING); 

    // For now, this is only for running state, we can add if condition for aborting

    auto state = onRun();
    setState(state);

    if (state == FAILURE)
    {   
        switch(state)
        {
            Case RETRY:
                onFailureRetry();
            Case RAISE:
            Case FATAL:
                onFailureRaise();
                }
                reset(); // note work is reset on any failure
        }
        else if (state == SUCCESS)
       {
           onSuccess();
       }

    // completed a unit of work
    mNotifyCallback(state);
    }

What everybody can do: start running work and check on its state What implementers can do: implement onXYZ callbacks to do whatever they like What only BasicWork can do: transitions states

Work : public BasicWork

Public:
     Work(Application& app, std::string uniqueName, size_t retries, std::function<void(State state)> callback) : public BasicWork(app, uniqueName, retries, callback), ... 
    virtual ~Work( ...ensure children are properly cleaned up... ); 

    template <typename T, typename... Args>
    std::shared_ptr<T>
    addWork(Args&&... args)
    {
    Auto child = make_shared<Work>(...);
    addChild(child);
    mNotifyCallback(); 
    }

    // I imagine we could also have a `addBatchOfWork` (with a better name),
    // so that we notify parent once. 

    // ..Various methods to check the state of children.. 

Protected:
        State onRun() override final
    {
        auto child = cleanUpAdnYieldNextChild();    
        if (child)
        {
            child->crankWork();
            return RUNNING;
                 }
                else 
                {
                    return doWork();
                }
        }

         // Implementers decide what they want to do: spawn more children,
        // wait for all children to finish, or perform work 
        virtual State doWork() = 0;

        void onReset() override
        {
            clearChildren();
        }

Private:
    // remove completed children, round-robin next running child (skip `WAITING` children)
    std::shared_ptr<Work> cleanUpAdnYieldNextChild(); 
    std::vector<std::shared_ptr<Work>> mChildren;

   // Update `mChildren`, reset child work 
   void addChild(std::shared_ptr<Work> child);

What everybody can do: add children, check overall children progress What implementers can do: refer to comment for doWork What only Work can do: manage mChildren, round-robin over children

WorkScheduler : public Work
{
public:
    WorkScheduler(Application& app);
    virtual ~WorkScheduler(); 

protected:
        // When there are no children, all they all are in WAITING state, 
        // go into waiting state as well until event notification or new children 
    State doWork() override 
    {
        Return WAITING;
    }
}

A callback for the scheduler would be slightly different, something like:

auto maybeScheduleCrank = [self](State state)
{
    // children could all be in WAITING state, avoid busy loop 
    if (self->anyChildrenRunning()) 
    {
        if (self->getState() == WAITING)
        {
            self->wakeUp();
         }
        else
        {
            IO_service.post(self->crankWork());
        }
    }
}

Additionally, I think we might need a static create method for WorkScheduler. We first need to instantiate scheduler class (I guess with nullptr for mNotifyCallback), then assign a callback to schedule crank (this is not the case for regular child Work because at class instantiation the parent is known, hence mNotifyCallback can be supplied). I also think that once WorkScheduler is properly setup, it should start cranking right away, so that there's no manual trigger required. (which should work fine even when there's no work to do since it'll go straight into WAITING state)

Examples

ApplyBucketsWork

State doWork() override
{
    // customized behavior for this work
    auto finishStatus = applyBuckets();
    if (stillApplyingBuckets())
    {
        return RUNNING;
    }
    else 
    {
        return finishStatus == success ? SUCCESS : FAILURE;
    }
}

GetHistoryArchiveWork

State doWork()
{
    if (!mGetRemoteFile)
    {
        mGetRemoteFile = addWork<GetRemoteFileWork>...;
        Return RUNNING;
    }
    else 
    {
        if (mGetRemoteFile->getState() == SUCCESS)
        {
            loadFile(); 
            return SUCCESS; 
        }
        else 
        {
            // decide how to proceed if child failed; in this particular case, fail too
            return FAILURE;
        }
    }
}

// A callback defined for children could be like this (passed to `GetRemoteFileWork` in this case) 

std::weak_ptr<Work> self = ...get weak ptr…;
auto parentNotify = [self](State state)
 {
    // Do whatever derived class wants to do internally
    // to react to `state`

    // get shared ptr for self
    self->mNotifyCallback(state); // propagate up 
}
jonjove commented 6 years ago

Overall, looks reasonable now. Some questions and objections:

  1. Why does crankWork call mNotifyCallback? crankWork is always called by the parent (except in the case of the WorkScheduler), so the parent must be RUNNING if the child is RUNNING. So there is no reason to notify the parent in this case, since it can and should take any relevant action in response to the state change the next time it is scheduled for crankWork. This question is a consequence of the fact that notification should only be used to wake up a parent.
  2. Can wakeUp assert that the state is WAITING? As a counter example, suppose that new child work is added to an already running parent. The child doesn't necessarily know that its parent is running (because BasicWork has no access to the parent) so it must attempt to wake its parent.
  3. I don't think the pseudocode for Work::onRun captures the right semantics. Work::onRun should round-robin over its children and self, since self may have to respond to the changing state of its children.

I think it would be helpful if you added more comments to the interface which define in what situations any given function can/should be called.

marta-lokhova commented 6 years ago

Several modifications based on the feedback (pseudocode, might need to be simplified):

  1. yeah, I think mNotifyCallback could be removed out of crankWork, and appear on the scheduler level to schedule the next crank. With that, a callback for the scheduler could look something like: auto maybeScheduleCrank = [self](State state)
    {
    // children could all be in WAITING state, avoid busy loop 
    if (self->anyChildrenRunning()) 
    {
                // This could happen if `addWork` is called while scheduler is waiting
        if (self->getState() == WAITING)
        {
            self->wakeUp();
         }
        else
        {
            IO_service.post(
                {
                    self->crankWork();
                    self->mNotifyCallback();
                                 });
        }
    }
    }
  2. I think you are right, and the waiting state is not guaranteed. I also realized that addWork should rather look like this:
    template <typename T, typename... Args>
    std::shared_ptr<T>
    addWork(Args&&... args)
    {
    Auto child = make_shared<Work>(...);
    addChild(child);
    wakeUp(); // ensure to set state to RUNNING
    }
  3. Given your suggestion, onRun would roughly be:
        // Assuming `mScheduleSelf` is initialized to “false”
        State onRun() override final
    {
        If (mScheduleSelf || !hasRunningChildren())
        {
            mScheduleSelf = false;
                         return doWork();   
                 }  
                 else 
        {
             mScheduleSelf = true;
             auto child = cleanUpAndYieldNextChild();
                 child->crankWork();
                 return RUNNING;
        }

I'm working on adding more comments as well as an initial implementation.

marta-lokhova commented 6 years ago

@MonsieurNicolas @jonjove

Additional clarifications, and some learnings from the initial implementation of work interface:

Scheduler state Children state Scheduler action
RUNNING anyChildrenRunnning() crankWork(), will run more children work
RUNNING !anyChildrenRunnning() crankWork(), will go into WAITING
WAITING anyChildrenRunnning() wakeUp()
WAITING !anyChildrenRunnning() no-op

Actual callback:

mNotifyCallback = [weak]() {
        auto self = /* ...get shared_ptr for weak... */
        if (self->getState() == BasicWork::WORK_RUNNING)
        {
            if (self->mScheduled)
            {
                return;
            }
            self->mScheduled = true;
            self->mApp.getClock().getIOService().post([weak]() {
                auto innerSelf = /* ...get shared_ptr for weak... */
                innerSelf->mScheduled = false;
                innerSelf->crankWork();
                innerSelf->mNotifyCallback();
            });
        }
        else if (self->getState() == BasicWork::WORK_WAITING &&
                 self->anyChildRunning())
        {
            self->wakeUp();
        }
    };

Since the scheduler depends on the children, WorkScheduler::doWork would look like this:

    State doWork() override 
    {
            if (anyChildRunning())
                    {
                        return BasicWork::WORK_RUNNING;
                    }
                    return BasicWork::WORK_WAITING;
    }

The above aligns with expected action in mNotifyCallback.

marta-lokhova commented 6 years ago

Also, few more thoughts to see how abort logic fits into the design. It will be added once the initial design implementation is in place.

Like I already mentioned before, there are really two abort triggers: external and internal. An example of an external trigger is system shutdown, where an abort signal is issued to the work scheduler. An internal abort would be the following scenario: parent work A has some children. Consider one of the children goes into FAILURE_RAISE mode. Next time A is scheduled, it needs to take proper action to ensure the rest of the children are aborted. In order for this to happen, a few things to consider: