maleck13 commented 6 years ago

Proposal

Currently we use a subscriber pattern where as a Job is started and is given a channel to send messages on, there is then a subscriber to this channel which uses the DAO to update state based on the msg sent. While this works, I wanted to open a constructive discussion about potential alternatives that may simplify and improve on this.

Some thoughts on this approach:

As there is only one subscriber to this channel, it causes a natural bottleneck. Once a Job has sent a message to the channel, no other work can be done nor can state be saved until after the subscriber has completed its work. Currently this work is all synchronous so only one Job state will be saved at any given time and all other Jobs of the same type will block waiting to send their messages through the channel until it is complete.
There seems to be very little benefit in the current subscribers. The Jobs know the state they are in and also have access to the credentials sending this information through a channel to be saved seems overly complex.

Alternatives

Pass an interface that exposes the required methods of the DAO to the Jobs allowing the jobs to make the required state changes. As these jobs are running within their own go routine and no longer using the channel, it would remove any potential to impact on other Jobs while simplifying the code.
Add a set of Observers to each job that are passed the JobState as it changes, these Observers could ensure we keep the responsibilities decoupled.
metrics
logging
state update

Sudo code:

NewProvisionJob(..., observers []ProvisionStateObserver){}

(p *ProvisionJob)Run func(...){
   js := JobState{}
   for _, ob := range p.observers{

      go ob(js)

   }

}

The above code may not be clear enough. If there is appetite for either of these approaches I am happy to create a more detailed proposal.

@jmrodri @rthallisey based on comments in other PRs created this issue to reflect some of my thoughts on the subscribers. Hope it makes sense.

jmrodri commented 6 years ago

Adding @eriknelson since he wrote the original WorkEngine/WorkMsg that we modified for use with our broker.

jmrodri commented 6 years ago

The idea was there could be many workers for a single type. I.e. multiple provision jobs running with a single provision subscriber. The channel between them is a buffer so unless the buffer is full, the jobs and subscriber can run concurrently. Once the buffer is full with msgs, then the jobs would block until the subscriber finishes reading the messages.

                                          <-  job1
dao <- subscriber <- [ buffered channel ] <-  job2
                                          <-  job3

jmrodri commented 6 years ago

@maleck13 so in your description this is what I picture:

       metric <-\
      logging <--- job1
status update <-/

       metric <-\
      logging <--- job2
status update <-/

jmrodri commented 6 years ago

I would like to backup and ask, what is it that we are trying to solve at a high level? When we put the work engine into the broker, I was trying to solve running multiple provisions, deprovisions, etc. concurrently for a given instance. With the subscriber being the sole dao writer, it was easy to ensure that we weren't going to clobber something since it controlled what went in for a set of jobs.

I know there's granular status updates which is something we're trying to put in (I think that's what the last_operation work is about). Besides that what else is we're trying to solve?

concurrent jobs for a given instance
granular status updates
??

eriknelson commented 6 years ago

Once a Job has sent a message to the channel, no other work can be done

@jmrodri is right, this statement is not true. You can have as many jobs (APB actions) as you like executing in parallel. Jobs will block on channel write if the rate at which the subscriber messages is less than the rate at which jobs are finishing. I'll also point out that our jobs don't do any useful work after writing to the channel. They write when they error out, or finish their work, so a blocked job write isn't big deal. I foresee that may change if we start to introduce ongoing progress notifications, since of course you do not want to block a job trying to report its progress from continuing.

There seems to be very little benefit in the current subscribers. The Jobs [...] have access to the credentials sending this information through a channel to be saved seems overly complex.

The jobs were designed to do the work in parallel and pass the result (credentials) off to someone else (subscriber) to deal with. The single subscriber reading off the result queue works as a sync mechanism so we don't have a bunch of subscribers trying to write to the same resource via the dao. There is nothing that prevents us from attaching as many subscribers as we'd like to accelerate the processing of the queue, but we haven't explored whether race conditions may bite us as a result.

The Jobs know the state

This is a good point. It makes sense to me that the jobs themselves should be the ones writing their current state to the message, especially when you consider progress updates, since of course the jobs themselves are going to know that information, not the subscriber. I would definitely be in favor of the jobs writing their state to the message. Jobs should not be writing their own state to etcd however. Their job (harhar), is to produce a generic result and leave that up to a subscriber to define what it would like to do with it.

There seems to be very little benefit in the current subscribers. The role of the subscriber is to do something useful given the result. In our case we only have an etcd subscriber who's single concern is persisting that data in etcd. This is decoupled from the job deliberately. Tomorrow I may want to also attach an SQS subscriber that sends it off to Amazon in addition to writing it to etcd (admittedly you would need something pulling messages off the queue and notifying all attached subscribers of the message, but that's easy to do).

Just wanted to explain the system and some of the intentions behind it when it was originally written, since that will likely inform this conversation. Additionally, do we have any examples of today's pattern demonstrating actual performance issues?

Need to think a bit more on the proposed alternatives. I'm having a hard time understanding what value would be gained.

eriknelson commented 6 years ago

The work engine/subscribers/jobs is also mostly syntactic sugar on top of the fan-in pattern, https://blog.golang.org/pipelines

eriknelson commented 6 years ago

If you replace observer with different typed subscribers (since really they're the same thing, different words), expand the "engine" to take messages off the buffer and fan them out to subscribers, you've got the classic fan-in, fan-out golang pipeline:

MetricsSubscriber <-\                                      /-< job1
LoggingSubscriber <--- [Engine] <- [ buffered channel ] <---- job2
   DaoSubscriber <-/                                       \-< job3

maleck13 commented 6 years ago

@eriknelson @jmrodri thanks for your detailed replies, really appreciated. It is quite a lot of information so i will try to take all into account with my reply :)

@jmrodri is right, this statement is not true. You can have as many jobs (APB actions) as you like executing in parallel. Jobs will block on channel write if the rate at which the subscriber messages is less than the rate at which jobs are finishing.

Absolutely. This is my mistake, feel a bit silly for having missed the fact that the channel being used was buffered channel. As you mention however, if the buffer became full, things would begin to block.

Just wanted to explain the system and some of the intentions behind it when it was originally written, since that will likely inform this conversation.

It is extremely useful to get this context and appreciated.

They write when they error out, or finish their work, so a blocked job write isn't big deal. I foresee that may change if we start to introduce ongoing progress notifications, since of course you do not want to block a job trying to report its progress from continuing.

Yes this is part of what got me thinking about the current pattern. In my WIP implementation, I send a status update as the Job starts (to update the state for the last_operation and communicate "job started"). I also added a "statusupdate" channel , that exists for the lifetime of the Job, and gets passed into the job business logic. This channel is sent messages as we watch the pod and communicates the last_operation (it could potentially used for more than this in the future). The Job reads from this new channel and, as the Job progresses, with each status/last_operation update it receives, it then sends through a state update through the original JobMsg channel. Finally when the job logic is completed a final message is sent to say Job succeeded or failed.
This may be made clearer by looking at the code only maybe 50 lines or so: https://github.com/maleck13/ansible-service-broker/blob/674aebb963ecda112591bbb82e583aaddd3f187b/pkg/broker/provision_job.go#L62

I was increasingly concerned that I would be overloading the original shared topic channel as the number of writes to it could significantly increase, plus now it was no longer the case that the Job only wrote to the channel on success or failure so the Job could become blocked. This got me thinking about a pattern that may avoid this potential issue altogether.

Also as I was reading code, I noticed we had several concerns involved as already mentioned: -state -logging -metrics maybe others..

This lead me down the Observer route.

If you replace observer with different typed subscribers (since really they're the same thing, different words), expand the "engine" to take messages off the buffer and fan them out to subscribers, you've got the classic fan-in, fan-out golang pipeline:

Totally agree same thing different words. This is effectively what I was trying to get at. One message is received and many things are told. I am not against the use of channels for this, but wondered would it improve the readability and reduce complexity if the observers were registered and referenced from the Job. However I freely admit that I likely have not given it enough thought and I definitely know less of the context and design decisions in the broker.

Hopefully this clarifies where I am coming from and why I opened the issue and gives some food for thought.

Edit

Jobs should not be writing their own state to etcd however, this is decoupled from the job deliberately. Tomorrow I may want to also attach an SQS subscriber that sends it off to Amazon in addition to writing it to etcd (admittedly you would need something pulling messages off the queue and notifying all attached subscribers of the message, but that's easy to do).

After rereading, I wanted to revisit this point. I completely agree the Jobs should know nothing about etcd. I think this could be easily decoupled behind a storage api interface that is passed to the Job. Whether the implementation of the API is writing to etcd or sqs or a file or all three has no impact on the Job's use of storage API, but would have the added benefit of removing the indirection created by a channel. However the fan in fan out / observer approach has an added benefit of allowing the different concerns to be separated and added to independently of the Job so overall I think if we were to make a change to the current design, this kind of approach would be better long term

End Edit

On a final note, I am really enjoying working on the broker and I am keen to get more involved.

eriknelson commented 6 years ago

@maleck13 Had a chat with @shawn-hurley and @jmrodri last week about this more in the office; I think your point stands that a single buffered channel is certainly an opportunity for bottlenecks, particularly in the case of granular status updates. Are you available sometime this week to jump on a call?

On a final note, I am really enjoying working on the broker and I am keen to get more involved.

It's great to see more folks getting involved, thank you for your contributions and feedback!

maleck13 commented 6 years ago

@eriknelson Yes can be available. Thursday is a good day for me. I am on GMT time so the afternoon prob makes the most sense.

eriknelson commented 6 years ago

Rescheduling this call for early next week due to the monster snow storm in Raleigh.

maleck13 commented 6 years ago

Based on conversations, I am happy to take what we discussed (fan in fan out) and turn it into a proposal that can serve to communicate the design and principals and have you guys way in on the discussion there to get to the best solution

eriknelson commented 6 years ago

Attaching the design diagram from the meeting this morning:

photo_2018-01-25_14-28-18

maleck13 commented 6 years ago

tracking this here https://issues.jboss.org/browse/AEROGEAR-2010 but will add any updated to the issue also. The Jira issue is more to indicate when the issue is being worked on

openshift / ansible-service-broker

Proposal: replace the current subscriber pattern #638

Proposal