nirmata / workflow

A ZooKeeper and Curator based distributed workflow management library that enables distributed task workflows.
http://nirmata.github.io/workflow
Apache License 2.0

uneven workflow distribution #19

Closed dtoledo67 closed 8 years ago

dtoledo67 commented 8 years ago

I am using 3 Workflow Managers distributed on 3 EC2 instances. The Workflow Managers belong to the same namespace and they execute the same type of task. The workflow managers have the same configuration (task executor, task type, quantity, ...). I observed that the number of tasks executed on each instance is significantly and consistently different. For instance:

Tasks executed by workflowManager 1: 7684
Tasks executed by workflowManager 2: 12120
Tasks executed by workflowManager 3: 15296

I reproduced these results using 3 Workflow Managers running in the same JVM and using the test server (see the attached file, WorkflowTester.java.txt). I used nirmata-workflow-0.5.1 and org.apache.curator:curator-test:2.8.0 for this test.

Randgalt commented 8 years ago

Interesting - distribution is essentially random. Each worker will take tasks as it can. I'll have a look to see if there's something other than randomness causing this. If it is indeed random, though, should we introduce some kind of distribution mechanism that tries to evenly distribute tasks?

dtoledo67 commented 8 years ago

If it is really random then I don't think we need another mechanism: small variations are fine. The problem here is that it doesn't look random. Here is how it looks on our production system with about 1 million tasks executed. In this case, we have about 15 different task types. The level of concurrency configured for each type of task is different. The pattern looks more or less the same for each task type. After running multiple experiments I think the pattern is more visible when you execute a large number of tasks.

[image: Inline image 2]


Randgalt commented 8 years ago

I believe I know what's happening. Internally, Workflow uses Curator's DistributedQueue to get tasks that need processing. Unfortunately, ZooKeeper only provides getChildren() to get children under a node - i.e. it returns ALL children. Each Task worker then tries to process all current tasks. There's code to prevent duplicate processing, but this could mean that the first process to call getChildren() will likely process more tasks than the other processes.

Fortunately, there is a workaround. Curator's DistributedPriorityQueue processes children a small amount at a time to allow higher-priority new items to be inserted. I did some testing with this and it seems to help. So, simply change your TaskType's mode from the default to TaskMode.PRIORITY. I ran your tests with the default and I get this:

workflow tester workflowManager-1 has executed 376 tasks
workflow tester workflowManager-3 has executed 336 tasks
workflow tester workflowManager-2 has executed 288 tasks

Using TaskMode.PRIORITY (the only change) I now get this:

workflow tester workflowManager-2 has executed 353 tasks
workflow tester workflowManager-3 has executed 313 tasks
workflow tester workflowManager-1 has executed 334 tasks

So, if possible, please test using TaskMode.PRIORITY. It shouldn't affect performance as all tasks will have the same priority by default.
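
The effect described above can be illustrated with a deliberately simplified toy model. This is only a sketch, not Curator or Workflow code: the class name QueueSkewSketch, the simulate method, and the burst and chunk parameters are all hypothetical, and the model assumes that tasks arrive in bursts and that the same worker always polls first after each burst. A chunk of Integer.MAX_VALUE stands in for "claim everything one getChildren() snapshot returned"; a small chunk stands in for a queue that hands out children a few at a time.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Toy model only: it does not use ZooKeeper, Curator, or Workflow.
final class QueueSkewSketch {

    /**
     * Simulates workers draining a shared queue.
     * Assumption: after every burst the workers poll in the same fixed order
     * (worker 0 first), and each poll claims at most `chunk` pending tasks.
     * Returns the number of tasks executed per worker.
     */
    static int[] simulate(int workers, int bursts, int burstSize, int chunk) {
        int[] executed = new int[workers];
        Deque<Integer> pending = new ArrayDeque<>();
        int nextTaskId = 0;

        for (int b = 0; b < bursts; b++) {
            // A burst of new tasks is enqueued.
            for (int i = 0; i < burstSize; i++) {
                pending.add(nextTaskId++);
            }
            // Workers poll in a fixed order until the burst is drained.
            int turn = 0;
            while (!pending.isEmpty()) {
                int worker = turn % workers;
                turn++;
                int claimed = 0;
                while (claimed < chunk && !pending.isEmpty()) {
                    pending.poll();
                    claimed++;
                }
                executed[worker] += claimed;
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        // Snapshot-style consumption: the first poller claims the whole burst.
        System.out.println("whole snapshot: "
                + Arrays.toString(simulate(3, 100, 10, Integer.MAX_VALUE)));
        // Chunked consumption: each poll claims a single task.
        System.out.println("chunk of 1:     "
                + Arrays.toString(simulate(3, 100, 10, 1)));
    }
}
```

In this model the whole-snapshot run gives every task to the first poller, while the chunked run spreads them far more evenly. Real workers race each other rather than polling in a fixed order, so the actual skew is much milder than in the toy model, as the numbers reported in this thread show.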

Randgalt commented 8 years ago

I've rewritten the queue code to distribute tasks more evenly. Please test with: https://github.com/NirmataOSS/workflow/tree/simple-queue (i.e. branch "simple-queue")