GeorgeJahad opened 5 years ago
> On the other hand, maybe rebalancing is a premature optimization at this point and a simple workq is sufficient.
Thinking about current services that don't rebalance and the problems that causes us: I'd rather we consider rebalancing in the design early than try to force it in later.
@GeorgeJahad, @jjbuchan, some quick notes for now, and I'll review the code questions later today or Monday.
> WorkAllocator specific comments:
>
> I'm confused by this if test: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L404 We own the active key, so don't we already know it exists?
Yes we do, so it's an "always true" condition just to force the transactional put+delete in the "then" condition. Doing this in a transaction is probably overkill, but seemed like a logical grouping of operations in any case.
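For reference, a minimal jetcd sketch of that shape. The endpoint, key names, and values are made up, and this uses the current io.etcd.jetcd package coordinates rather than the older com.coreos.jetcd ones the prototype uses; the txn API is the same:

```java
import java.nio.charset.StandardCharsets;

import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.KV;
import io.etcd.jetcd.kv.TxnResponse;
import io.etcd.jetcd.op.Cmp;
import io.etcd.jetcd.op.CmpTarget;
import io.etcd.jetcd.op.Op;
import io.etcd.jetcd.options.DeleteOption;
import io.etcd.jetcd.options.PutOption;

public class ReleaseTxnSketch {
  public static void main(String[] args) throws Exception {
    Client client = Client.builder().endpoints("http://localhost:2379").build();
    KV kv = client.getKVClient();
    ByteSequence activeKey = bs("active/work-1");
    ByteSequence registryKey = bs("registry/work-1");

    // We already own activeKey, so "version > 0" is effectively always
    // true; it exists only to group the put and the delete into a
    // single atomic transaction.
    TxnResponse resp = kv.txn()
        .If(new Cmp(activeKey, Cmp.Op.GREATER, CmpTarget.version(0)))
        .Then(Op.put(registryKey, bs("work definition"), PutOption.DEFAULT),
              Op.delete(activeKey, DeleteOption.DEFAULT))
        .commit()
        .get();
    System.out.println("released atomically: " + resp.isSucceeded());
    client.close();
  }

  static ByteSequence bs(String s) {
    return ByteSequence.from(s, StandardCharsets.UTF_8);
  }
}
```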
> Just to be clear, version "0" here means it doesn't exist?: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L488
Yes, and I just added a new unit test to verify the assumption: https://github.com/itzg/try-etcd-work-allocator/blob/e21fa9074034073f16f0fc76b088687def6103b7/src/test/java/me/itzg/tryetcdworkpart/services/WorkAllocatorTest.java#L377
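The flip side of that fact is the standard etcd create-if-absent idiom, where a version(0) comparison acts as a "key does not exist" guard. A sketch, reusing kv, bs(...), and the illustrative keys from the example above:

```java
// version == 0 ("key does not exist") guards a create-if-absent:
// the put only runs if no other worker has created the active key yet.
TxnResponse grab = kv.txn()
    .If(new Cmp(activeKey, Cmp.Op.EQUAL, CmpTarget.version(0)))
    .Then(Op.put(activeKey, bs("worker-1"), PutOption.DEFAULT))
    .commit()
    .get();
boolean weGotIt = grab.isSucceeded(); // false: someone else owns the key
```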
> If the transaction failed, but we stop processing here, then we still have the active key, so no one else is going to process it either, right? That seems like a problem: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L423
Yeah, the code was already feeling complicated and the failure case is so rare (other than etcd failure) that I punted on any kind of recovery logic here. We could add retry logic, but I suspect the etcd client is already doing that. I tried to use warning logs sparingly and intend that each be a cause for human investigation.
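Just to illustrate what bounded retry could look like if we wanted it (this is hypothetical, not in the prototype; log is assumed to be an SLF4J-style logger):

```java
// Hypothetical recovery sketch: retry the active-key delete a few times
// before falling back to a warning that calls for human investigation.
void releaseActiveKey(KV kv, ByteSequence activeKey) {
  for (int attempt = 1; attempt <= 3; attempt++) {
    try {
      kv.delete(activeKey).get();
      return; // released successfully
    } catch (Exception e) {
      log.warn("Failed to delete active key (attempt {})", attempt, e);
    }
  }
  log.warn("Giving up on {}; needs human investigation", activeKey);
}
```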
> Similarly, this gets popped even if the txn fails: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L369
Good point. Perhaps the work ID should be pushed back if the `releaseWork` call completes with a `false`.
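A sketch of that fix, assuming workIds is a Deque&lt;String&gt; and releaseWork returns a CompletableFuture&lt;Boolean&gt; reflecting the transaction outcome (both names are taken from this discussion, not checked against the code):

```java
// Pop optimistically, but push the work ID back if the release
// transaction reports failure, so a later pass can retry it.
void releaseNext(Deque<String> workIds) {
  final String workId = workIds.pop();
  releaseWork(workId)
      .thenAccept(released -> {
        if (!released) {
          workIds.push(workId); // re-queue for another attempt
        }
      });
}
```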
> Don't we have to guard these in a try/finally: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L391
My assumption is that exceptions during etcd calls will get passed to this `handle` call, where the semaphore is released: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L419
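The pattern being relied on, sketched with a put standing in for whatever etcd call is in flight (semaphore and log are assumed fields):

```java
// handle(...) runs whether the put completes normally or exceptionally,
// so the permit is always returned without needing a try/finally
// around the asynchronous call itself.
void putWithPermit(KV kv, ByteSequence activeKey, ByteSequence value)
    throws InterruptedException {
  semaphore.acquire();
  kv.put(activeKey, value)
      .handle((putResponse, throwable) -> {
        semaphore.release(); // reached on success and on failure
        if (throwable != null) {
          log.warn("etcd put failed", throwable);
        }
        return null;
      });
}
```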
> We are starting from rev 0 here. Will that lead to a lot of unneeded work? https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L252 https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L330
Negative and zero are special values that instruct the watch to start from the newest revision: https://github.com/etcd-io/jetcd/blob/859ba2771afcc73b27a5916c4e0969606f688d9d/jetcd-core/src/main/java/com/coreos/jetcd/options/WatchOption.java#L55
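In jetcd terms (lastSeenRevision here is a hypothetical previously recorded revision):

```java
// Revision <= 0 tells the watch to begin at the newest revision, so a
// watch built with revision 0 does not replay historical events.
WatchOption fromNewest = WatchOption.newBuilder()
    .withRevision(0)
    .build();

// To replay from a known point instead, pass an explicit revision:
WatchOption fromKnown = WatchOption.newBuilder()
    .withRevision(lastSeenRevision + 1)
    .build();
```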
> Yes we do, so it's an "always true" condition just to force the transactional put+delete in the "then" condition.
My question is: can't you just remove that call to the if() method, if it is always true? In other words, don't etcd txn()s allow then()s without if()s?
> Yeah, the code was already feeling complicated and the failure case is so rare (other than etcd failure) that I punted on any kind of recovery logic here.
In that case, since the work stops no matter what, why does the workload get incremented again on failure here?: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L431
> My assumption is that exceptions during etcd calls will get passed to this `handle` call, where the semaphore is released
ok, thanks
> Negative and zero are special values that instruct the watch to start from the newest revision
Ah, good to know, thanks.
Thanks @GeorgeJahad
> My question is: can't you just remove that call to the if() method, if it is always true? In other words, don't etcd txn()s allow then()s without if()s?
Oh interesting, you can indeed have a transaction execute the `Then` block without an `If`:
https://github.com/itzg/try-etcd-work-allocator/commit/a257767a727bd76bff459edbbf317e139552403e#diff-9aabd8dc24134c3e25e37c2b3bad5907R399
I have updated that code to remove the "always true" if-condition.
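That is, something along these lines (illustrative keys, not the commit's exact code):

```java
// With no If(...) comparisons, the Then ops run unconditionally but
// still as a single atomic transaction.
kv.txn()
    .Then(Op.put(registryKey, bs("work definition"), PutOption.DEFAULT),
          Op.delete(activeKey, DeleteOption.DEFAULT))
    .commit()
    .get();
```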
> In that case, since the work stops no matter what, why does the workload get incremented again on failure here?: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L431
Since the transaction failed, we know the work load stored in our active key is the non-decremented value. The increment was there to keep the local value consistent with the value stored in etcd in the active key; however, I can now see that the code could leave the local value alone and re-store the correct value on the next operation.
I changed the referenced code and `grabWork` here: https://github.com/itzg/try-etcd-work-allocator/commit/ae841a1b55c876e563bafe326db4e2534655a5d5
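To illustrate the difference in the failure branch, assuming a hypothetical local workLoad counter (an AtomicInteger) mirroring the value stored in the active key:

```java
// Hypothetical: workLoad locally mirrors the load stored in the active key.
int desired = workLoad.decrementAndGet();
releaseWork(workId)
    .thenAccept(released -> {
      if (!released) {
        // before: workLoad.incrementAndGet() to stay consistent with etcd;
        // after: leave it alone and let the next successful operation
        // re-store the correct value in the active key.
      }
    });
```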
**Atomix**
First, I looked at the workq described here: https://atomix.io/docs/latest/api/io/atomix/core/workqueue/WorkQueue.html
While it does some of what the WorkAllocator does, it doesn't seem to address the cross-worker rebalancing that WorkAllocator implements. If we want that, we still have to implement much of what is in WorkAllocator. On the other hand, maybe rebalancing is a premature optimization at this point and a simple workq is sufficient.
However, the Atomix workq also seems to require some cluster configuration, because the workq is implemented over a cluster formed by the user's nodes. Since we are already doing that for etcd, it seems a bit redundant.
Bottom line: Atomix is not a clear winner to me (although it probably would be if it were built on top of etcd), but I wouldn't object to using it if others felt differently.
I've googled around some but haven't found a better etcd-specific alternative to Atomix workq's.
**WorkAllocator specific comments:**
I'm confused by this if test: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L404 We own the active key, so don't we already know it exists?
Just to be clear, version "0" here means it doesn't exist?: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L488
If the transaction failed, but we stop processing here, then we still have the active key, so no one else is going to process it either, right? That seems like a problem: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L423 Similarly, this gets popped even if the txn fails: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L369
Don't we have to guard these in a try/finally: https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L391 https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L474
We are starting from rev 0 here. Will that lead to a lot of unneeded work? https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L252 https://github.com/itzg/try-etcd-work-allocator/blob/master/src/main/java/me/itzg/tryetcdworkpart/services/WorkAllocator.java#L330
**Introductory WorkAllocator documentation that would have helped me**
The code in this module allows processes to add/update/delete tasks in a distributed workq, and allows the workers processing the q to be aware of each other and to increase/reduce their share of the total work as workers come and go.
**Data structures**
The workq is implemented with a set of etcd kv's prefixed with the name "registry". Each new task in the q is a new kv pair with that prefix.
As workers pick up tasks from the q, they create an analogous entry in etcd with the prefix "active". That way the other workers know the task is in progress.
Finally, there is a set of kv pairs with the prefix "workers". This allows the workers to know of the existence of the others so they can shed work if the load becomes unbalanced.
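Concretely, the key layout looks something like this (the exact key shapes and value contents are illustrative, not copied from the code):

```
registry/{workId}   -> the task definition, written by whoever submits work
active/{workId}     -> written by the worker that grabbed the task
workers/{workerId}  -> one entry per live worker, used for rebalancing
```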
**Threads**
This module manages the work/worker q's by creating three threads, each of which watches one of the prefixes listed above:
The registry thread watches for incoming work in the "registry" kv pairs and "grabs" it in an etcd transaction to ensure that no two workers grab the same task.
The worker thread checks for incoming workers; as new workers arrive, it determines what its new share of the total load should be and sheds tasks when it determines it has too many.
Finally, the active thread checks for tasks shed by other processes and attempts to grab them if the current process is underloaded.
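A stripped-down sketch of that wiring, using jetcd watch callbacks in place of the allocator's actual thread handling (the endpoint and handler bodies are placeholders, and this again uses the io.etcd.jetcd package names rather than the older com.coreos.jetcd ones):

```java
import java.nio.charset.StandardCharsets;

import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;
import io.etcd.jetcd.options.WatchOption;

public class WatchSketch {
  public static void main(String[] args) throws InterruptedException {
    Client client = Client.builder().endpoints("http://localhost:2379").build();
    Watch watch = client.getWatchClient();

    watchPrefix(watch, "registry/"); // incoming work to grab
    watchPrefix(watch, "active/");   // tasks grabbed or shed by workers
    watchPrefix(watch, "workers/");  // workers joining and leaving

    Thread.sleep(Long.MAX_VALUE); // keep the demo process alive
  }

  private static void watchPrefix(Watch watch, String prefix) {
    ByteSequence key = ByteSequence.from(prefix, StandardCharsets.UTF_8);
    WatchOption option = WatchOption.newBuilder().withPrefix(key).build();
    watch.watch(key, option, response ->
        response.getEvents().forEach(event ->
            System.out.println(prefix + " event: " + event.getEventType())));
  }
}
```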