jclouds / legacy-jclouds-labs

https://jclouds.apache.org
0 stars 18 forks source link

fixed race condition when creating networks and firewalls #14

Closed dralves closed 11 years ago

dralves commented 11 years ago

when creating a multinode cluster where all nodes are in the same group, there was a race condition where all node creating threads would verify the group's network didn't exist and would try to create it.

of course only one would and the rest would throw errors.

the fix is trivial (add synchronized)

mattstep commented 11 years ago

I think you're explaining it just fine, it sounds broken. You are locking on a strategy object that has nothing to do with the operation itself, there is nothing about this object that indicates that it is unique in this transaction. The transaction can be done with multiple instances of this object without any issues. It is not an obvious candidate for the lock to be held on.

The execute method is written in a way that assumes it is called once, I am asking you to ensure this behavior.

Also, don't use assert in the execute method, throw an exception.

dralves commented 11 years ago

I'm honestly not seing how it sounds broken how your solution improves it.

I'm saying: I want this bunch of nodes, you can create them in parallell, they need a network and the network needs to be created for a particular group if it doesn't exists so serialize getOrCreateNetwork. Everything else can continue to be parallell.

You're trying to change the semantics to: "everything needs to be serialized independently of which group nodes belong to" and I'm not seing as that can be advantageous.

"You are locking on a strategy object that has nothing to do with the operation itself, there is nothing about this object that indicates that it is unique in this transaction"

This particular strategy is where resources like security groups are created in other cases, so it has become the "de facto" standard place to so this kind of thing. Moreover making execute() synchronized would still make the lock be held by this object which does not improve on "It is not an obvious candidate for the lock to be held on".

Now one could argue that an external function with getOrCreate semantics that'd have the serialization builtin might be a cleaner implementation we could even improve it later with per-group or per-network locks instead of what is in fact a global lock, but I still think it's better than to serialize everything.

In any case this issue makes GCE boken and I'd like to fix it so I'll do as you suggest.

mattstep commented 11 years ago

If you are going to synchronize, just synchronize the execute. If you want to actually fix this problem, then find a way to ensure those methods are only executed once, since there is no reason for them to be executed repetitively.

dralves commented 11 years ago

move the synchronized to the execute() method and addressed the other comments. @mattstep getOrCreate* need to be called once for a group and not once in general, making sure the network call is done only once would require cacheing, a strategy from which I've been discouraged in the past, so I think this is as good as it gets for now.

codefromthecrypt commented 11 years ago

normally, I've used a LoadingCache for things like getOrCreate in order to save thrashing and network calls, plus it has built-in concurrency controls. The current synchronized approach will work, trading efficiency for simplicity. I'm ok with this change for now.

dralves commented 11 years ago

I'd still prefer the fine grained per method synch, but I understand that making the whole think a synch block is easier to understand. @adriancole with your +1 could you merge plz? I'd like to make sure this gets into the next 1.6.0.x otherwise GCE will be broken when used the way that a few downstream projects use it.

codefromthecrypt commented 11 years ago

yeah I'll merge as changing this in any other way would be a lot of code.