meshcloud / unipipe-service-broker

GitOps Open Service Broker - exposes service instances and bindings as files in a git repository
Apache License 2.0

fix: local commit detection not thread safe #120

Closed tracemeyers closed 4 months ago

tracemeyers commented 1 year ago

synchronizeWithRemoteRepository did not handle the scenario where other threads may modify the local repository between the merge operation and when hasLocalCommits is set to false. This can cause the local view of the repo to become out of sync with the remote. At best it is resolved when another OSB operation, e.g. a provision request, creates a new commit with the side effect that it signals there are local changes. At worst the commits are lost if the repo storage is ephemeral and is destroyed before the commits are pushed.

JohannesRudolph commented 12 months ago

Hey @tracemeyers thank you so much for your pull request and sorry for leaving this open and unresponded for so long (I didn't get a notification about it, which is my fault for not setting them up on GitHub).

I appreciate you tackling a change in this intricate part of the code. To help us move the fix forward, can you provide a test case that fails before and passes after the change? We already have some similar test cases, for example in ConcurrentGitInteractionsTest.

If you also rebase your changes against main (we recently changed from master to main) the GH actions should also run the tests for your changes.

JohannesRudolph commented 12 months ago

synchronizeWithRemoteRepository did not handle the scenario where other threads may modify the local repository between the merge operation and when hasLocalCommits is set to false.

On the service broker side, all git interactions should go through GitHandlerService, which is a @Service and therefore a singleton. I would consequently assume that there is no other thread working on the git repo. Can you clarify what you mean here?

tracemeyers commented 12 months ago

On the service broker side, all git interactions should go through GitHandlerService, which is a @Service and therefore a singleton. I would consequently assume that there is no other thread working on the git repo. Can you clarify what you mean here?

It's been a while now, but I believe the sequence of events that resulted in a race condition was something like this...

  1. A - calls synchronizeWithRemoteRepository which in turn calls fetchMergePush. Let's assume thread A is preempted at an inopportune time, like just after the push near the end.
  2. B - starts running (e.g. due to a provision request), calls commitAllChanges and completes. ~This has the side effect of setting hasLocalCommits to true.~ (Update: it was already true.)
  3. A - resumes execution, returns from fetchMergePush with a return value of true, and executes this code which results in hasLocalCommits being switched from true to false.
      val pushedSuccessfully = fetchMergePush()
      if (pushedSuccessfully) {
        hasLocalCommits.set(false)
      }

At this point, whenever synchronizeWithRemoteRepository is called it will exit early because hasLocalCommits is false. This results in the pending commit going unnoticed by unipipe until an external event (e.g. a (de)provision API call) updates the hasLocalCommits flag. However, if an external event never happens and the unipipe instance is torn down while its git storage is ephemeral, then the commit is permanently lost.
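The interleaving above can be replayed deterministically without a real git repo. The sketch below (plain Java; `hasLocalCommits` mirrors the Kotlin field, everything else is illustrative) executes steps 1–3 in order and shows the flag ending up out of sync with the repo state:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class LostCommitRace {
    // Mirrors hasLocalCommits in GitHandlerService (illustrative stand-in).
    static final AtomicBoolean hasLocalCommits = new AtomicBoolean(true);
    // Commits that exist locally but have not been pushed yet.
    static final AtomicInteger unpushedCommits = new AtomicInteger(1);

    public static void main(String[] args) {
        // Step 1: thread A runs fetchMergePush(), pushes the pending commit,
        // and is preempted right after the push.
        unpushedCommits.set(0);
        boolean pushedSuccessfully = true;

        // Step 2: thread B handles a provision request and commits new changes.
        unpushedCommits.incrementAndGet();
        hasLocalCommits.set(true); // already true here, as noted above

        // Step 3: thread A resumes and unconditionally clears the flag.
        if (pushedSuccessfully) {
            hasLocalCommits.set(false);
        }

        // Result: a commit is pending, but the flag says there is nothing to
        // push, so synchronizeWithRemoteRepository() will exit early from now on.
        System.out.println("unpushed=" + unpushedCommits.get()
                + " hasLocalCommits=" + hasLocalCommits.get());
    }
}
```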

It is a relatively rare race condition outcome but one I encountered often enough that I had to resolve it to get the occasional (de)provision flowing without manual intervention. Also the risk of data loss is pretty serious.
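One way to close that window, independent of inspecting the repo directly, is to clear the "dirty" state only up to a snapshot taken before the push, so a commit that races in mid-push still counts as pending. A minimal generation-counter sketch (all names are illustrative, not the project's actual API):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SafeFlagClear {
    // Bumped on every local commit; replaces the plain boolean flag.
    private final AtomicLong commitGeneration = new AtomicLong(0);
    private long lastPushedGeneration = 0;

    void onLocalCommit() {
        commitGeneration.incrementAndGet();
    }

    boolean hasLocalCommits() {
        return commitGeneration.get() != lastPushedGeneration;
    }

    // duringPush stands in for fetchMergePush(); in the real race, another
    // thread's commit can land while the push is in flight.
    void synchronizeWithRemote(Runnable duringPush) {
        long observed = commitGeneration.get(); // snapshot BEFORE the push
        duringPush.run();
        // Clear only up to the snapshot: commits made during the push still
        // count as pending instead of being silently forgotten.
        lastPushedGeneration = observed;
    }

    public static void main(String[] args) {
        SafeFlagClear s = new SafeFlagClear();
        s.onLocalCommit();                          // one pending commit
        s.synchronizeWithRemote(s::onLocalCommit);  // a commit races in mid-push
        System.out.println("hasLocalCommits=" + s.hasLocalCommits());
    }
}
```

With the unconditional `hasLocalCommits.set(false)` from the snippet above, the same scenario would print `false` and the racing commit would go unnoticed.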

tracemeyers commented 12 months ago

To enable us to tackle the issue, can you provide a test case that fails before and passes after the change?

That's fair. It might take me a while to work on this, but I'm still keen on getting this merged so I don't have to maintain a fork :grimacing:

tracemeyers commented 12 months ago

@JohannesRudolph added a test. Had to rework the implementation a bit in order to make it testable.

I have zero experience with kotlin so don't be shy with suggestions to improve things.

JohannesRudolph commented 12 months ago

@tracemeyers now I understand what you mean, thanks a lot for the test case!

I can see that scenario happening under exactly one condition: there must be two threads active at the same time performing (write) operations on the git repo. HTTP requests can come in from different threads concurrently, so that fits the scenario.

However, we have GitOperationContextFactory to handle locking the git repo, with the idea explained in a comment at the top. In theory that should prevent this, as long as there's no code path that bypasses the lock (e.g. by getting a GitHandlerService directly from Spring....)

/**
 * Design: Provides a simple means to synchronize/serialize all git repository interactions.
 * Only one [GitOperationContext] can be active at a single time.
 *
 * Typical git interactions should be fairly short-lived and therefore queuing them should be fine.
 * Of course this limits the theoretical throughput of the system. However a typical unipipe deployment does not have to
 * handle a lot of requests per second so this should not be much of a problem. If it becomes a problem, we can optimize
 * this further at the cost of some additional complexity (e.g. separate read/write paths).
 */
@Service
class GitOperationContextFactory(

This invariant also ensures that even under concurrency we will have one clean git commit per broker operation, which would otherwise be hard to maintain.
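The serialization the comment describes boils down to a single mutual-exclusion gate in front of every git interaction. A minimal Java sketch of that design (illustrative, not the actual Kotlin implementation):

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class GitOperationGate {
    // Only one "git operation context" may be active at a time.
    private static final ReentrantLock repoLock = new ReentrantLock();

    static <T> T withGitContext(Supplier<T> operation) {
        repoLock.lock(); // queue behind any in-flight git interaction
        try {
            return operation.get();
        } finally {
            repoLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] commits = {0};
        // Two concurrent "provision requests": each commit + flag update runs
        // atomically with respect to all other git operations.
        Runnable provision = () -> withGitContext(() -> {
            commits[0]++;
            return null;
        });
        Thread a = new Thread(provision);
        Thread b = new Thread(provision);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("commits=" + commits[0]);
    }
}
```

As long as every caller goes through the gate, the merge-then-clear-flag sequence cannot interleave with another thread's commit; the race requires some path that bypasses it.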

Of course I don't want to rule out the possibility that there's a bug in there somewhere! I'll need to take a second look, especially around deprovisioning as you say.

tracemeyers commented 11 months ago

The invariant seems reasonable and I can't find any fault with it on a cursory read of the code base.

Ignoring the invariant, isn't it better to check the actual state of the repo than to check a flag that can become inaccurate? Obviously the cost is higher, but I'd rather have assured correctness with no risk of data loss than save a few milliseconds.
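"Check the repo, not the flag" means deriving the pending-work answer from commit ids instead of caching it in a boolean. A stdlib-only sketch of the idea (a real implementation would ask JGit whether HEAD is ahead of the remote-tracking branch; all names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RepoStateCheck {
    // Stand-in for the local commit log; the last element is HEAD.
    private final Deque<String> localLog = new ArrayDeque<>();
    private String pushedHead = null; // last commit known to be on the remote

    void commit(String id) { localLog.addLast(id); }

    void push() { pushedHead = localLog.peekLast(); }

    // Recomputed from actual state: true iff local HEAD is not the pushed HEAD.
    boolean hasLocalCommits() {
        String head = localLog.peekLast();
        return head != null && !head.equals(pushedHead);
    }

    public static void main(String[] args) {
        RepoStateCheck repo = new RepoStateCheck();
        repo.commit("c1");
        repo.push();
        repo.commit("c2"); // lands after the push, before any flag update
        // There is no flag to get out of sync: the answer follows the repo.
        System.out.println("hasLocalCommits=" + repo.hasLocalCommits());
    }
}
```

The trade-off is exactly the one raised above: each check costs a repo inspection instead of an atomic read, but it cannot silently disagree with the repository.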

I can only speak from experience where we have frequently encountered an OSB action that is not pushed to the remote and the logs clearly show unipipe does not think there are local changes. The simple workaround is to start another OSB action (e.g. (de)provision) that creates a new commit which ends up pushing both the existing commit for the missing OSB action and the commit for the new OSB action. However, there's still the possibility that if the container is torn down before another OSB action occurs then the missing OSB action is lost.

I doubt I'll have the time to try to identify the root cause of the above. Feel free to close the MR if my argument isn't convincing enough :smile:

JohannesRudolph commented 4 months ago

@tracemeyers I've had a better look at this and I think you have a very reasonable argument that your changes will make the implementation more resilient. So I will go ahead and merge in your changes, thanks so much for your contribution!

I've found a few potential places where the locking code could have been bypassed, though from just reviewing the code I still could not construct a scenario where that actually happens. I will close those loopholes as well.