Open stefanhaller opened 1 year ago
I read up on concurrency in go a bit more, and I'm getting dizzy. 😄
I just discovered the "-race" command line option of go, which seems very useful. For example, running lazygit as go run -race main.go
(or rather something like GORACE="log_path=/tmp/lazygit-race" go run -race main.go
) logs quite a large number of data races. I feel we have a lot of work to do to get our use of mutexes right...
Curious about your thoughts on this @jesseduffield.
Yep, our concurrency model is shocking haha. You can use a read-write mutex which allows multiple reads at once but exclusive writes, but I don't like the idea of adding a bunch of locks and unlocks everywhere, nor do I have a good idea what a more callback approach would look like (where the mutex stuff is encapsulated).
I do try to make it so that model objects are replaced rather than mutated when we refresh, and I try to assign a model object to a variable and then use that variable in a keybinding handler so that if a refresh happens we might be working with stale data but we won't get a slice bounds error
But it would be good to structure things in a way where we guarantee that while you're in a handler, the model won't change. All model changes should happen in the refresh helper so that sounds doable to me. Other forms of mutation like gui state and context state I'm less clear on how to fix that.
But I would love to solve this problem and have that race detector find no issues so we can add it as a CI step
Looking closer at the code, I think the invariant that a given model slice won't be mutated does hold in this case, but we can't make use of that fact because in getDisplayStrings we use c.Model().Commits
which may be a different slice to the one we worked with when obtaining the end index. We can fix this in the short term by just clamping the end index again inside presentation.GetCommitListDisplayStrings
before we use it (I've confirmed this works).
diff --git a/pkg/gui/presentation/commits.go b/pkg/gui/presentation/commits.go
index b4297f6ed..1ecc413e8 100644
--- a/pkg/gui/presentation/commits.go
+++ b/pkg/gui/presentation/commits.go
@@ -71,6 +71,10 @@ func GetCommitListDisplayStrings(
// this is where my non-TODO commits begin
rebaseOffset := utils.Min(indexOfFirstNonTODOCommit(commits), endIdx)
+ // endIdx may have been based on a prior instance of the commits slice so we
+ // clamp it here to ensure it's within bounds
+ endIdx = utils.Clamp(endIdx, 0, len(commits))
+
filteredCommits := commits[startIdx:endIdx]
bisectBounds := getbisectBounds(commits, bisectInfo)
I do try to make it so that model objects are replaced rather than mutated when we refresh, and I try to assign a model object to a variable and then use that variable in a keybinding handler
That's good, but both of these only reduce the likelihood of it being a problem. The race is still there.
I really see only two ways to solve this:
onUIThread
that sets the data. The downside is that you can no longer do a sync refresh and then repaint to ensure that there are no UI glitches, like we do here. This makes it a deal-breaker for me.We can fix this in the short term by just clamping the end index again inside
presentation.GetCommitListDisplayStrings
before we use it
I really don't think this is an appropriate fix. It only works around the panic, but it doesn't really fix the behavior. What if the slice gets longer rather than shorter? We're only drawing half of the view in that case. It's just not right.
Fair point: that would indeed only fix panics.
As for the bigger fix, perhaps we should do some research into how other gui frameworks (e.g. tview) handle concurrency and see how easily we could adapt the same approach
Tview has a documentation section about this, but it doesn't say much except that most of tview is not thread-safe, and doesn't have to because you typically interact with it on the main thread only. It doesn't say anything about model data, it seems it's simply the client's responsibility to deal with this in whatever way they want.
I'm coming back to this now. I made a PR (#3019) that allows running integration tests with the -race
flag, and this shows tons of concurrency problems, not just for model data but also for views.
I spent a bit of time experimenting with more locking, for example to extend gocui.View's writeMutex to a general mutex that protects all its fields, not just the lines buffer. And also to put more locks around reading model data, as discussed above. I quickly got the feeling that this is not a viable approach; it's just too messy to get all the locking right.
So next I'm planning to experiment with the opposite approach: make it so all model data (and maybe even views?) is read and written only from the main thread. Concretely this means that refresh can still happen in a goroutine (e.g. calculate a new slice of model commits), but then the final assignment to the Model struct needs to be done with an OnUiThread
. And the PostRefreshUpdate too. This should work as long as we make sure we never mutate model slices, only assign new ones.
As I said above, this makes it impossible to do a sync refresh and repaint to ensure you can do multiple ctrl-j in quick succession without glitches; but I suppose this could be solved by introducing a new "UI is blocked" mode where we don't dispatch commands but buffer them until the mode is over. This would allow ctrl-j to use a normal WithWaitingStatus instead of the hacked WithWaitingStatusSync that I added in #2966; I think that would be cleaner anyway.
I'd like to hear your thoughts though before I start, as I expect it to be a pretty huge amount of work. Do you think it's worth trying? Can you foresee any problems that I might be missing?
Centralizing the writing of model data to one place is easy, but centralizing the reading will be very complex. It would be worth just testing it out for a single model e.g. commits and see how many touchpoints there are. I'd also use a read-write mutex which allows multiple concurrent reads so long as there's no writes
but centralizing the reading will be very complex
Why do you think so? Maybe I'm naïve, but my thinking was that whenever a go routine needs to look at model data, it needs to be passed to it in a variable.
My hope would be that this allows us to get rid of the model mutexes altogether.
Sorry forgot to respond to this. My assumption is that lots of code gets run in goroutines and there's lots of model data that needs to be read, meaning it would be hard to pass it all in (and to pass it around to helper functions). I suspect that using a read-write mutex would end up being less complicated, though still noisy.
But like I said it's hard to know without just trying something and seeing how the result looks
When running this test in a loop, it fails roughly 1 out of 5 times with the following stack:
I think this is a concurrency issue: the commits list is being re-rendered at the end of
refreshBranches
, butrefreshBranches
only locks theRefreshingBranchesMutex
, but not theLocalCommitsMutex
. The following patch seems to fix the problem:However, I'm not sure this is the right fix. To be honest, I'm very confused at the use of model mutexes in lazygit; it seems that only mutation of model data is protected by mutexes, but not use of that data. For example,
RefreshingBranchesMutex
is only locked inrefreshBranches
and nowhere else, so it only protects the mutation of the branches model (and the one re-render that happens inside that function). However, the branches list can be rendered from other places, and there's a lot of other code that uses the branches model; shouldn't all these places lock the mutex as well?