Integration test submodule/reset.go often fails

stefanhaller commented 1 year ago

When running this test in a loop, it fails roughly 1 out of 5 times with the following stack:

panic: runtime error: slice bounds out of range [:2] with capacity 1

goroutine 582 [running]:
github.com/jesseduffield/lazygit/pkg/gui/presentation.GetCommitListDisplayStrings(0x140002f2d80, {0x1400000e140?, 0x1, 0x1}, {0x1400080e5f0, 0x2, 0x2}, {0x140002e6000, 0x4}, 0x0, ...)
        /Users/stk/Stk/Dev/Builds/lazygit/pkg/gui/presentation/commits.go:74 +0x708
github.com/jesseduffield/lazygit/pkg/gui/context.NewLocalCommitsContext.func2(0x1052f5f20?, 0x1400080ebe0?)
        /Users/stk/Stk/Dev/Builds/lazygit/pkg/gui/context/local_commits_context.go:43 +0x380
github.com/jesseduffield/lazygit/pkg/gui/context.(*ListRenderer).renderLines(0x1400014a510, 0xffffffffffffffff, 0xffffffffffffffff)
        /Users/stk/Stk/Dev/Builds/lazygit/pkg/gui/context/list_renderer.go:88 +0x160
github.com/jesseduffield/lazygit/pkg/gui/context.(*ListContextTrait).HandleRender(0x1400014a500)
        /Users/stk/Stk/Dev/Builds/lazygit/pkg/gui/context/list_context_trait.go:88 +0x44
github.com/jesseduffield/lazygit/pkg/gui/controllers/helpers.(*RefreshHelper).refreshBranches(0x1400072b340)
        /Users/stk/Stk/Dev/Builds/lazygit/pkg/gui/controllers/helpers/refresh_helper.go:452 +0x38c
github.com/jesseduffield/lazygit/pkg/gui/controllers/helpers.(*RefreshHelper).refreshReflogAndBranches(0x1?)
        /Users/stk/Stk/Dev/Builds/lazygit/pkg/gui/controllers/helpers/refresh_helper.go:257 +0x28
github.com/jesseduffield/lazygit/pkg/gui/controllers/helpers.(*RefreshHelper).Refresh.func2.1.1({0x140003137a0?, 0x14000313740?})
        /Users/stk/Stk/Dev/Builds/lazygit/pkg/gui/controllers/helpers/refresh_helper.go:105 +0x24
github.com/jesseduffield/gocui.(*Gui).onWorkerAux(0x0?, 0x0?, {0x10542a9d0?, 0x14000a90320?})
        /Users/stk/Stk/Dev/Builds/lazygit/vendor/github.com/jesseduffield/gocui/gui.go:680 +0x6c
github.com/jesseduffield/gocui.(*Gui).OnWorker.func1()
        /Users/stk/Stk/Dev/Builds/lazygit/vendor/github.com/jesseduffield/gocui/gui.go:667 +0x34
created by github.com/jesseduffield/gocui.(*Gui).OnWorker
        /Users/stk/Stk/Dev/Builds/lazygit/vendor/github.com/jesseduffield/gocui/gui.go:666 +0x9c

I think this is a concurrency issue: the commits list is being re-rendered at the end of refreshBranches, but refreshBranches only locks the RefreshingBranchesMutex, but not the LocalCommitsMutex. The following patch seems to fix the problem:

diff --git a/pkg/gui/controllers/helpers/refresh_helper.go b/pkg/gui/controllers/helpers/refresh_helper.go
index 578089af1..f04b102e4 100644
--- a/pkg/gui/controllers/helpers/refresh_helper.go
+++ b/pkg/gui/controllers/helpers/refresh_helper.go
@@ -449,9 +449,11 @@ func (self *RefreshHelper) refreshBranches() {

    // Need to re-render the commits view because the visualization of local
    // branch heads might have changed
+   self.c.Mutexes().LocalCommitsMutex.Lock()
    if err := self.c.Contexts().LocalCommits.HandleRender(); err != nil {
        self.c.Log.Error(err)
    }
+   self.c.Mutexes().LocalCommitsMutex.Unlock()

    self.refreshStatus()
 }

However, I'm not sure this is the right fix. To be honest, I'm very confused at the use of model mutexes in lazygit; it seems that only mutation of model data is protected by mutexes, but not use of that data. For example, RefreshingBranchesMutex is only locked in refreshBranches and nowhere else, so it only protects the mutation of the branches model (and the one re-render that happens inside that function). However, the branches list can be rendered from other places, and there's a lot of other code that uses the branches model; shouldn't all these places lock the mutex as well?

stefanhaller commented 1 year ago

I read up on concurrency in go a bit more, and I'm getting dizzy. 😄

I just discovered the "-race" command line option of go, which seems very useful. For example, running lazygit as go run -race main.go (or rather something like GORACE="log_path=/tmp/lazygit-race" go run -race main.go) logs quite a large number of data races. I feel we have a lot of work to do to get our use of mutexes right...

Curious about your thoughts on this @jesseduffield.

jesseduffield commented 1 year ago

Yep, our concurrency model is shocking haha. You can use a read-write mutex which allows multiple reads at once but exclusive writes, but I don't like the idea of adding a bunch of locks and unlocks everywhere, nor do I have a good idea what a more callback approach would look like (where the mutex stuff is encapsulated).

I do try to make it so that model objects are replaced rather than mutated when we refresh, and I try to assign a model object to a variable and then use that variable in a keybinding handler so that if a refresh happens we might be working with stale data but we won't get a slice bounds error

But it would be good to structure things in a way where we guarantee that while you're in a handler, the model won't change. All model changes should happen in the refresh helper so that sounds doable to me. Other forms of mutation like gui state and context state I'm less clear on how to fix that.

But I would love to solve this problem and have that race detector find no issues so we can add it as a CI step

jesseduffield commented 1 year ago

Looking closer at the code, I think the invariant that a given model slice won't be mutated does hold in this case, but we can't make use of that fact because in getDisplayStrings we use c.Model().Commits which may be a different slice to the one we worked with when obtaining the end index. We can fix this in the short term by just clamping the end index again inside presentation.GetCommitListDisplayStrings before we use it (I've confirmed this works).

diff --git a/pkg/gui/presentation/commits.go b/pkg/gui/presentation/commits.go
index b4297f6ed..1ecc413e8 100644
--- a/pkg/gui/presentation/commits.go
+++ b/pkg/gui/presentation/commits.go
@@ -71,6 +71,10 @@ func GetCommitListDisplayStrings(
    // this is where my non-TODO commits begin
    rebaseOffset := utils.Min(indexOfFirstNonTODOCommit(commits), endIdx)

+   // endIdx may have been based on a prior instance of the commits slice so we
+   // clamp it here to ensure it's within bounds
+   endIdx = utils.Clamp(endIdx, 0, len(commits))
+
    filteredCommits := commits[startIdx:endIdx]

    bisectBounds := getbisectBounds(commits, bisectInfo)

stefanhaller commented 1 year ago

I do try to make it so that model objects are replaced rather than mutated when we refresh, and I try to assign a model object to a variable and then use that variable in a keybinding handler

That's good, but both of these only reduce the likelihood of it being a problem. The race is still there.

I really see only two ways to solve this:

put locks everywhere (also when reading the state). I don't like this idea either, but what can you do.
ensure that the data is only read and written from the main thread. For the handlers this mostly seems to be the case as far as I can see; if they offload work into a go routine, they must pass the model data to it. For the refresh, we'd just need to have an onUIThread that sets the data. The downside is that you can no longer do a sync refresh and then repaint to ensure that there are no UI glitches, like we do here. This makes it a deal-breaker for me.

We can fix this in the short term by just clamping the end index again inside presentation.GetCommitListDisplayStrings before we use it

I really don't think this is an appropriate fix. It only works around the panic, but it doesn't really fix the behavior. What if the slice gets longer rather than shorter? We're only drawing half of the view in that case. It's just not right.

jesseduffield commented 1 year ago

Fair point: that would indeed only fix panics.

As for the bigger fix, perhaps we should do some research into how other gui frameworks (e.g. tview) handle concurrency and see how easily we could adapt the same approach

stefanhaller commented 1 year ago

Tview has a documentation section about this, but it doesn't say much except that most of tview is not thread-safe, and doesn't have to because you typically interact with it on the main thread only. It doesn't say anything about model data, it seems it's simply the client's responsibility to deal with this in whatever way they want.

stefanhaller commented 1 year ago

I'm coming back to this now. I made a PR (#3019) that allows running integration tests with the -race flag, and this shows tons of concurrency problems, not just for model data but also for views.

I spent a bit of time experimenting with more locking, for example to extend gocui.View's writeMutex to a general mutex that protects all its fields, not just the lines buffer. And also to put more locks around reading model data, as discussed above. I quickly got the feeling that this is not a viable approach; it's just too messy to get all the locking right.

So next I'm planning to experiment with the opposite approach: make it so all model data (and maybe even views?) is read and written only from the main thread. Concretely this means that refresh can still happen in a goroutine (e.g. calculate a new slice of model commits), but then the final assignment to the Model struct needs to be done with an OnUiThread. And the PostRefreshUpdate too. This should work as long as we make sure we never mutate model slices, only assign new ones.

As I said above, this makes it impossible to do a sync refresh and repaint to ensure you can do multiple ctrl-j in quick succession without glitches; but I suppose this could be solved by introducing a new "UI is blocked" mode where we don't dispatch commands but buffer them until the mode is over. This would allow ctrl-j to use a normal WithWaitingStatus instead of the hacked WithWaitingStatusSync that I added in #2966; I think that would be cleaner anyway.

I'd like to hear your thoughts though before I start, as I expect it to be a pretty huge amount of work. Do you think it's worth trying? Can you foresee any problems that I might be missing?

jesseduffield commented 1 year ago

Centralizing the writing of model data to one place is easy, but centralizing the reading will be very complex. It would be worth just testing it out for a single model e.g. commits and see how many touchpoints there are. I'd also use a read-write mutex which allows multiple concurrent reads so long as there's no writes

stefanhaller commented 1 year ago

but centralizing the reading will be very complex

Why do you think so? Maybe I'm naïve, but my thinking was that whenever a go routine needs to look at model data, it needs to be passed to it in a variable.

My hope would be that this allows us to get rid of the model mutexes altogether.

jesseduffield commented 11 months ago

Sorry forgot to respond to this. My assumption is that lots of code gets run in goroutines and there's lots of model data that needs to be read, meaning it would be hard to pass it all in (and to pass it around to helper functions). I suspect that using a read-write mutex would end up being less complicated, though still noisy.

But like I said it's hard to know without just trying something and seeing how the result looks

jesseduffield / lazygit

Integration test submodule/reset.go often fails #2974