I have a very simple fix working that maintains the reference counting. We change the `refs` field from `uint64` to `int64` and use `atomic.AddInt64` to avoid the race/deadlock.

We simply allow the counter to go below zero (which shouldn't happen anyway), so we don't have to do anything complicated for such a simple check. Any `refs` count of 0 or below is treated as completed.

What do you think?

https://github.com/ansoni-san/winrt-go/commit/7da1c6b01f401b1f502cd788426d8075af7b0c26
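Roughly, the idea looks like this (a simplified sketch with hypothetical names, not the exact diff from the commit above):

```go
import "sync/atomic"

type handler struct {
	refs int64 // allowed to dip below zero; any value <= 0 is treated as completed
}

func (h *handler) addRef() int64 {
	return atomic.AddInt64(&h.refs, 1)
}

func (h *handler) removeRef() int64 {
	rem := atomic.AddInt64(&h.refs, -1)
	if rem <= 0 {
		h.free() // hypothetical cleanup once no live references remain
	}
	return rem
}

func (h *handler) free() {
	// release the C-allocated instance here
}
```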
Thanks @ansoni-san for the issue and the detailed report! As this requires a deeper look, it's going to take some more time. Will keep you posted.
Thanks for looking into this.
Just to clarify the above, although it may be obvious, this is for write requests (write with response).
Thanks.
The thing that bothers me is that the error is happening in the `NewAsyncOperationCompletedHandler` function, not in a callback.

The `sync: inconsistent mutex state` error is usually related to memory corruption. But the only thing we are doing there is calling the `C.malloc` function and then calling the `Lock` that fails, without checking whether the `malloc` failed or not.

Could you send us a simple example that replicates the issue? I would like to understand what the underlying problem is.
Ah. I assumed the stack trace was wrong. Interesting.
I'll debug further to understand it and then see if I can reliably trigger it in a small example application.
Thanks Jagoba.
I think the problem here is that `winrt_AsyncOperationCompletedHandler_Release` and `winrt_AsyncOperationCompletedHandler_AddRef` are exported, so they will be called directly from outside of Go-land.

Go may not be able to control when the calls back into that code happen, making the locking/unlocking inconsistent. I think `Mutex` locking may be dangerous/racy in these kinds of system callbacks.
The stack trace you provided does not involve any callbacks. Could you tell us the specs of the computer you are running this on?
If I limit the memory available to Go, I can consistently reproduce the issue. I'm running this code with the `GOMEMLIMIT="10KiB"` environment variable:
```go
package main

import (
	"github.com/go-ole/go-ole"
	"github.com/saltosystems/winrt-go/windows/foundation"
)

func main() {
	g := ole.NewGUID(foundation.GUIDIAsyncOperation)
	for i := 0; i < 100_000; i++ {
		a := foundation.NewAsyncOperationCompletedHandler(g, func(instance *foundation.AsyncOperationCompletedHandler, asyncInfo *foundation.IAsyncOperation, asyncStatus foundation.AsyncStatus) {
		})
		defer a.Release()
	}
	println("This will fail before reaching this point!")
}
```
And it consistently fails with the exact same stack trace you provided:
```
fatal error: sync: inconsistent mutex state

goroutine 1 [running]:
runtime.throw({0x7ff6d704dbe8?, 0x33b00000001e?})
	C:/Program Files/Go/src/runtime/panic.go:1047 +0x65 fp=0xc000059e48 sp=0xc000059e18 pc=0x7ff6d6fb4905
sync.throw({0x7ff6d704dbe8?, 0x200000003?})
	C:/Program Files/Go/src/runtime/panic.go:1026 +0x1e fp=0xc000059e68 sp=0xc000059e48 pc=0x7ff6d6fd9cde
sync.(*Mutex).lockSlow(0x1377eb03418)
	C:/Program Files/Go/src/sync/mutex.go:158 +0xe5 fp=0xc000059eb8 sp=0xc000059e68 pc=0x7ff6d6fe4925
sync.(*Mutex).Lock(...)
	C:/Program Files/Go/src/sync/mutex.go:90
github.com/saltosystems/winrt-go/windows/foundation.(*AsyncOperationCompletedHandler).addRef(0x7ff6d70390c0?)
	C:/Users/jagoba/go/pkg/mod/github.com/saltosystems/winrt-go@v0.0.0-20230510070731-e096b9afa761/windows/foundation/asyncoperationcompletedhandler.go:76 +0x5b fp=0xc000059f08 sp=0xc000059eb8 pc=0x7ff6d700873b
github.com/saltosystems/winrt-go/windows/foundation.NewAsyncOperationCompletedHandler(0xc00014c020, 0x7ff6d7051878)
	C:/Users/jagoba/go/pkg/mod/github.com/saltosystems/winrt-go@v0.0.0-20230510070731-e096b9afa761/windows/foundation/asyncoperationcompletedhandler.go:70 +0xd1 fp=0xc000059f38 sp=0xc000059f08 pc=0x7ff6d70086b1
main.fails()
	C:/Users/jagoba/prueba-winrt/main.go:26 +0x5a fp=0xc000059f70 sp=0xc000059f38 pc=0x7ff6d700971a
main.main()
	C:/Users/jagoba/prueba-winrt/main.go:10 +0x1c fp=0xc000059f80 sp=0xc000059f70 pc=0x7ff6d70095fc
runtime.main()
	C:/Program Files/Go/src/runtime/proc.go:250 +0x1f7 fp=0xc000059fe0 sp=0xc000059f80 pc=0x7ff6d6fb7077
runtime.goexit()
	C:/Program Files/Go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000059fe8 sp=0xc000059fe0 pc=0x7ff6d6fde261
```
And if I remove the `defer` from the `Release()`, the error disappears.
Thanks for the details @jagobagascon.
When `defer` is called within a loop, the deferred calls are queued and only run when the enclosing function returns, so a large number of chained defers accumulates until we hit a Go limit. This is common to all Go applications and is not specific to winrt-go.
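For illustration, with a hypothetical `newHandler` constructor:

```go
// Deferred calls are queued until the enclosing function returns, so inside
// a loop nothing is released until the very end:
for i := 0; i < 100_000; i++ {
	h := newHandler()
	defer h.Release() // 100,000 pending releases pile up here
}

// Releasing inside the loop body frees each handler immediately:
for i := 0; i < 100_000; i++ {
	h := newHandler()
	h.Release()
}
```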
So, @ansoni-san, would you mind giving us more details about this and the specific use case?
Hi All,
Thanks for the responses!
The machine is an i7-9750H with 16 GB of RAM, with about 9 GB free when this occurs. Go version is 1.18 with no memory limits set. The process is using less than 100 MiB of memory when this occurs (it happens reliably but at random timing, sometimes within 2 seconds, sometimes after 40 seconds or more).
I am sure this must have to do with how the WinRT side jumps back into the callbacks in real usage with a device. A simple mutex that is locked at multiple points in the same asynchronous flow may be dangerous, as with interrupts, since we pass a pointer to the completed handler into the syscall. Memory corruption would make sense in that case.
From my testing it is triggered by the timing of the callbacks: it can sometimes happen very fast, or it can take a while. There is no pattern I can see related to memory usage (there is virtually no noticeable memory usage).
Have you tried testing with a real device, using `*GattCharacteristic.WriteValueWithOptionAsync` and then creating a completion handler on it, repeated many times per second?
The issue can occur very fast, or it can sometimes take a while; it takes a different amount of time on every run for the same sequence of calls. This seems to indicate a timing issue that causes the memory corruption.
Edit: I ran your test and it completes successfully. I also ran a version modified to be closer to my use case, where every call happens in a new goroutine. This also completes successfully:
```go
package main

import (
	"sync"

	"github.com/go-ole/go-ole"
	"github.com/saltosystems/winrt-go/windows/foundation"
)

func main() {
	g := ole.NewGUID(foundation.GUIDIAsyncOperation)
	wg := sync.WaitGroup{}
	for i := 0; i < 100_000; i++ {
		wg.Add(1)
		go func() {
			a := foundation.NewAsyncOperationCompletedHandler(g, func(instance *foundation.AsyncOperationCompletedHandler, asyncInfo *foundation.IAsyncOperation, asyncStatus foundation.AsyncStatus) {
			})
			defer a.Release()
			defer wg.Done()
		}()
	}
	wg.Wait()
	println("This will fail before reaching this point!")
}
```
I think that to reproduce this you would need to be actively interacting with WinRT: calling `*GattCharacteristic.WriteValueWithOptionAsync` and then creating a completed handler many times a second, like tinygo-org/bluetooth does.
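A rough sketch of that pattern (the `await` helper is hypothetical, and the exact `SetCompleted` signature is an assumption about the generated API, not code from either library):

```go
// Attach a completion handler to a pending WinRT async operation (e.g. the
// one returned by WriteValueWithOptionAsync) and block until it fires.
func await(op *foundation.IAsyncOperation) error {
	done := make(chan struct{})
	handler := foundation.NewAsyncOperationCompletedHandler(
		ole.NewGUID(foundation.GUIDIAsyncOperation),
		func(_ *foundation.AsyncOperationCompletedHandler, _ *foundation.IAsyncOperation, _ foundation.AsyncStatus) {
			close(done) // invoked from a WinRT thread on completion
		},
	)
	defer handler.Release()
	if err := op.SetCompleted(handler); err != nil {
		return err
	}
	<-done
	return nil
}
```

Repeating this many times per second is the usage that triggers the error for me.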
Ultimately the lock is unnecessary for such a simple counter, as it can be maintained atomically, which makes the runtime behaviour more predictable and stable for this type of API. It would be nice to figure out why it's causing an issue, though. Everything works perfectly when making high-frequency calls without this locking.
@ansoni-san I ran your code and it was failing for me, even without the defers. So I started debugging, and I'm quite sure I found the issue.
The WinRT delegates are being allocated on the heap using `malloc`: https://github.com/saltosystems/winrt-go/blob/2c400ab3bdecc9b81db1ccff829d24c00f059d79/windows/foundation/asyncoperationcompletedhandler.go#L67-L69
But we are not explicitly initialising all the properties, which means that their initial values may contain garbage from the allocated memory. This can cause the `refs` counter and the `Mutex` itself to be initialised with random garbage data, which would explain the memory corruption and the `inconsistent mutex state` error.
After initialising these fields, the error stops happening:

```go
inst.Mutex = sync.Mutex{}
inst.refs = 0
```
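For anyone following along, a minimal sketch of the failure mode (type names simplified, not the actual generated code):

```go
package main

/*
#include <stdlib.h>
*/
import "C"

import (
	"sync"
	"unsafe"
)

type delegate struct {
	sync.Mutex
	refs uint64
}

// Unlike Go's new(), C.malloc does not zero the memory it returns, so every
// field starts out as whatever bytes happened to be at that address.
func newDelegate() *delegate {
	inst := (*delegate)(C.malloc(C.size_t(unsafe.Sizeof(delegate{}))))
	// Without this explicit initialisation, Mutex and refs contain garbage,
	// which is what produced the "sync: inconsistent mutex state" error.
	inst.Mutex = sync.Mutex{}
	inst.refs = 0
	return inst
}
```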
I opened a PR with the change: https://github.com/saltosystems/winrt-go/pull/74

@ansoni-san, if you could try this branch in your code before we merge it, that would be awesome; that way we can make sure this is what's causing all your errors.
Aah yes! I did think of something like this before, but I was looking at whether the pointer to the zero value could be shared (Go doesn't guarantee their uniqueness; in fact, I think it likes to reuse them). But this makes much more sense.

Thank you for looking into this!
I will test this out this evening.
@jagobagascon Yes, this fixes it. I didn't notice the `malloc`.

Thanks for fixing this. We can drop our internal version now.
Thank you @ansoni-san for the detailed report and follow-ups
Thank you for the help @ansoni-san !
I already opened a PR in https://github.com/tinygo-org/bluetooth/pull/173 to bump the winrt-go library to the latest version.
Currently running into this issue: by sending many commands per second at differing intervals, across a few different timing patterns, I consistently hit this error after about 20-60 seconds.

Removing the locking from the `AsyncOperationCompletedHandler` `addRef` and `removeRef` solves the problem. Wondering if there is a better way to handle this without any locking (since I imagine it is not safe in this kind of asynchronous system callback).