I have a very simple fix working that maintains the reference counting. We change the `refs` field from `uint64` to `int64` and use `atomic.AddInt64` to avoid the race/deadlock.

We simply allow the counter to go below zero (which shouldn't happen anyway), so we don't have to do anything complicated for such a simple check. Any `refs` count of 0 or below is treated as completed.

What do you think?

https://github.com/ansoni-san/winrt-go/commit/7da1c6b01f401b1f502cd788426d8075af7b0c26
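Roughly, the idea looks like this (a simplified sketch with hypothetical names, not the exact diff from the commit above):

```go
import "sync/atomic"

type handler struct {
	refs int64 // allowed to dip below zero; any value <= 0 is treated as completed
}

func (h *handler) addRef() int64 {
	return atomic.AddInt64(&h.refs, 1)
}

func (h *handler) removeRef() int64 {
	rem := atomic.AddInt64(&h.refs, -1)
	if rem <= 0 {
		h.free() // hypothetical cleanup once no live references remain
	}
	return rem
}

func (h *handler) free() {
	// release the C-allocated instance here
}
```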
Thanks @ansoni-san for the issue and the detailed report! As this requires a deeper look, it's going to take some more time. Will keep you posted.
Thanks for looking into this.
Just to clarify the above, although it may be obvious, this is for write requests (write with response).
Thanks.
The thing that bothers me is that the error is happening in the `NewAsyncOperationCompletedHandler` function, not in a callback.

The `sync: inconsistent mutex state` error is usually related to memory corruption. But the only thing we are doing there is calling the `C.malloc` function and then calling the `Lock` that fails, without checking whether the `malloc` failed or not.

Could you send us a simple example that replicates the issue? I would like to understand what the underlying problem is.
Ah. I assumed the stack trace was wrong. Interesting.
I'll debug further to understand it and then see if I can reliably trigger it in a small example application.
Thanks Jagoba.
I think the problem here is that `winrt_AsyncOperationCompletedHandler_Release` and `winrt_AsyncOperationCompletedHandler_AddRef` are exported, so they will be called directly from outside of Go-land.

Go may not be able to control when the calls back into that code happen, making the locking/unlocking inconsistent. I think `Mutex` locking may be dangerous/racy in these kinds of system callbacks.
The stack trace you provided does not involve any callbacks. Could you tell us the specs of the computer you are running this on?
If I limit the memory available to Go, I can consistently reproduce the issue. I'm running this code with the `GOMEMLIMIT="10KiB"` environment variable:
```go
package main

import (
	"github.com/go-ole/go-ole"
	"github.com/saltosystems/winrt-go/windows/foundation"
)

func main() {
	g := ole.NewGUID(foundation.GUIDIAsyncOperation)
	for i := 0; i < 100_000; i++ {
		a := foundation.NewAsyncOperationCompletedHandler(g, func(instance *foundation.AsyncOperationCompletedHandler, asyncInfo *foundation.IAsyncOperation, asyncStatus foundation.AsyncStatus) {
		})
		defer a.Release()
	}
	println("This will fail before reaching this point!")
}
```
And it consistently fails with the exact same stack trace you provided:
```
fatal error: sync: inconsistent mutex state

goroutine 1 [running]:
runtime.throw({0x7ff6d704dbe8?, 0x33b00000001e?})
	C:/Program Files/Go/src/runtime/panic.go:1047 +0x65 fp=0xc000059e48 sp=0xc000059e18 pc=0x7ff6d6fb4905
sync.throw({0x7ff6d704dbe8?, 0x200000003?})
	C:/Program Files/Go/src/runtime/panic.go:1026 +0x1e fp=0xc000059e68 sp=0xc000059e48 pc=0x7ff6d6fd9cde
sync.(*Mutex).lockSlow(0x1377eb03418)
	C:/Program Files/Go/src/sync/mutex.go:158 +0xe5 fp=0xc000059eb8 sp=0xc000059e68 pc=0x7ff6d6fe4925
sync.(*Mutex).Lock(...)
	C:/Program Files/Go/src/sync/mutex.go:90
github.com/saltosystems/winrt-go/windows/foundation.(*AsyncOperationCompletedHandler).addRef(0x7ff6d70390c0?)
	C:/Users/jagoba/go/pkg/mod/github.com/saltosystems/winrt-go@v0.0.0-20230510070731-e096b9afa761/windows/foundation/asyncoperationcompletedhandler.go:76 +0x5b fp=0xc000059f08 sp=0xc000059eb8 pc=0x7ff6d700873b
github.com/saltosystems/winrt-go/windows/foundation.NewAsyncOperationCompletedHandler(0xc00014c020, 0x7ff6d7051878)
	C:/Users/jagoba/go/pkg/mod/github.com/saltosystems/winrt-go@v0.0.0-20230510070731-e096b9afa761/windows/foundation/asyncoperationcompletedhandler.go:70 +0xd1 fp=0xc000059f38 sp=0xc000059f08 pc=0x7ff6d70086b1
main.fails()
	C:/Users/jagoba/prueba-winrt/main.go:26 +0x5a fp=0xc000059f70 sp=0xc000059f38 pc=0x7ff6d700971a
main.main()
	C:/Users/jagoba/prueba-winrt/main.go:10 +0x1c fp=0xc000059f80 sp=0xc000059f70 pc=0x7ff6d70095fc
runtime.main()
	C:/Program Files/Go/src/runtime/proc.go:250 +0x1f7 fp=0xc000059fe0 sp=0xc000059f80 pc=0x7ff6d6fb7077
runtime.goexit()
	C:/Program Files/Go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000059fe8 sp=0xc000059fe0 pc=0x7ff6d6fde261
```
And if I remove the `defer` from the `Release()`, the error disappears.
Thanks for the details @jagobagascon.
When `defer` is called within a loop, the deferred calls are queued and only run when the enclosing function returns, so a large number of chained defers accumulates until we hit a Go limit. This is common to all Go applications and is not specific to winrt-go.
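For illustration, with a hypothetical `newHandler` constructor:

```go
// Deferred calls are queued until the enclosing function returns, so inside
// a loop nothing is released until the very end:
for i := 0; i < 100_000; i++ {
	h := newHandler()
	defer h.Release() // 100,000 pending releases pile up here
}

// Releasing inside the loop body frees each handler immediately:
for i := 0; i < 100_000; i++ {
	h := newHandler()
	h.Release()
}
```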
So, @ansoni-san, would you mind giving us more details about this and the specific use case?
Hi All,
Thanks for the responses!
The machine is an i7-9750H with 16 GB of RAM, with about 9 GB free when this occurs. Go version is 1.18 with no memory limits set. The process is using less than 100 MiB of memory when this occurs (it happens reliably but at random timing, sometimes within 2 seconds, sometimes after 40 seconds or more).
I am sure this must have to do with how the WinRT side jumps back into the callbacks in real usage with a device. A simple mutex that is locked at multiple points in the same asynchronous flow may be dangerous, as with interrupts, since we pass a pointer to the completed handler into the syscall. Memory corruption would make sense in that case.
From my testing it is triggered by the timing of the callbacks: it can sometimes happen very fast, or it can take a while. There is no pattern I can see related to memory usage (there is virtually no noticeable memory usage).
Have you tried testing with a real device, using `*GattCharacteristic.WriteValueWithOptionAsync` and then creating a completion handler on it, repeated many times per second?
The issue can occur very fast, or it can sometimes take a while; it takes a different amount of time on every run for the same sequence of calls. This seems to indicate a timing issue that causes the memory corruption.
Edit: I ran your test and it completes successfully. I also ran a version modified to be closer to my use case, where every call happens in a new goroutine. This also completes successfully:
```go
package main

import (
	"sync"

	"github.com/go-ole/go-ole"
	"github.com/saltosystems/winrt-go/windows/foundation"
)

func main() {
	g := ole.NewGUID(foundation.GUIDIAsyncOperation)
	wg := sync.WaitGroup{}
	for i := 0; i < 100_000; i++ {
		wg.Add(1)
		go func() {
			a := foundation.NewAsyncOperationCompletedHandler(g, func(instance *foundation.AsyncOperationCompletedHandler, asyncInfo *foundation.IAsyncOperation, asyncStatus foundation.AsyncStatus) {
			})
			defer a.Release()
			defer wg.Done()
		}()
	}
	wg.Wait()
	println("This will fail before reaching this point!")
}
```
I think that to reproduce this you would need to be actively interacting with WinRT: calling `*GattCharacteristic.WriteValueWithOptionAsync` and then creating a completed handler many times a second, like tinygo-org/bluetooth does.
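A rough sketch of that pattern (the `await` helper is hypothetical, and the exact `SetCompleted` signature is an assumption about the generated API, not code from either library):

```go
// Attach a completion handler to a pending WinRT async operation (e.g. the
// one returned by WriteValueWithOptionAsync) and block until it fires.
func await(op *foundation.IAsyncOperation) error {
	done := make(chan struct{})
	handler := foundation.NewAsyncOperationCompletedHandler(
		ole.NewGUID(foundation.GUIDIAsyncOperation),
		func(_ *foundation.AsyncOperationCompletedHandler, _ *foundation.IAsyncOperation, _ foundation.AsyncStatus) {
			close(done) // invoked from a WinRT thread on completion
		},
	)
	defer handler.Release()
	if err := op.SetCompleted(handler); err != nil {
		return err
	}
	<-done
	return nil
}
```

Repeating this many times per second is the usage that triggers the error for me.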
Ultimately the lock is unnecessary for such a simple counter, as it can be maintained atomically, which makes the runtime behaviour more predictable and stable for this type of API. It would be nice to figure out why it's causing an issue, though. Everything works perfectly when making high-frequency calls without this locking.
@ansoni-san I ran your code and it was failing for me, even without the defers. So I started debugging, and I'm quite sure I found the issue.
The WinRT delegates are being allocated on the heap using `malloc`: https://github.com/saltosystems/winrt-go/blob/2c400ab3bdecc9b81db1ccff829d24c00f059d79/windows/foundation/asyncoperationcompletedhandler.go#L67-L69
But we are not explicitly initialising all the properties, which means that their initial values may contain garbage from the allocated memory. This can cause the `refs` counter and the `Mutex` itself to be initialised with random garbage data, which would explain the memory corruption and the `inconsistent mutex state` error.
After initialising these fields, the error stops happening:

```go
inst.Mutex = sync.Mutex{}
inst.refs = 0
```
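For anyone following along, a minimal sketch of the failure mode (type names simplified, not the actual generated code):

```go
package main

/*
#include <stdlib.h>
*/
import "C"

import (
	"sync"
	"unsafe"
)

type delegate struct {
	sync.Mutex
	refs uint64
}

// Unlike Go's new(), C.malloc does not zero the memory it returns, so every
// field starts out as whatever bytes happened to be at that address.
func newDelegate() *delegate {
	inst := (*delegate)(C.malloc(C.size_t(unsafe.Sizeof(delegate{}))))
	// Without this explicit initialisation, Mutex and refs contain garbage,
	// which is what produced the "sync: inconsistent mutex state" error.
	inst.Mutex = sync.Mutex{}
	inst.refs = 0
	return inst
}
```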
I opened a PR with the change: https://github.com/saltosystems/winrt-go/pull/74

@ansoni-san, if you could try this branch in your code before we merge it, that would be awesome; that way we can make sure this is what's causing all your errors.
Aah yes! I did think of something like this before, but I was looking at whether the pointer to the zero value could be shared (Go doesn't guarantee their uniqueness; in fact, I think it likes to reuse them). But this makes much more sense.

Thank you for looking into this!
I will test this out this evening.
@jagobagascon Yes, this fixes it. I didn't notice the `malloc`.

Thanks for fixing this. We can drop our internal version now.
Thank you @ansoni-san for the detailed report and follow-ups
Thank you for the help @ansoni-san !
I already opened a PR in https://github.com/tinygo-org/bluetooth/pull/173 to bump the winrt-go library to the latest version.
Currently running into this issue: by sending many commands per second at differing intervals, across a few different timing patterns, I consistently hit this error after about 20-60 seconds.

Removing the locking from the `AsyncOperationCompletedHandler` `addRef` and `removeRef` solves the problem. Wondering if there is a better way to handle this without any locking (since I imagine it is not safe in this kind of asynchronous system callback).