Closed syrel closed 4 months ago
@tesonep @guillep @Ducasse
Would it be possible to give us some feedback?
In one of our sub projects we hit this problem consistently. We have a fix but we do not understand the reasoning of the original change.
Hi, we will have a look and come back to you. This week only had 2 working days, and they are now over.
Thx.
Hi, the reason for having this is to allow really long idle periods. Idle periods are interrupted by a socket / file operation or by the signalling of a semaphore. In which case are you having issues with the implementation of interruptAIOPoll? write is thread-safe (it does not guarantee order, but we don't care; we just want to put some data in the pipe so that the poll / select is interrupted). It was added to support a long-idle VM. What is the scenario that you are hitting, and how do you arrive at it? We are using it with multiple threads signalling and are not having the issue, e.g., FFI callbacks in an idle VM (a VM that is waiting for a really long time in the relinquishProcessor primitive).
Have you tested that the problem is not a timing issue in your code that signals the semaphore, as this implementation will resume the execution of the VM thread faster than the older one? If you remove this mechanism, the VM thread will not resume immediately; it will resume only after the relinquishProcessor time has ended.
Again, if you can provide us with an example we can look into the issue, but this change has been in place since 2019 and we have not seen problems using it in multi-threaded applications.
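To make the wake-up mechanism described above concrete, here is a minimal sketch of the general "write a byte to a pipe so that poll returns" technique. It is not the Pharo VM's code: the names wake_init, wake_interrupt and wake_idle are hypothetical, and the sketch assumes a POSIX system and deliberately uses no lock around the write.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical names for illustration; not the Pharo VM's actual code. */
static int wake_pipe[2] = {-1, -1};

/* Create the pipe and make both ends non-blocking. Returns 0 on success. */
int wake_init(void) {
    if (pipe(wake_pipe) != 0) return -1;
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(wake_pipe[i], F_GETFL, 0);
        if (flags == -1 || fcntl(wake_pipe[i], F_SETFL, flags | O_NONBLOCK) == -1)
            return -1;
    }
    return 0;
}

/* Wake the idle thread: a one-byte write makes the read end readable.
   If the pipe is full (EAGAIN), a wake-up is already pending, so the
   failed write can be ignored. Only EINTR is retried. */
void wake_interrupt(void) {
    char byte = 1;
    ssize_t n;
    do {
        n = write(wake_pipe[1], &byte, 1);
    } while (n == -1 && errno == EINTR);
}

/* Idle for up to timeout_ms milliseconds.
   Returns 1 if interrupted, 0 on timeout, -1 on error. */
int wake_idle(int timeout_ms) {
    struct pollfd pfd = { .fd = wake_pipe[0], .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0) return rc == 0 ? 0 : -1;
    char buf[64];
    while (read(wake_pipe[0], buf, sizeof buf) > 0) {
        /* drain every pending wake-up byte */
    }
    return 1;
}
```

In this sketch several calls to wake_interrupt before the next wake_idle collapse into a single wake-up, which matches the "we don't care about order" remark: the bytes carry no information beyond "stop idling".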
Hi @tesonep
Thanks for the answer. interruptAIOPoll on OSX has the following line:
interruptFIFOMutex->wait(interruptFIFOMutex)
signalSemaphoreWithIndex will never return if interruptFIFOMutex is not signalled.
When signalSemaphoreWithIndex is called in parallel from both the VM and another thread, it deadlocks on interruptFIFOMutex.
I don't see code that can be interrupted before signalling the semaphore. All the exclusion zones use only variables. Do you have changes to the event handling mechanism of the VM?
Hi @tesonep It can be reproduced in any Pharo since 2019. I have created a repo with a minimal reproducible example: https://github.com/syrel/pharo-vm-804
There are just a few steps (see the Readme.md) and you are good to go. Pharo 10, 11, 12, 13, all deadlock in a similar way.
Hi, thanks for the example. I have reproduced it and I understand what is happening. The problem is with the interaction of signals.
I would recommend replacing the use of signals to communicate with Pharo with external semaphores or callbacks. Making it work requires that signals are handled at safe points of the VM; the OSUnixProcess plugin is not intended for that and might fail when the signal is used frequently.
Sadly, this issue is not a priority for us, so I will not provide a solution in the short term. If this is urgent, please feel free to submit a PR, or contact us to get engineering time through the support of the Consortium (members' engineering time, or a custom contract if that is not enough).
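As an illustration of what "handling signals at safe points" can mean in general, here is a minimal sketch in which the handler only records that a signal arrived, and the real work runs later from ordinary code where taking locks is allowed. The names install_handler, safe_point_poll and count_delivery are hypothetical; this is not how the VM or the OSUnixProcess plugin is implemented.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical names for illustration; not the Pharo VM's actual code. */
static volatile sig_atomic_t pending_signal = 0;
static int delivered = 0;

/* The handler does nothing but set a flag, so it never touches a lock
   that the interrupted thread might already hold. */
static void on_signal(int signo) {
    (void)signo;
    pending_signal = 1;
}

/* Install the flag-setting handler for signo. Returns 0 on success. */
int install_handler(int signo) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Example delivery routine: stands in for signalling a semaphore. */
static void count_delivery(void) {
    delivered++;
}

/* Called from normal code at a safe point. Runs deliver() once if at least
   one signal arrived since the last call. Returns 1 if it delivered, else 0. */
int safe_point_poll(void (*deliver)(void)) {
    if (!pending_signal) return 0;
    pending_signal = 0;
    deliver();
    return 1;
}
```

Several signals that arrive between two safe points are coalesced into one delivery here; a design that must count every signal would need a different mechanism.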
Hi @tesonep Thank you for the explanation 👍 I think we can close this
Thanks for the example. It helped in understanding the problem.
Hi
https://github.com/pharo-project/pharo-vm/commit/ce69c3eec32b013b4b4f67ed306d70c831157555 introduced a call to interruptAIOPoll in signalSemaphoreWithIndex. The problem is that it can deadlock when signalling semaphores from multiple threads. https://github.com/pharo-project/pharo-vm/blob/ce69c3eec32b013b4b4f67ed306d70c831157555/extracted/vm/src/common/sqExternalSemaphores.c#L195
signalSemaphoreWithIndex should be safe to use from multiple threads and even signal the same semaphore.
I couldn't find an issue for the original change. What was the intention for the change?
The implementation in opensmalltalk-vm doesn't call interruptAIOPoll. This is because interruptAIOPoll is not thread safe and shouldn't be used in signalSemaphoreWithIndex.
Cheers