Open quantimnot opened 2 years ago
@quantimnot Have you tried with Nim 2.0? So far I haven't been able to reproduce the bug.
Hey @Clonkk, I was able to reproduce the problem with the latest Nim versions on my macOS 14 M2 computer. I just spent two days of my free time trying to debug the issue, and I must stop myself from spending any more time on this. I don't use this library, and my computing feels kinda rusty at the moment.
Here is a test harness I was working with:
#!/bin/sh
# SPDX-License-Identifier: Unlicense
#[
ZMQLIB_HOMEBREW="$(pkg-config libzmq --variable libdir)"
ZMQLIB_WITH_DEBUG_SYMBOLS="${PORTS:-}/org.zeromq.libzmq/tree/upstream/obj/lib"
ZMQLIB_DIR=${ZMQLIB_HOMEBREW}
case "$(uname)" in
Darwin) export DYLD_LIBRARY_PATH="$ZMQLIB_DIR";;
*) echo "WARNING: THIS ISSUE WAS OBSERVED ON A DARWIN OS.";;
esac
set -eu
COMMON_ARGS="--outdir:. --nimcache:nimcache --threads:on -d:threadsafe"
DEBUGEE_PATH=./issue36
nim $COMMON_ARGS c --debugger:native -d:issue36 -o:"$DEBUGEE_PATH" "$0"
nim $COMMON_ARGS r "$0" "$DEBUGEE_PATH"
exit
]#
when defined issue36:
  import ./tests/tzmq {.all.}
#[
OS: macOS
This test intermittently hangs on both an old Intel Macbook, and an Arm M2 Macbook.
The main thread was blocked on ```libsystem_kernel.dylib`poll```
Whenever the process was interrupted and then continued with lldb, it would give this error:
tzmq.nim(310) tzmq
connections.nim(133) asyncpoll
connections.nim(135) =destroy
Unhandled exception: Connection from/to tcp://127.0.0.1:15571 was destroyed but not closed. [ZmqError]
```
Now, after installing macOS 14.4, the test does NOT hang. It always goes straight to the above error.
]#
  asyncpoll()
else: # run the above "debugee" program through this intermittent hang detector
  #[
  NOTICE: (2024-03-05) This test harness is a work in progress...
  ]#
  import std/[strutils, strformat, times, asyncdispatch, os, osproc]

  const TIMEOUT_DURATION = 50 # Milliseconds

  type
    ProcessId = typeof processID(Process())
    ExecLoopStats = tuple
      totalDuration = default typeof epochTime()
      failCount = 0
      successCount = 0
      totalCount = 0
    HangingProcessHandler = proc(pid: ProcessId)

  proc execCmdWithTimeout(cmd: string, handler: HangingProcessHandler): Future[int] {.async.} =
    var process = startProcess(cmd, options = {poEvalCommand, poParentStreams})
    template exitCode: untyped = result
    proc cb(fd: AsyncFD): bool =
      exitCode = process.waitForExit(0)
    addTimer(TIMEOUT_DURATION, false, cb)

  proc measureExecTime(cmd: string, handler: HangingProcessHandler): Future[tuple[duration: typeof epochTime(), finished: bool]] {.async.} =
    let startTime = epochTime()
    let exitCode = await execCmdWithTimeout(cmd, handler)
    let endTime = epochTime()
    (endTime - startTime, exitCode < 128)

  proc sequentialExecLoop(cmdStr: string, maxAttempts = 1000, handler: HangingProcessHandler): Future[ExecLoopStats] {.async.} =
    for run in 1..maxAttempts:
      let execResult = await measureExecTime(cmdStr, handler)
      inc result.totalCount
      if execResult.finished:
        result.totalDuration += execResult.duration
        inc result.successCount
      else:
        inc result.failCount

  proc main() {.async.} =
    proc handleHang(pid: ProcessId) =
      echo "hang"
    let debuggeeExecPath = paramStr(1)
    let stats = await sequentialExecLoop(debuggeeExecPath, 1000, handleHang)
    if stats.successCount > 0:
      let averageDuration = stats.totalDuration / float stats.successCount
      echo "Average Successful Duration: ", averageDuration
    else:
      echo "All runs timed out."
    echo &"hang ratio: {stats.failCount}/{stats.totalCount}"

  waitFor main()
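Since the Nim side of the harness is still a work in progress, the same "run it, and give up after a timeout" idea can be sketched in plain POSIX shell. This is only an illustration under stated assumptions, not part of the harness: `run_with_timeout` and its `rwt_*` variables are hypothetical names, and the timeout is given in whole seconds rather than the 50 ms used above.

```shell
# Hypothetical helper (not part of the harness above): run a command and
# send it SIGTERM if it is still running after the given number of seconds.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
run_with_timeout() {
  rwt_secs=$1
  shift
  "$@" &
  rwt_pid=$!
  # Watchdog: sleeps, then terminates the command if it is still alive.
  # Its output is discarded so it never holds the caller's stdout open.
  ( sleep "$rwt_secs"; kill -TERM "$rwt_pid" 2>/dev/null ) >/dev/null 2>&1 &
  rwt_watchdog=$!
  # `wait` yields the command's exit status; a value above 128 means the
  # command was killed by a signal (for SIGTERM: 128 + 15 = 143).
  rwt_status=0
  wait "$rwt_pid" || rwt_status=$?
  # Stop the watchdog so it cannot later signal an unrelated, recycled PID.
  kill "$rwt_watchdog" 2>/dev/null || true
  wait "$rwt_watchdog" 2>/dev/null || true
  return "$rwt_status"
}
```

Called as, for example, `run_with_timeout 5 "$DEBUGEE_PATH"`, a return status above 128 could be counted as a hang, which mirrors the `exitCode < 128` check in `measureExecTime`.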
I don't have a Mac, but I'll see if I can reproduce the issue on Linux.
Thanks for the effort.
OK, I had time to understand what happens.
The problem is that in the async callback I used a blocking `receive()` call, under the assumption that the socket would always have a message to receive.
That assumption does not necessarily hold in an async context: if `receive()` is called before the matching `send()` has been called, it blocks.
I will fix the test accordingly and add a docinfo string that documents this.
Result
blocks
Expected
More Information
I just noticed that the issue is intermittent. It sometimes passes and sometimes does not. It tends to work if I stick an `echo` after the `var fut = poller.pollAsync(1)` line in the test. It doesn't return from the `drain(timeout)` call. The problem still occurs when I use status chronos lib instead of asyncdispatch. This is the shim I use to allow chronos use:
Nim Version