rhalkyard opened 4 months ago
There also seems to be an issue on the transmit side: I'm occasionally seeing segfaults inside SLIRP when RISCiX tries to transmit a fragmented packet. SLIRP appears to try to reassemble it, and segfaults inside `ip_reass()`. Looking at the comments in the SLIRP code, it's not even clear whether that's something that's supposed to work!
As an experiment, I rather crudely chopped out `podules/common/net/slirp` and modified `podules/common/net/net_slirp.c` to interface with libslirp instead. libslirp is maintained by the QEMU project and is available as a package in most, if not all, Linux distros and MinGW. This works a treat: the crashes I was seeing inside `ip_reass()` no longer occur.
These changes (plus the receive queue mutex in the initial issue) are in the libslirp branch of my fork if you'd like to consider them.
Arculator's SLIRP interface appears to have a concurrency issue: the queue structure it uses is not inherently thread-safe, so the main Arculator thread can remove a packet from the queue before the SLIRP thread has finished inserting it. The main thread then reads bogus data and usually crashes.
Any long-running heavy network activity should be able to replicate the issue, but the particular case that caused this for me was NFS activity on an emulated A440/1 under RISCiX. The chance of it occurring is still pretty low, but when reading a large file over NFS, I would usually hit it within 5 or so minutes.
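For illustration, here is a minimal sketch of the locking pattern that closes this kind of race, assuming a singly linked FIFO of heap-allocated packets. `packet_t`, `pkt_queue_t`, `queue_push()`, `queue_pop()` and `queue_destroy()` are hypothetical names, not Arculator's actual structures; the real queue in `net_slirp.c` may look quite different. The key idea is that the producer (SLIRP thread) builds each node completely before taking the lock, and both insertion and removal only touch the list pointers while holding the same mutex, so the consumer (main thread) can never observe a half-inserted node.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical packet node; Arculator's real queue layout may differ. */
typedef struct packet {
    struct packet *next;
    size_t len;
    uint8_t data[];            /* C99 flexible array member holds the frame */
} packet_t;

/* Hypothetical FIFO shared between the SLIRP thread and the main thread. */
typedef struct {
    packet_t *head;
    packet_t *tail;
    pthread_mutex_t lock;      /* serialises all head/tail updates */
} pkt_queue_t;

void queue_init(pkt_queue_t *q)
{
    q->head = q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
}

/* Producer side (SLIRP thread). The node is fully built and copied
   before the lock is taken; only the linking step is inside it. */
int queue_push(pkt_queue_t *q, const uint8_t *buf, size_t len)
{
    packet_t *p = malloc(sizeof(*p) + len);
    if (!p)
        return -1;
    p->next = NULL;
    p->len = len;
    memcpy(p->data, buf, len);

    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Consumer side (main emulator thread). Returns NULL if the queue is
   empty; otherwise the caller owns the packet and must free() it. */
packet_t *queue_pop(pkt_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    packet_t *p = q->head;
    if (p) {
        q->head = p->next;
        if (!q->head)
            q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return p;
}

/* Frees any packets still queued and releases the mutex. */
void queue_destroy(pkt_queue_t *q)
{
    packet_t *p;
    while ((p = queue_pop(q)) != NULL)
        free(p);
    pthread_mutex_destroy(&q->lock);
}
```

Keeping the `malloc()`/`memcpy()` outside the critical section keeps the lock hold time short, which matters when the main thread polls the queue once per emulated frame or tick.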
I hacked a mutex into `podules/common/net/net_slirp.c` to serialise queue insertions and removals. While I'm not sure this is the only place where locking is necessary, it seems to have resolved the crashes I was seeing. Diff is below: