Concurrency issue causes crash in SLIRP code

rhalkyard commented 4 months ago

Arculator's SLIRP interface appears to have a concurrency issue - the queue structure that it uses is not inherently thread-safe, and it is possible for the main Arculator thread to remove a packet from the queue before the SLIRP thread has finished inserting it, causing it to read bogus data and usually crash.

Any long-running heavy network activity should be able to replicate the issue, but the particular case that caused this for me was NFS activity on an emulated A440/1 under RISCiX. The chance of it occurring is still pretty low, but when reading a large file over NFS, I would usually hit it within 5 or so minutes.

I hacked a mutex into podules/common/net/net_slirp.c to serialise queue insertions and removals, and while I'm not sure if this is the only place where locking is necessary, it seems to have resolved the crashes I was seeing. Diff is below:

diff --git a/podules/common/net/net_slirp.c b/podules/common/net/net_slirp.c
index 58da9e6..ff1c05b 100644
--- a/podules/common/net/net_slirp.c
+++ b/podules/common/net/net_slirp.c
@@ -13,6 +13,8 @@ typedef struct net_slirp_t
        queueADT slirpq;
 } net_slirp_t;

+static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
+
 static void *slirp_poll_thread(void *p)
 {
        net_t *net = p;
@@ -47,7 +49,9 @@ static void *slirp_poll_thread(void *p)
 static int net_slirp_read(net_t *net, packet_t *packet)
 {
        net_slirp_t *slirp = net->p;
+       int ret = -1;

+       pthread_mutex_lock(&queue_mutex);
        if (QueuePeek(slirp->slirpq) > 0)
        {
                struct queuepacket *qp = QueueDelete(slirp->slirpq);
@@ -56,10 +60,11 @@ static int net_slirp_read(net_t *net, packet_t *packet)
                packet->data = qp->data;
                packet->len = qp->len;

-               return 0;
+               ret = 0;
        }

-       return -1;
+       pthread_mutex_unlock(&mutex);
+       return ret;
 }

 static void net_slirp_write(net_t *net, uint8_t *data, int size)
@@ -80,7 +85,9 @@ void slirp_output(const unsigned char *pkt, int pkt_len)
        p = (struct queuepacket *)malloc(sizeof(struct queuepacket));
        p->len = pkt_len;
        memcpy(p->data, pkt, pkt_len);
+       pthread_mutex_lock(&queue_mutex);
        QueueEnter(*g_slirpq, p);
+       pthread_mutex_unlock(&queue_mutex);
 //        aeh54_log("slirp_output %d @%d\n",pkt_len,p);
 }
 int slirp_can_output(void)

rhalkyard commented 4 months ago

There also seems to be an issue on the transmit side, I'm occasionally seeing segfaults inside SLIRP when RISCiX tries to transmit a fragmented packet - SLIRP appears to try to reassemble it, and segfaults inside ip_reass(). Looking at the comments in the SLIRP code, it's not even clear if that's something that's supposed to work!

rhalkyard commented 4 months ago

As an experiment, I rather crudely chopped out podules/common/net/slirp and modified podules/common/net/net_slirp.c to instead interface with libslirp, which is maintained by the QEMU project and available as a package in most, if not all Linux distros and MinGW. This works a treat, the crashes that I was seeing inside ip_reass() no longer occur.

These changes (plus the receive queue mutex in the initial issue) are in the libslirp branch of my fork if you'd like to consider them.

sarah-walker-pcem / arculator

Concurrency issue causes crash in SLIRP code #48