These issues were discovered while trying to test the current implementation of sibling VM communication in vhost-user-vsock. The testing was done with iperf-vsock and nc-vsock, both patched to set .svm_flags = VMADDR_FLAG_TO_HOST.
Issues
Deadlock
If you try to test the sibling communication by running iperf-vsock or transferring big files with nc-vsock, the vhost-user-vsock process hangs and becomes completely unresponsive. After a bit of debugging, I discovered that there is a deadlock.
The deadlock occurs when two sibling VMs simultaneously try to send each other packets. The VhostUserVsockThread instances corresponding to the two VMs each hold their own lock while executing thread_backend.send_pkt, and then each tries to acquire the other's lock to access its counterpart's raw_pkts_queue. This ultimately results in a deadlock.
The deadlock can be resolved by separating the mutex over raw_pkts_queue from the mutex over VhostUserVsockThread.
Raw packets queue not being processed completely
Even after resolving the deadlock, the vhost-user-vsock process still hangs while testing, though it is not completely unresponsive this time. It turns out that sometimes the raw packets pending on the raw_pkts_queue are never processed, resulting in the hang.
This happens because currently, the raw_pkts_queue is processed only when a SIBLING_VM_EVENT is received. But it may happen that the raw_pkts_queue cannot be processed completely due to insufficient space in the RX virtqueue at that moment.
This can be resolved by also trying to process raw packets on other events, similar to how the RX of standard packets is handled.
Current status
While fixing the above two issues seems to make nc-vsock run flawlessly, testing with iperf-vsock still results in the vhost-user-vsock process hanging. There might be a notification problem, which could be related to the EVENT_IDX feature.
While #385 resolves the deadlock and the problem of the raw packets queue not being processed completely, iperf-vsock still doesn't work. The following could be the reasons for that:
There might be a notification problem inside the vhost-user-vsock application
It could be due to the way iperf works internally
It might have something to do with vsock credit updates