The host uses `fwd_cnt` and `buf_alloc` obtained from the guest's vsock packets for its buffer management (deciding whether to send the next packet to the guest or to back off). We previously implemented these incorrectly as variables global to all virtual sockets; in reality, they are per-connection (per-socket). This mismatch between what the host expects and what the Gramine guest implements led to hangs: on new connections, Gramine reported too many "bytes received" in the `fwd_cnt` counter, leading the host to believe there were too many packets in flight on the new connection and to back off constantly.
Fix by moving these variables to fields of `struct virtio_vsock_connection`. An additional benefit is that we already have proper locking for per-connection fields, so no atomics or special synchronization are needed.
Description of the changes
Interestingly, Linux prior to v6.7 did not expose this Gramine bug because it had its own integer wrap-around bug. That was fixed with commit https://github.com/torvalds/linux/commit/60316d7f10b17a in v6.7.
Fixes #24.
How to test this PR?
Run Redis or Memcached on Linux v6.7 or higher; I tested on v6.9.