Closed dorjoy03 closed 2 months ago
I see a few checks have failed with the same 4 tests that I added which require connecting/listening using AF_VSOCK with error "Operation not permitted" etc. The tests pass when I run them in a debian docker container. So I am guessing something needs to be updated about the VMs that run these tests.
I see a few checks have failed with the same 4 tests that I added which require connecting/listening using AF_VSOCK with error "Operation not permitted" etc. The tests pass when I run them in a debian docker container. So I am guessing something needs to be updated about the VMs that run these tests.
@andreeaflorescu do you know if our CI runners has vsock_loopback
kernel module available?
It should be automatically loaded when opening an AF_VSOCK socket, and it provides the CID 1 handling, so it may be the issue if it's not available, since we can't bind CID 1 IIUC.
I see a few checks have failed with the same 4 tests that I added which require connecting/listening using AF_VSOCK with error "Operation not permitted" etc. The tests pass when I run them in a debian docker container. So I am guessing something needs to be updated about the VMs that run these tests.
@andreeaflorescu do you know if our CI runners has
vsock_loopback
kernel module available?It should be automatically loaded when opening an AF_VSOCK socket, and it provides the CID 1 handling, so it may be the issue if it's not available, since we can't bind CID 1 IIUC.
Indeed on my laptop tests are working and I have this behaviour:
$ lsmod | grep vsock_loopback
$ cargo test -p vhost-device-vsock
...
test result: ok. 38 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.01s
$ lsmod | grep vsock_loopback
vsock_loopback 12288 0
vmw_vsock_virtio_transport_common 57344 1 vsock_loopback
vsock 61440 3 vmw_vsock_virtio_transport_common,vsock_loopback,vmw_vsock_vmci_transport
If we can't enabled that kernel module, maybe we should just skip the test if we can't bind CID 1.
@stefano-garzarella I have now addressed some of the comments but I needed some input or had some questions about the others. I have not yet updated README and changelog but will do so after I get through all the reviews at the end. Can you please do another round of review and help me with the other reviews? Thanks!
I have now gone through the reviews and readme and changelog is updated as well. Thanks for the reviews @stefano-garzarella
Interestingly, the 'coverage-x86_64' run started failing but it didn't fail last time so to check if it was my latest push that did something, I reset the branch to the old commit (with a whitespace change to trigger the CI) and the 'coverage-x86_64' failed again which didn't fail before. So I am not sure why the new failure occurred. Did anything change on the CI side?
I have now gone through the reviews and readme and changelog is updated as well. Thanks for the reviews @stefano-garzarella
Interestingly, the 'coverage-x86_64' run started failing but it didn't fail last time so to check if it was my latest push that did something, I reset the branch to the old commit (with a whitespace change to trigger the CI) and the 'coverage-x86_64' failed again which didn't fail before. So I am not sure why the new failure occurred. Did anything change on the CI side?
We updated the CI, but we shouldn't touch anything about coverage, so this could be related to the test failure, since coverage runs the test, so if some tests are failing, I think also the coverage can't be calculated.
What about adding a third commit where we skip the new vsock tests if we can't create AF_VSOCK sockets? I suggest a separate commit, so we can revert it when we fix our CI.
We updated the CI, but we shouldn't touch anything about coverage, so this could be related to the test failure, since coverage runs the test, so if some tests are failing, I think also the coverage can't be calculated.
What about adding a third commit where we skip the new vsock tests if we can't create AF_VSOCK sockets?
Yeah I can look into skipping the tests but did you get a chance to look into fixing the CI so that we are able to run them? Running them in a standard debian docker container works fine for me.
We updated the CI, but we shouldn't touch anything about coverage, so this could be related to the test failure, since coverage runs the test, so if some tests are failing, I think also the coverage can't be calculated. What about adding a third commit where we skip the new vsock tests if we can't create AF_VSOCK sockets?
Yeah I can look into skipping the tests but did you get a chance to look into fixing the CI so that we are able to run them? Running them in a standard debian docker container works fine for me.
It depends on the host kernel as I mentioned here: https://github.com/rust-vmm/vhost-device/pull/706#issuecomment-2321312778
Our CI runners are provided by AWS, so we need some input from them. @roypat @andreeaflorescu do you have any advice?
@stefano-garzarella I have now introduced the "proxy" feature and made it default and "ignore"d the 4 tests that were failing in the CI. A lot of changes so please re-review all the commits.
I think in the CI for now we should make a change to run both "cargo build --release --no-default-features" "cargo test --no-default-features" and "cargo build --release" "cargo test" i.e., run without the feature and with the feature so that we cover everything. Please run locally with the feature turned off and on to verify that I haven't broken anything. Thanks!
@stefano-garzarella I have now introduced the "proxy" feature and made it default and "ignore"d the 4 tests that were failing in the CI. A lot of changes so please re-review all the commits.
Done, thanks!
I think in the CI for now we should make a change to run both "cargo build --release --no-default-features" "cargo test --no-default-features" and "cargo build --release" "cargo test" i.e., run without the feature and with the feature so that we cover everything. Please run locally with the feature turned off and on to verify that I haven't broken anything. Thanks!
Yeah, we have an issue opened in rust-vmm-ci: https://github.com/rust-vmm/rust-vmm-ci/issues/152 If you have time/desire, feel free to take a look at it.
We updated the CI, but we shouldn't touch anything about coverage, so this could be related to the test failure, since coverage runs the test, so if some tests are failing, I think also the coverage can't be calculated. What about adding a third commit where we skip the new vsock tests if we can't create AF_VSOCK sockets?
Yeah I can look into skipping the tests but did you get a chance to look into fixing the CI so that we are able to run them? Running them in a standard debian docker container works fine for me.
It depends on the host kernel as I mentioned here: #706 (comment)
Our CI runners are provided by AWS, so we need some input from them. @roypat @andreeaflorescu do you have any advice?
Mh, I just connected to one of our CI runners, and it seems the module is available:
$ find /lib/modules/$(uname -r) -type f -name '*.ko' | grep vsock
/lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vmw_vsock_vmci_transport.ko
/lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vmw_vsock_virtio_transport_common.ko
/lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vsock_loopback.ko <------
/lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vmw_vsock_virtio_transport.ko
/lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/hv_sock.ko
/lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vsock.ko
/lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vsock_diag.ko
/lib/modules/6.5.0-1024-aws/kernel/drivers/vhost/vhost_vsock.ko
maybe it's something to do with docker? :thinking:
Mh, I just connected to one of our CI runners, and it seems the module is available:
@roypat thanks for checking!
$ find /lib/modules/$(uname -r) -type f -name '*.ko' | grep vsock /lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vmw_vsock_vmci_transport.ko /lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vmw_vsock_virtio_transport_common.ko /lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vsock_loopback.ko <------ /lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vmw_vsock_virtio_transport.ko /lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/hv_sock.ko /lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vsock.ko /lib/modules/6.5.0-1024-aws/kernel/net/vmw_vsock/vsock_diag.ko /lib/modules/6.5.0-1024-aws/kernel/drivers/vhost/vhost_vsock.ko
maybe it's something to do with docker? 🤔
yeah, it could be.
Can we make sure they are loaded? Theoretically if they're not loaded, the first time an AF_VSOCK socket is created, the kernel loads af_vsock and vsock_loopback automatically, but it's possible that being inside a docker, this is not allowed.
One way to test it, is the following:
# start the container
# run this in the container
python3 -c "import socket; s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM); s.bind((1, socket.VMADDR_PORT_ANY))"
Thanks. Good stuff!
python3 -c "import socket; s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM); s.bind((1, socket.VMADDR_PORT_ANY))"
Sorry for the delay, finally got around to testing this. I got the following result:
$ lsmod | grep vsock
$ sudo docker run --device=/dev/kvm -it --rm --volume $(pwd):/workdir --workdir /workdir --privileged rustvmm/dev:v44
root@05334d5ffd14:/workdir# python3 -c "import socket; s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM); s.bind((1, socket.VMADDR_PORT_ANY))"
$ lsmod | grep vsock
vsock_loopback 12288 0
vmw_vsock_virtio_transport_common 61440 1 vsock_loopback
vmw_vsock_vmci_transport 40960 0
vmw_vmci 106496 1 vmw_vsock_vmci_transport
vsock 69632 3 vmw_vsock_virtio_transport_common,vsock_loopback,vmw_vsock_vmci_transport
ssm-user@ip-172-31-68-60:/var/snap/amazon-ssm-agent/7994$
So it does seem to get loaded even from the docker container.
@roypat thanks for checking.
Looking at the commit where the CI failed, the docker container was started in this way:
docker run -t -i --rm --init --volume /var/lib/buildkite-agent/builds/ip-172-31-66-96-2/rust-vmm/vhost-device-ci:/workdir --workdir /workdir --label com.buildkite.job-id=01913daa-2e9d-4bf3-808b-eaf6aca7f0ff rustvmm/dev:v35 /bin/sh -e -c cargo\ test\ --all-features\ --workspace
So I'm not sure if for example --privileged
allowed the AF_VSOCK socket to be created in your test, while it failed in the pipeline.
Yes, that sounds quite plausible. I'll make a note to look into having the CI runners just load the vsock module on startup! (although might not get to it this or next week :( )
Yes, that sounds quite plausible. I'll make a note to look into having the CI runners just load the vsock module on startup! (although might not get to it this or next week :( )
Great! BTW no rush at all, we already merged this PR, and we skipped the tests if we can't use AF_VSOCK.
Summary of the PR
Hi. I have been working on adding AWS Nitro Enclave emulation support in QEMU. Nitro enclave VMs are spawned from EC2 VMs and the enclave VMs always see the parent VM as CID 3 and the applications in the enclave VM talk to the parent VM using AF_VSOCK to CID 3. More details about this are in the cover letter for the patch posted in qemu-devel.
From the QEMU side, it would be cumbersome to run a separate VM with CID 3 with the necessary applications while running a nitro-enclave VM. So there was some discussion in the virtualization mailing list about how we could make the UX simpler for users without requiring to run a separate VM with CID 3 and instead run the applications of the parent EC2 VM in the host machine. At the end, it was suggested by Stefano that we could use vhost-user-vsock for vsock emulation in QEMU for nitro-enclave and then in vhost-device-vsock we could add support for forwarding the vsock packets to the host machine. The discussion can be found in this thread.
This PR adds 2 new options to vhost-device-vsock:
forward-cid
andvsock-listen
.forward-cid: When this option is given (usually with the value 1), all connections from the guest VM will be forwarded to this CID to the same ports regardless of their destination CID.
forward-listen: This is a string of port numbers separated by '+'. The vhost-device-vsock will be listening to these ports in the host machine and the connections will be forwarded to the guest VM.
For testing, first run the vhost-device-vsock from this branch:
Then run a VM:
Now you can run a AF_VSOCK server in the host machine and a client inside the VM that connects to the same port but any CID (e.g., CID 3) and verify that the server and client communication works. You can run a server inside the VM listening to port 9001 and run a client in host that connects to CID 1 port 9001 and verify that the communication works.
Requirements
Before submitting your PR, please make sure you addressed the following requirements:
git commit -s
), and the commit message has max 60 characters for the summary and max 75 characters for each description line.unsafe
code is properly documented.