rust-vmm / vhost-device

'vhost-user' device backends workspace
Apache License 2.0
67 stars 46 forks source link

vhost-device-vsock: increase max queue size to 1024 #679

Closed dorjoy03 closed 2 months ago

dorjoy03 commented 2 months ago

Summary of the PR

Please summarize here why the changes in this PR are needed.

When running "vhost-device-vsock" against QEMU's "vhost-user-vsock-device" (i.e., mmio version, not the pci one) causes an error output on the vhost-device-vsock terminal:

[2024-06-30T10:57:54Z ERROR vhost_device_vsock] Fatal error: failed to handle request: invalid parameters 

And some errors on the QEMU terminal too like below:

qemu-system-x86_64: Failed to read msg header. Read -1 instead of 12.
Original request 1.
qemu-system-x86_64: Failed to write msg. Wrote -1 instead of 20.
qemu-system-x86_64: vhost VQ 1 ring restore failed: -22: Invalid argument (22)
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost VQ 0 ring restore failed: -22: Invalid argument (22)
qemu-system-x86_64: Error starting vhost: 71
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost_set_vring_call failed 22
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost_set_vring_call failed 22

I debugged this on the QEMU side and traced it back to trying to set the vring queue size to 1024 and I think the "InvalidParameter" error comes from the "set_vring_num" function in vhost-device-vsock. The commands to reproduce this are below:

shell1: ./target/release/vhost-device-vsock --vm guest-cid=4,uds-path=/tmp/vm4.vsock,socket=/tmp/vhost4.socket

shell2: ./qemu-system-x86_64 -M microvm,memory-backend=mem0 -kernel ./bzImage -nographic -append "console=ttyS0 nokaslr" -initrd ./initramfs.cpio.gz -m 4G --enable-kvm -cpu host -object memory-backend-memfd,id=mem0,size=4G -chardev socket,id=char0,reconnect=0,path=/tmp/vhost4.socket -device vhost-user-vsock-device,chardev=char0

I built the bzImage from linux kernel 6.9.3. From what I understand, in the linux driver side in "drivers/virtio/virtio_mmio.c" the "vm_setup_vq" function first reads the max queue size (VIRTIO_MMIO_QUEUE_NUM_MAX) and then sets it to be the queue size (VIRTIO_MMIO_QUEUE_NUM). In the QEMU side, the max queue size is in "include/hw/virtio/virtio.h" "VIRTQUEUE_MAX_SIZE" which is 1024. I think it makes sense to just increase the max queue size in vhost-device-vsock too. The errors go away if I do so.

Edit: Per review comments, a new queue-size option is introduced and the default value of queue size has been changed to 1024

Requirements

Before submitting your PR, please make sure you addressed the following requirements:

stefano-garzarella commented 2 months ago

We already discussed this change here: https://github.com/rust-vmm/vhost-device/issues/458

The VhostUserBackend::max_queue_size is used to allocate structures, so increasing it will consume more memory. 1024 seems very big, so we should not set it as default. As we discussed, the best way should be to add a parameter to control it.

stefano-garzarella commented 2 months ago

We already discussed this change here: #458

The VhostUserBackend::max_queue_size is used to allocate structures, so increasing it will consume more memory. 1024 seems very big, so we should not set it as default. As we discussed, the best way should be to add a parameter to control it.

So, I'd suggest a new queue-size parameter, something like this:

./target/release/vhost-device-vsock --vm guest-cid=4,uds-path=/tmp/vm4.vsock,socket=/tmp/vhost4.socket,queue-size=1024
epilys commented 2 months ago

@stefano-garzarella +1 on giving an argument instead. By the way have you measured how much extra memory it'd need? Maybe it's negligible.

stefano-garzarella commented 2 months ago

@stefano-garzarella +1 on giving an argument instead. By the way have you measured how much extra memory it'd need? Maybe it's negligible.

@epilys this is a good point. I didn't measure it. Do you suggest a proper way? (for generic exec I used smem in the past, not sure if for rust we have some analyzing tool)

epilys commented 2 months ago

@stefano-garzarella I think smem should be enough but there's also bytehound, heaptrack

https://github.com/koute/bytehound

https://github.com/KDE/heaptrack

stefano-garzarella commented 2 months ago
PID User     Command                           Swap      USS      PSS      RSS 

# before this PR
$ smem | grep vhost
86936 stefano  ./target/release/vhost-devi        0     1596     1644     3640 

# after this PR
$ smem | grep vhost
87359 stefano  ./target/release/vhost-devi        0     1704     1753     3768 

USS is 108 KB higher, so around 6.8% more. Given that it's 1.7 MB in the end (vs 1.6 MB), maybe we can use 1024 by default, but still add a parameter for those use cases where we want to reduce it?

dorjoy03 commented 2 months ago
PID User     Command                           Swap      USS      PSS      RSS 

# before this PR
$ smem | grep vhost
86936 stefano  ./target/release/vhost-devi        0     1596     1644     3640 

# after this PR
$ smem | grep vhost
87359 stefano  ./target/release/vhost-devi        0     1704     1753     3768 

USS is 108 KB higher, so around 6.8% more. Given that it's 1.7 MB in the end (vs 1.6 MB), maybe we can use 1024 by default, but still add a parameter for those use cases where we want to reduce it?

Yeah agree that a new parameter makes sense. I will try to add the new parameter and keep 1024 as default. Thanks!

dorjoy03 commented 2 months ago

@stefano-garzarella @epilys Thanks for reviewing.

I have now introduced a "queue-size" option and set the default queue size to 1024 in the latest changes. I don't have any experience in Rust but the changes are simple enough (I mostly followed tx_buffer_size). If you have any review comments, it would be great if you could give me code examples too in the comment if the change is not obvious. Thanks!

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

epilys commented 2 months ago

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

A lot of the errors and log messages are indeed not as descriptive as they could be. If you want to change this particular case we'd love a PR! Otherwise we can open a ticket and someone will get to it eventually :)

dorjoy03 commented 2 months ago

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

A lot of the errors and log messages are indeed not as descriptive as they could be. If you want to change this particular case we'd love a PR! Otherwise we can open a ticket and someone will get to it eventually :)

I probably won't look into it now so I suggest opening a ticket. Thanks.

stefano-garzarella commented 2 months ago

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

A lot of the errors and log messages are indeed not as descriptive as they could be. If you want to change this particular case we'd love a PR! Otherwise we can open a ticket and someone will get to it eventually :)

I probably won't look into it now so I suggest opening a ticket. Thanks.

@dorjoy03 can you open it and describe the issue?

stefano-garzarella commented 2 months ago

The PR LGTM, but I'd remove Fixes: #202 since other devices need to be updated too. Maybe we can just mention it.

Also, what about splitting the commit in two? The first one you introduce the new parameter, leaving the default to 256. The second commit will increase it to 1024 explaining why. So in the future, if we need to return to 256, we can just revert the second patch.

dorjoy03 commented 2 months ago

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

A lot of the errors and log messages are indeed not as descriptive as they could be. If you want to change this particular case we'd love a PR! Otherwise we can open a ticket and someone will get to it eventually :)

I probably won't look into it now so I suggest opening a ticket. Thanks.

@dorjoy03 can you open it and describe the issue?

No problem. Will do.

The PR LGTM, but I'd remove Fixes: #202 since other devices need to be updated too. Maybe we can just mention it.

Good point. I did not realize the ticket was about other daemons as well. Will remove.

Also, what about splitting the commit in two? The first one you introduce the new parameter, leaving the default to 256. The second commit will increase it to 1024 explaining why. So in the future, if we need to return to 256, we can just revert the second patch.

Yep, will split into 2.

dorjoy03 commented 2 months ago

@stefano-garzarella I have now split the changes into 2 commits and submitted an issue about making the error logs more descriptive https://github.com/rust-vmm/vhost-device/issues/682

stefano-garzarella commented 2 months ago

@epilys your "Approve" is still valid for GH, but I'd like to double-check before merge this. Is this version okay for you?

epilys commented 2 months ago

@stefano-garzarella was already on it, thanks!

stefano-garzarella commented 2 months ago

@dorjoy03 thanks for this PR!