shawnanastasio / libkvmchan

An implementation of the Xen vchan API on KVM

Not so careful review comments #1

Open marmarek opened 5 years ago

marmarek commented 5 years ago

I've briefly looked at the code and have some random comments:

Design, interface

On all subsequent operations, this copy is compared to the one in the shared memory region.

If something isn't supposed to change, it should be enough to use a local copy - which is guaranteed not to change - instead of verifying it each time, which a) may be forgotten, and b) will reduce performance.

For reference, Xen headers: Shared ring structure: https://github.com/xen-project/xen/blob/master/xen/include/public/io/libxenvchan.h Library own data: https://github.com/xen-project/xen/blob/master/tools/libvchan/libxenvchan.h

cons and prod are used from the shared ring (and bounds-checked), but order and other metadata are stored locally, since the other end isn't supposed to change them.

This probably means struct ringbuf should actually be two structures - one shared and one not. In particular, the eventfd-related fields should be in the non-shared one (you already have a TODO comment about that).

I'd also encourage you to not keep any pointers in the shared structure - only offsets/indexes. Even if you validate pointers properly, they will leak some info about the memory layout of the other side.

Generally, the libxenvchan protocol works lock-less only because each index is written by only one side of the connection; the other side only reads it. If you have anything that can be written by both sides, you need to be very careful to avoid all kinds of race conditions. In many cases it's impossible to do that without additional locks. To the point: I'd encourage you to drop the full flag, at the cost of 1 byte of ring space - i.e. pos_start==pos_end means the ring is empty, pos_end==pos_start-1 means the ring is full.
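To illustrate what I mean (field names here are made up for the example, not taken from your struct ringbuf):

#include <stdint.h>

struct ring_shared {              /* illustrative layout, not the real struct ringbuf */
    volatile uint32_t pos_start;  /* read index,  written only by the consumer */
    volatile uint32_t pos_end;    /* write index, written only by the producer */
};

/* One slot of ring space is sacrificed, so no shared "full" flag is needed. */
static inline int ring_is_empty(uint32_t start, uint32_t end) {
    return start == end;
}

static inline int ring_is_full(uint32_t start, uint32_t end, uint32_t size) {
    return (end + 1) % size == start;   /* one slot always stays unused */
}

Each side only ever writes its own index, so no lock is needed as long as reads/writes of the indexes themselves are atomic.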

For the Qubes OS use case, it's necessary to support stream mode, not only packet mode. In short, it should be possible either to get the available data/space size (via a public interface) - preferred - or to have read/write handle data on a best-effort basis, doing partial reads/writes if possible and returning how much data was handled. See read(2) and write(2) semantics.
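For illustration only - these names are invented, but the shape mirrors what libxenvchan already exposes (libxenvchan_data_ready()/libxenvchan_buffer_space()):

#include <stddef.h>
#include <sys/types.h>

struct libkvmchan;   /* invented handle name, purely for the sketch */

/* Preferred option: expose the available sizes directly. */
int libkvmchan_data_ready(struct libkvmchan *chan);    /* bytes readable right now */
int libkvmchan_buffer_space(struct libkvmchan *chan);  /* bytes writable right now */

/* Alternative: best-effort stream read/write, returning how many bytes were
 * actually handled, with read(2)/write(2)-like semantics. */
ssize_t libkvmchan_read(struct libkvmchan *chan, void *buf, size_t count);
ssize_t libkvmchan_write(struct libkvmchan *chan, const void *buf, size_t count);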

Implementation

When dealing with shared memory, it's critical to validate data only when the other side can no longer change it. In many places you copy the whole structure (which is safe, but may be slow). But there are cases where you don't do that: https://github.com/shawnanastasio/libkvmchan/blob/953fc36f5c11a2d02b9bac1cdb38e27c989c6621/ringbuf.c#L279-L286 You read both indexes here multiple times. In ringbuf_read_sec and ringbuf_write_sec you call it directly on the shared buffer - so the other end may freely manipulate them in the meantime. At the very least it means that it may initially appear there is data/space in the buffer, but during the ringbuf_sec_validate_untrusted_* call the indexes may be modified so that no data/space is there. Then during ringbuf_read/ringbuf_write you'll hit a busy-loop waiting for data/space, and since it operates on a local buffer copy, it will never arrive.
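A rough sketch of the pattern I mean - snapshot the untrusted indexes exactly once and then reason only about the local copies (illustrative names, not your actual ringbuf.c):

#include <stddef.h>
#include <stdint.h>

struct ring_shared_hdr {              /* illustrative, not the real ringbuf layout */
    volatile uint32_t pos_start;
    volatile uint32_t pos_end;
};

/* Free space computed from a one-time snapshot of the untrusted indexes. */
static size_t ring_free_space(const struct ring_shared_hdr *shr, size_t size) {
    uint32_t start = shr->pos_start;   /* read the shared index exactly once */
    uint32_t end   = shr->pos_end;     /* read the shared index exactly once */

    if (start >= size || end >= size)  /* validate the local copies only */
        return 0;                      /* treat corrupted indexes as "no space" */

    return (start > end) ? (start - end - 1)
                         : (size - (end - start) - 1);
}

Whatever the other side does to the shared indexes afterwards can no longer change the values this function reasons about.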

Using busy-wait is a bad idea. For the PoC phase it may be OK, but for production use it is not acceptable, regardless of the usleep() parameter. There must be some notification mechanism, without the need for constant polling.
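For example, something along these lines (just a sketch - the fd would come from whatever notification transport you end up using, e.g. an ivshmem doorbell or a daemon socket):

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Block until the peer signals the eventfd (or timeout_ms expires).
 * Returns 1 if signaled, 0 on timeout, -1 on error. */
static int wait_for_peer(int event_fd, int timeout_ms) {
    struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);
    if (ret <= 0)
        return ret;

    uint64_t count;
    if (read(event_fd, &count, sizeof count) < 0)   /* drain the eventfd counter */
        return -1;
    return 1;
}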

That said, ringbuf_clear_eventfd lacks a pthread_join - it leaks threads.

I have not verified safety of all the index and pointers handling at this stage.

VM-VM communication, connection setup

In its current shape, it works only host-guest. For proper qrexec support, it needs guest-guest connections. AFAIU it should be possible, if both guests are given shared memory backed by the same host memory (mem-path argument for the memory-backend-file object), right? Right now the Qubes OS abstraction over libvchan does not include an interface for host-side setup/cleanup of a guest-guest channel, but it shouldn't be a problem to add one. Is it possible to attach/detach ivshmem devices at runtime (for example with the device_add/device_del QMP commands)? Otherwise, there would be a need to set up a predefined set of devices, which would limit the connection count (especially painful in the guest-guest case).

shawnanastasio commented 5 years ago

Thanks for the feedback! I'll try to address each of your points here.

If something isn't supposed to change, it should be enough to use a local copy - which is guaranteed not to change - instead of verifying it each time, which a) may be forgotten, and b) will reduce performance.

For reference, Xen headers: Shared ring structure: https://github.com/xen-project/xen/blob/master/xen/include/public/io/libxenvchan.h Library own data: https://github.com/xen-project/xen/blob/master/tools/libvchan/libxenvchan.h

cons and prod are used from the shared ring (and bounds-checked), but order and other metadata are stored locally, since the other end isn't supposed to change them.

This probably means struct ringbuf should actually be two structures - one shared and one not. In particular, the eventfd-related fields should be in the non-shared one (you already have a TODO comment about that).

This can certainly be done. I'll create a separate issue to track progress on this.

I'd also encourage you to not keep any pointers in the shared structure - only offsets/indexes. Even if you validate pointers properly, they will leak some info about the memory layout of the other side.

I hadn't considered the potential for information leaking here. This will get a separate issue too.

Generally, the libxenvchan protocol works lock-less only because each index is written by only one side of the connection; the other side only reads it. If you have anything that can be written by both sides, you need to be very careful to avoid all kinds of race conditions. In many cases it's impossible to do that without additional locks. To the point: I'd encourage you to drop the full flag, at the cost of 1 byte of ring space - i.e. pos_start==pos_end means the ring is empty, pos_end==pos_start-1 means the ring is full.

I completely missed the potential race here. This will also get a separate issue.

For the Qubes OS use case, it's necessary to support stream mode, not only packet mode. In short, it should be possible either to get the available data/space size (via a public interface) - preferred - or to have read/write handle data on a best-effort basis, doing partial reads/writes if possible and returning how much data was handled. See read(2) and write(2) semantics.

Indeed, stream mode is planned. It is my understanding that Qubes' libvchan exposes both of these interfaces (POSIX read/write semantics as well as getting the size). For completeness, it would probably be best to implement both of these, I suppose.

You read both indexes here multiple times. In ringbuf_read_sec and ringbuf_write_sec you call it directly on the shared buffer - so the other end may freely manipulate them in the meantime.

Great catch, this is a tricky one. It seems that ringbuf_free_space will need to be changed to read each value only once, and the relevant ringbuf_*_sec functions will need to be updated to handle a situation in which the pos_* values change after they've been checked (or remove the ability to change them in the first place). I'll create an issue for this.

Using busy-wait is a bad idea. For the PoC phase it may be OK, but for production use it is not acceptable, regardless of the usleep() parameter. There must be some notification mechanism, without the need for constant polling.

Yeah, this is definitely temporary. From my initial inspection, ivshmem does provide an interrupt facility for VM-VM communication but not for host-VM communication. Perhaps a separate notification mechanism that supports all cases will have to be used.

That said, ringbuf_clear_eventfd lacks a pthread_join - it leaks threads.

That's embarrassing. I'll fix that ASAP.

In its current shape, it works only host-guest. For proper qrexec support, it needs guest-guest connections. AFAIU it should be possible, if both guests are given shared memory backed by the same host memory (mem-path argument for the memory-backend-file object), right?

This is correct. In fact, the library already supports this if a user were to use uio handles for both the client and server. The whole interface is a WIP though, and the end goal is, of course, to emulate the existing libvchan interface and allow vchans to be addressed by domain number and devno.

On this note, I have begun work on a daemon that runs on the host to facilitate this. The idea is to have a dedicated ivshmem device on each VM that's used solely for communication with the daemon. The daemon would provide facilities for resolving domain numbers to the appropriate shared memory handles, as well as for allocating memory within VM-VM ivshmem regions to allow the creation of an arbitrary number of ring buffers (within memory constraints, of course).
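As a very rough sketch of what I have in mind (none of this exists yet - the names and layout are placeholders):

#include <stdint.h>

enum kvmchand_cmd {
    KVMCHAND_CMD_RESOLVE_DOMAIN,  /* domain number -> shared memory handle        */
    KVMCHAND_CMD_ALLOC_RING,      /* carve a ring buffer out of an ivshmem region */
    KVMCHAND_CMD_FREE_RING,
};

struct kvmchand_request {
    uint32_t cmd;       /* one of enum kvmchand_cmd */
    uint32_t domain;    /* peer domain number       */
    uint32_t devno;     /* vchan device/port number */
    uint64_t size;      /* requested ring size for ALLOC_RING */
};

struct kvmchand_response {
    int64_t  result;    /* negative errno-style value on failure */
    uint64_t offset;    /* offset of the allocation within the shared region */
};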

Currently, the daemon uses libvirt to detect VM configuration and resolve domain numbers. I must admit I'm not too familiar with Qubes' existing VM management, but it seems it also uses libvirt? If that's the case, hopefully this architecture isn't too problematic.

Is it possible to attach/detach ivshmem devices at runtime (for example with the device_add/device_del QMP commands)? Otherwise, there would be a need to set up a predefined set of devices, which would limit the connection count (especially painful in the guest-guest case).

This is also something that I'm investigating. The goal is to integrate this functionality into the aforementioned daemon, so that upon vchan creation, the library can transparently communicate with the daemon and have it ensure that an appropriate ivshmem device exists between the desired domain ID and the caller, creating one if necessary.

Thank you again for your comments.

marmarek commented 5 years ago

Indeed, stream mode is planned. It is my understanding that Qubes' libvchan exposes both of these interfaces (POSIX read/write semantics as well as getting the size). For completeness, it would probably be best to implement both of these, I suppose.

Yes, but given an interface for getting sizes, it's trivial to emulate the POSIX one. With only POSIX read/write, you can't easily get the sizes.
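For example (reusing the invented names from the earlier sketch, plus a hypothetical exact-length receive):

#include <stddef.h>
#include <sys/types.h>

struct libkvmchan;                                         /* invented handle type  */
int libkvmchan_data_ready(struct libkvmchan *chan);        /* size query (invented) */
ssize_t libkvmchan_recv(struct libkvmchan *chan, void *buf, size_t count); /* exact read (invented) */

/* POSIX-style best-effort read built on top of the size query. */
ssize_t stream_read(struct libkvmchan *chan, void *buf, size_t count) {
    int avail = libkvmchan_data_ready(chan);
    if (avail < 0)
        return -1;
    if (avail == 0)
        return 0;                                          /* or block until notified */

    size_t n = ((size_t)avail < count) ? (size_t)avail : count;
    return libkvmchan_recv(chan, buf, n);                  /* cannot block: n <= avail */
}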

On this note, I have begun work on a daemon that runs on the host to facilitate this. The idea is to have a dedicated ivshmem device on each VM that's used solely for communication with the daemon. The daemon would provide facilities for resolving domain numbers to the appropriate shared memory handles, as well as for allocating memory within VM-VM ivshmem regions to allow the creation of an arbitrary number of ring buffers (within memory constraints, of course).

That's also a possibility. But to be honest, I really like the idea of adjusting the libvchan API to require host coordination in setting up VM-VM connections. Right now (in the Xen case), it is theoretically possible for two cooperating VMs to establish a vchan connection without dom0 approval. Improving that would be a good thing. This does not exclude a daemon coordinating it.

I'm not too familiar with Qubes' existing VM management, but it seems it also uses libvirt?

Yes.

shawnanastasio commented 5 years ago

That's also a possibility. But to be honest, I really like the idea of adjusting the libvchan API to require host coordination in setting up VM-VM connections. Right now (in the Xen case), it is theoretically possible for two cooperating VMs to establish a vchan connection without dom0 approval. Improving that would be a good thing.

Interesting. How would you envision such an interface? Something along the lines of the following?:

int libvchan_can_reach_domain(struct libvchan *ctrl, int domain);
int libvchan_establish_connection(struct libvchan *ctrl, int domain);

Where the former can be used to determine whether or not a VM-VM connection between the caller and the given domain already exists, and the latter can be used to establish a VM-VM connection (with assistance/permission from dom0)?

marmarek commented 5 years ago

To get ctrl, you need to actually establish a vchan connection (using `libvchan_*_init`), so it isn't appropriate. I think it should rather be separate functions like:

/*
 * call on host to allow *domain_server* to call libvchan_server_init(domain_client, port, ...) 
 * and *domain_client* to call libvchan_client_init(domain_server, port)
 */
int libvchan_setup_connection(int domain_server, int domain_client, int port);
/* call on the host after *domain_server* <-> *domain_client* connection on *port* is closed */
int libvchan_cleanup_connection(int domain_server, int domain_client, int port);

I'm not sure about the return value - maybe some struct, which is then given as an argument to libvchan_cleanup_connection? Also, I'm not sure what the intended behavior should be if libvchan_cleanup_connection is called before libvchan_close inside the VMs. It should either be an error, or forcibly close the connection.
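For illustration, host-side usage could look roughly like this (error handling and the handle question above left open):

/* Hypothetical host-side flow for the proposed pair of calls. */
int authorize_and_teardown(int domain_server, int domain_client, int port) {
    /* Host authorizes the pair; afterwards the guests may call
     * libvchan_server_init(domain_client, port, ...) and
     * libvchan_client_init(domain_server, port). */
    if (libvchan_setup_connection(domain_server, domain_client, port) < 0)
        return -1;

    /* ... guests communicate, eventually both sides call libvchan_close() ... */

    /* Host tears down the authorization (and any ivshmem plumbing). */
    return libvchan_cleanup_connection(domain_server, domain_client, port);
}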

shawnanastasio commented 5 years ago

To get ctrl, you need to actually establish a vchan connection (using `libvchan_*_init`), so it isn't appropriate.

Oh, right. That wouldn't make much sense indeed.

I'm not sure about the return value - maybe some struct, which is then given as an argument to libvchan_cleanup_connection?

Yeah, I think some opaque pointer/handle type would be fine.

Also, I'm not sure what the intended behavior should be if libvchan_cleanup_connection is called before libvchan_close inside the VMs. It should either be an error, or forcibly close the connection.

That's an interesting question. My initial impression is that allowing libvchan_cleanup_connection while vchans are still open seems like bad program design, and denying it would allow certain classes of bugs to be discovered (for example, failure to properly manage open vchans).

On the other hand, it is likely the case that multiple applications will need to open vchans to the same domain, so in order for an application to safely call cleanup it would have to know of all the other applications that can open vchans and kill them/wait for them to close.

Another option I see is introducing some sort of reference counter that is incremented every time an application calls setup and decremented on cleanup. That way when the counter reaches 0, the connection can be safely closed. This of course introduces a large potential for leaks which may end up being an even worse design, though.

In the end, you're much more familiar with the Qubes ecosystem and how applications utilize vchans, so I'll defer to your judgement on this one.

marmarek commented 5 years ago

On the other hand, it is likely the case that multiple applications will need to open vchans to the same domain

That's what you have the port number for. And by the way, this mechanism could also be used for port allocation - right now qrexec does it itself, but it doesn't necessarily need to stay this way.

shawnanastasio commented 5 years ago

That's what you have the port number for.

Oh alright, that makes sense then.

And by the way, this mechanism could also be used for port allocation - right now qrexec does it itself, but it doesn't necessarily need to stay this way.

Could you elaborate on that? Do you mean it would be possible to remove the used_vchan_ports array (here) and the associated management of it and replace it with the proposed libvchan API changes?

If so, would you introduce another libvchan API call to return the first free port for a given domain pair? Without such a function, it seems qrexec would need a way to determine whether a port is free, perhaps by calling libvchan_setup_connection with increasing port values until the call doesn't fail.

marmarek commented 5 years ago

Could you elaborate on that? Do you mean it would be possible to remove the used_vchan_ports array (here) and the associated management of it and replace it with the proposed libvchan API changes?

Yes.

If so, would you introduce another libvchan API call to return the first free port for a given domain pair?

Or make libvchan_setup_connection arrange it. Like, make the port parameter in-out, with an initial value of 0 or such meaning "please allocate one".

shawnanastasio commented 5 years ago

Or make libvchan_setup_connection arrange it. Like, make the port parameter in-out, with an initial value of 0 or such meaning "please allocate one".

Gotcha. Though I must cast my vote against the specific in-out semantics in this case, since it seems a bit clumsy.

Perhaps retain port as a regular int, and when 0 is passed, interpret the return value as the newly allocated port. POSIX open(2) semantics could potentially be emulated in this regard, and -1 could signify failure.

Semantics aside, I think this is a reasonable set of API changes and moving port allocation out of qrexec seems like a definite improvement. Is the Qubes project likely to implement them? If so, I can add them to the roadmap for this project too.

marmarek commented 5 years ago

Perhaps retain port as a regular int, and when 0 is passed, interpret the return value as the newly allocated port. POSIX open(2) semantics could potentially be emulated in this regard, and -1 could signify failure.

But then the idea of returning some opaque struct pointer for the libvchan_cleanup_connection call wouldn't work... Or maybe that pointer should be yet another argument (output only)?

Is the Qubes project likely to implement them?

Yes, including adjusting qrexec and vchan for Xen, if that would make the KVM part more solid and perhaps even simpler.

shawnanastasio commented 5 years ago

Or maybe that pointer should be yet another argument (output only)?

Yeah, that seems reasonable to me.
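So, pulling the pieces of this sub-thread together, something roughly like this (just a sketch, nothing settled):

struct libvchan_conn;   /* opaque handle, name to be decided */

/* port == 0 asks the host to allocate one; on success the allocated port is
 * returned (open(2)-style, -1 on failure) and the opaque handle comes back
 * through the output-only pointer argument. */
int libvchan_setup_connection(int domain_server, int domain_client,
                              int port, struct libvchan_conn **conn_out);

int libvchan_cleanup_connection(struct libvchan_conn *conn);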

Yes, including adjusting qrexec and vchan for Xen, if that would make the KVM part more solid and perhaps even simpler.

Glad to hear. The API changes certainly match the semantics of KVM/ivshmem more closely, so you've got a +1 from me :)