stefanberger / swtpm

Libtpms-based TPM emulator with socket, character device, and Linux CUSE interface.

Feature proposal: pluggable storage backends #461

Closed: franciozzy closed this issue 2 years ago

franciozzy commented 3 years ago

Overview

Today SWTPM only supports writing TPM state as files in a filesystem. That is too restrictive for deployments that store VM data in external storage infrastructures accessible via NVMeOF, iSCSI, S3, etc. Using external storage is particularly relevant for VMs that are protected by high-availability features, where a host crash is handled by VMs restarting elsewhere. In those cases, the TPM data must be synchronously stored outside of the host.

Mounting external storage directly on the host and using it with a general-purpose filesystem (so that SWTPM can be used as is) can violate security certifications and raises other operability, scalability, and maintainability concerns. It is much preferable to plug a storage driver directly into SWTPM and have it access the external storage directly from userspace.

Proposal

We would like to propose an abstraction layer with an interface similar (or identical) to the one offered by swtpm_nvfile.h. The default implementation would continue to work as it does today, but plugins could be written to implement arbitrary storage backends. One possibility is to enhance the SWTPM command-line interface to indicate which storage plugin to use.

We would like to use this Issue to discuss the architecture specification / design of this feature request. If the community is happy to see some code, we will gladly publish an RFC as a PR or separate GH branch on a fork.

stefanberger commented 3 years ago

Can you rename the title to 'Feature proposal'?

I understand the support for S3. Presumably one formulates a per-VM URL under which to store the state. How is this supposed to work with iSCSI, with multiple VMs' vTPM state and the assignment and management of blocks/sectors on a storage device? I know that you are very interested in this type of backend, so I'd be curious to learn how this is supposed to work.

Otherwise my suggestion would be that we define an interface for the storage and push all the storage handling behind that interface, leaving encryption/decryption to the core code in swtpm, so that all data is already encrypted (if the user chooses so) before it reaches the backends for writing. We could compile the storage backends as shared libraries and load them as needed.
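Roughly, such an interface could look something like the following sketch (names are purely illustrative and not swtpm's actual API; the core would hand the backend data that is already encrypted if the user chose encryption):

```c
/* Illustrative only -- not swtpm's actual API. Each backend would be a
 * shared library exporting a table of operations like this. */
#include <stddef.h>
#include <stdint.h>

struct swtpm_storage_ops {
    /* parse backend-specific options from the command line, e.g. "uri=..." */
    int  (*init)(const char *options);
    /* read a named state blob (e.g. "permall", "volatilestate") */
    int  (*read_blob)(const char *name, uint8_t **buf, size_t *len);
    /* atomically write (replace) a named state blob */
    int  (*write_blob)(const char *name, const uint8_t *buf, size_t len);
    /* remove a named state blob */
    int  (*delete_blob)(const char *name);
    /* flush pending writes and release resources */
    void (*cleanup)(void);
};

/* Well-known entry point that swtpm would dlsym() after dlopen()ing the
 * backend shared library. */
const struct swtpm_storage_ops *swtpm_storage_get_ops(void);
```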

Another important aspect I'd be curious about is how this all looks on the swtpm command line, so that it fits all of the possible backends. We could have a backend option like --tpmstate dir=<dir>,backend=file, which is implicitly what it is now, or one for S3 like --tpmstate backend=s3,uri=<S3URI> (https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html). Presumably there's then a per-backend config file on the system (~/.config/..., /etc/...) that tells it where and what the credentials are, so we don't have to put them on the command line. The config file would be accessed only by the backend.

franciozzy commented 3 years ago

Can you rename the title to 'Feature proposal'?

Done.

I understand the support for S3. Presumably one formulates a per-VM URL under which to store the state.

Exactly what I had in mind. The URL could be a bucket identifier, with each SWTPM "name" translating to an object in the bucket. Alternatively, all state could be a single object. In this case, we'd need some sort of format to identify the different "names" within the object (see below).

How is this supposed to work with iSCSI, with multiple VMs' vTPM state and the assignment and management of blocks/sectors on a storage device?

What I have in mind is somewhat similar to the second option described for S3 above. SWTPM can be given a URI for an iSCSI LUN. The LUN (for an iSCSI disk) is just a storage area where you can read/write sector-sized data. It then comes down to defining a format for the storage area. Given there's only one process accessing the area at a time, we don't have to worry about concurrency and the format could be trivial. For example:

| Sector | Byte offset | Description |
|--------|-------------|-------------|
| 0 | 0 | Header |
| 1 | 512 | Allocation table |
| 2 | 1024 | Data |
| 3 | 1536 | Data |
| ... | ... | ... |

Header can contain a magic number, format version, creation date, etc. -- whatever metadata we decide is relevant. The allocation table would indicate where the contents are. It could be an array of entries, each consisting of:

- NUL-terminated name (TPM_FILENAME_MAX bytes)
- Starting sector (4 bytes)
- Length in bytes (4 bytes)

Currently, TPM_FILENAME_MAX is 20 bytes (from src/tpm12/tpm_nvfile.h; is this different for TPM 2?), so each of these entries would take 28 bytes. You can fit 18 entries in a single-sector table (if my code reading is right, this should be more than enough, as libtpms doesn't use more than 3 names). That leaves 8 bytes at the end, which could be used for a checksum of the table if desirable. Most storage implementations will give you (at least) sector-size atomicity on writes, so that shouldn't be a concern. The checksum could provide some reassurance, though.
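To make this concrete, the header and table could be laid out roughly as follows (a sketch only; the field names, the magic value, and the exact header contents are placeholders, not a final format):

```c
#include <stdint.h>

#define SECTOR_SIZE       512
#define TPM_FILENAME_MAX  20
#define HDR_MAGIC         0x4d505453U   /* placeholder magic value */

struct lun_header {                     /* sector 0 */
    uint32_t magic;                     /* identifies a formatted device */
    uint32_t version;                   /* format version */
    uint64_t created;                   /* creation date, e.g. Unix time */
    /* ... whatever other metadata we decide is relevant ... */
};

struct alloc_entry {                    /* 20 + 4 + 4 = 28 bytes */
    char     name[TPM_FILENAME_MAX];    /* NUL-terminated blob name */
    uint32_t start_sector;              /* 0 = entry unused */
    uint32_t length;                    /* length in bytes */
};

struct alloc_table {                    /* sector 1 */
    struct alloc_entry entries[18];     /* 18 * 28 = 504 bytes */
    uint8_t            checksum[8];     /* optional table checksum */
};
```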

Creating a file is done by selecting free contiguous sectors and writing the data to them. The creation is committed by updating the allocation table with the corresponding "Starting sector". A full rewrite of a file can be done identically. This resembles the current implementation, where SWTPM writes to a temp file and then renames it. To avoid fragmentation issues, we can give files a max size and pre-divide the block device accordingly.
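Continuing the sketch above, the write-then-commit step could look roughly like this (plain pwrite(2)/fsync(2) on an open block device for brevity; a real iSCSI backend would issue the equivalent writes from userspace):

```c
#include <stdint.h>
#include <unistd.h>

/* Write a blob into its pre-assigned region, then commit by rewriting the
 * single-sector allocation table; sector-sized writes are atomic on most
 * storage, so the table update either lands completely or not at all. */
static int store_blob(int fd, struct alloc_table *tbl, int slot,
                      const void *data, uint32_t len, uint32_t start_sector)
{
    if (pwrite(fd, data, len, (off_t)start_sector * SECTOR_SIZE) != (ssize_t)len)
        return -1;
    if (fsync(fd) < 0)
        return -1;

    tbl->entries[slot].start_sector = start_sector;   /* commit point */
    tbl->entries[slot].length       = len;
    if (pwrite(fd, tbl, SECTOR_SIZE, 1 * SECTOR_SIZE) != SECTOR_SIZE)
        return -1;
    return fsync(fd);
}
```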

This is just a proposal. If we want more metadata about the files, we could look into simple-filesystem libraries instead.

I know that you are very interested in this type of backend, so I'd be curious to learn how this is supposed to work.

Yes, we would immediately provide code separating out the interface as you explained, plus the iSCSI module.

Otherwise my suggestion would be that we define an interface for the storage and push all the storage handling behind that interface, leaving encryption/decryption to the core code in swtpm, so that all data is already encrypted (if the user chooses so) before it reaches the backends for writing. We could compile the storage backends as shared libraries and load them as needed.

That sounds like a great plan.

Another important aspect I'd be curious about is how this all looks on the swtpm command line, so that it fits all of the possible backends. We could have a backend option like --tpmstate dir=<dir>,backend=file, which is implicitly what it is now, or one for S3 like --tpmstate backend=s3,uri=<S3URI> (https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html).

That's exactly what I have in mind.

Presumably there's then a per-backend config file on the system (~/.config/..., /etc/...) that tells it where and what the credentials are, so we don't have to put them on the command line. The config file would be accessed only by the backend.

Either that, or we can pass the data as environment variables. Or both, with priority given to the environment. This helps with security concerns; otherwise you may need to think about protecting the config file between instances of SWTPM (e.g. with SELinux).

stefanberger commented 3 years ago

Exactly what I had in mind. The URL could be a bucket identifier, with each SWTPM "name" translating to an object in the bucket. Alternatively, all state could be a single object. In this case, we'd need some sort of format to identify the different "names" within the object (see below).

At least for S3 I think there should be a per-VM bucket or different objects per bucket per VM -- presumably S3 supports that. It would be horrible to have to read a whole binary file and find the location where to insert the blob.

Also, blobs of swtpm do NOT have a constant size. They are < 2 KB at the beginning, and when one fills the TPM with data (keys, NVRAM, etc.) they will grow to several tens of KB, and they may grow further as libtpms develops and/or supports larger key sizes or other algorithms, or for any other reason. That said, anything that doesn't support per-VM blobs, or where management of locations (including moving of unrelated blobs) needs to be done, sounds like an 'adventure' to me. I am referring to low-level iSCSI here, where I get the feeling that exactly this needs to be done. In my opinion that's what files, blobs in buckets, or buckets are there for...

With all these complications, why is mounting a remote storage backend via iSCSI not possible? Are there sharing issues?

The storage backend would be transparent to QEMU, but you'll definitely have to do something different at the libvirt layer if that's a target you want to support. So do you have a concrete command line for iSCSI that you can show?

franciozzy commented 3 years ago

At least for S3 I think there should be a per-VM bucket or different objects per bucket per VM -- presumably S3 supports that. It would be horrible to have to read a whole binary file and find the location where to insert the blob.

My understanding is the same as yours regarding S3: a per-VM bucket is the way to go.

Also, blobs of swtpm do NOT have a constant size. They are < 2 KB at the beginning, and when one fills the TPM with data (keys, NVRAM, etc.) they will grow to several tens of KB, and they may grow further as libtpms develops and/or supports larger key sizes or other algorithms, or for any other reason.

This is absolutely fine. The model I pasted above can easily provide sufficient space. We can even allow for more than 18 files to begin with, but it doesn't look like that's needed. If it ever is, we can provide conversion mechanisms for existing LUNs. The header I mentioned can contain the size of a sector, the number of sectors for the allocation table, and the size to pre-allocate for each file. We can be very generous from the get-go and that's not a problem for external storage infrastructures as LUNs are generally thinly-provisioned.

As an example, if we reserve 10 MiB for each file (which sounds relatively future-proof) and support 18 files as proposed above, then, including the spare space for the "temp entry", that's 19 × 10 MiB = 190 MiB, i.e. a volume under 200 MiB per VM. This should be negligible in a VM's storage footprint.

That said, anything that doesn't support per-VM blobs, or where management of locations (including moving of unrelated blobs) needs to be done, sounds like an 'adventure' to me. I am referring to low-level iSCSI here, where I get the feeling that exactly this needs to be done. In my opinion that's what files, blobs in buckets, or buckets are there for...

Certainly for backend stores that support buckets and objects, that is much preferable. What I've proposed above is really not complicated; it's just a simplistic data store that meets the requirements of the TPM. If you are uncomfortable with that, we can always have a directory or branch with experimental (or not officially supported) "use at your own risk" plugins. The main ask is for the plugin interface to exist so users can adapt SWTPM to their needs. Worth noting, I have often found that maintaining such interfaces and experimental plugins in the official codebase tends to attract contributions and benefit the wider community.

With all these complications, why is mounting a remote storage backend via iSCSI not possible? Are there sharing issues?

There are many issues with that. I can discuss some examples. In order to mount the iSCSI LUN you need to 1) attach it via the kernel as a block device; and 2) format it with a filesystem.

The problems are:

- Stability. For a host running 100s of VMs, this means 100s of network-backed block devices attached through the kernel. A network unavailability will most likely require a host reboot to recover: it could result in 1000s of outstanding block commands hanging, leaving SWTPM instances in a "D" (unkillable) state.

- Operability. Unless you use a shared-disk filesystem, you can't mount the block device on two hosts. This makes live migration virtually impossible without several further changes to how QEMU works with SWTPM today. Shared-disk filesystems require specific host clustering technologies which may not be available.

- Security. Once the block devices are mounted and access to the filesystem is given to the user/group running SWTPM, restricting access to a single SWTPM instance requires further SELinux coordination.

The storage backend would be transparent to QEMU, but you'll definitely have to do something different at the libvirt layer if that's a target you want to support.

I haven't yet checked the state of libvirt support for SWTPM, but generally that comes later. We should focus on getting the right interface for SWTPM and then look at libvirt, in my opinion.

So do you have a concrete command line for iSCSI that you can show?

I liked what you proposed above, and I reckon it would look like this: --tpmstate backend=iscsi,uri=iscsi://hostname:port/iqn/lun (see libiscsi's reference). CHAP credentials could be passed via environment variables as previously discussed. If the header of the block device doesn't contain the magic number we expect to see, then the plugin can automatically format the device.
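For illustration, the probe could be as simple as the following (pread(2) for brevity; the real plugin would read sector 0 through libiscsi, and FORMAT_MAGIC stands for whatever magic number the format ends up using):

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define FORMAT_MAGIC 0x4d505453U        /* placeholder value */

/* Returns 1 if the device carries no recognizable header (so the plugin
 * should format it), 0 if it is already formatted, -1 on I/O error. */
static int device_needs_format(int fd)
{
    unsigned char sector0[512];
    uint32_t magic;

    if (pread(fd, sector0, sizeof(sector0), 0) != (ssize_t)sizeof(sector0))
        return -1;
    memcpy(&magic, sector0, sizeof(magic));   /* magic lives at offset 0 */
    return magic != FORMAT_MAGIC;
}
```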

stefanberger commented 3 years ago

My suggestion is then to define an interface for the storage backend while identifying the functions that need to move into it. Then we can morph the code so that it uses this interface for the current default backend and invokes it via a shared library through some sort of client stubs.

franciozzy commented 3 years ago

My suggestion is then to define an interface for the storage backend while identifying the functions that need to move into it.

Thanks. Per the original message, I think the interface should be the same as the one currently exposed in swtpm_nvfile.h. If there are no disagreements, we'll go ahead with a proposal on that.

Then we can morph the code so that it uses this interface for the current default backend and invokes it via a shared library through some sort of client stubs.

I agree that this is the way to go. New backends could even be disabled by default and only enabled at build time via configure.ac switches while they are experimental.

stefanberger commented 3 years ago

Though the following may still be something worth considering: qemu-storage-daemon

https://qemu.readthedocs.io/en/latest/tools/qemu-storage-daemon.html

It sounds like it would be able to make storage devices available via FUSE, for example, so that one can access the block device via a filesystem. I haven't tried it myself and don't know how well it handles concurrency, locking issues, and all the consequences of sharing a device among multiple clients. But the abstraction sounds like the right one to me... sorry, I am not the storage expert here, but anything outside of swtpm that solves the issue would still be my preferred solution.

stefanberger commented 3 years ago

The warning on the qemu-storage-daemon site is a bit discouraging though:

Warning: Never modify images in use by a running virtual machine or any other process; this may destroy the image. Also, be aware that querying an image that is being modified by another process may encounter inconsistent state.

stefanberger commented 3 years ago

I think you should consider a FUSE filesystem so that we don't need various types of storage backends in the swtpm code base, which seem more like a distraction to me anyway.

One such filesystem could be a FUSE filesystem syncing to one or multiple S3 buckets. This already exists: https://github.com/s3fs-fuse/s3fs-fuse. I think you should design a FUSE filesystem syncing via iSCSI. This way we can keep the POSIX API calls in swtpm, and the magic happens under the hood of the mounted filesystem.

franciozzy commented 3 years ago

Heya!

Sorry about the delay in getting you some code to show how this will work. It is now in the last rounds of internal review prior to posting.

The idea of using FUSE is quite interesting. We discussed it extensively before approaching you with the present idea. I can see from my notes that FUSE was the next preferred proposal, but it had (many) drawbacks compared to what we propose here.

Firstly, from an efficiency point of view, this is an unnecessary kernel hop: you don't need swtpm->vfs->fuse->network when you can have swtpm->network. Secondly, from a manageability point of view, a FUSE service is yet another process that needs to run on the hypervisor. It brings scalability concerns if you need one per VM, or stability/security concerns if you have one per host. That is, one per VM means another process that can die, eats resources, needs starting/stopping at the right times, etc. And one per host means a single point of failure from a functional point of view and a single attack surface from a security point of view. Finally, you need a full filesystem implementation that can be used by the FUSE service itself.

One way of reasoning about our proposal is to think about QEMU itself. It offers many pluggable backends for talking to storage. You can plug iSCSI, S3, NVMe, NFS, etc., into pretty much anything that needs storage (from VM disks to the NVRAM of UEFI). If swtpm were an integral part of QEMU, it would probably already support that.

I hope that when you see the patch you will be less worried about the complexity. Hopefully soon! 🤞

PiMaker commented 3 years ago

Hi everyone!

Just chiming in to note that we also require an interface just like this. Glad to see work is being done; at a quick glance the RFC PR already looks great!

For our use case specifically, it would be necessary to specify a single, pre-allocated block device as the target store. In the future, native Ceph RBD support might also be interesting, though it can be done with krbd and the aforementioned approach.

The model proposed by @franciozzy for iSCSI sounds like a good fit, though I'd go as far as making that another slim abstraction which you could put on top of any "block-device-like" structure, like an actual bdev, user-space iSCSI, RBD, etc... It might also be worth considering a user-defined limit on the maximum number of supported TPMs; AFAIU, for VM use cases that will always stay at a single one (so max 3 files in total, at least at the moment) anyway?
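Something as slim as the following would probably do (purely illustrative; the same on-device format code could then sit on top of an actual bdev, user-space iSCSI, RBD, and so on):

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal "block-device-like" abstraction: the format layer only ever
 * needs sector-granular reads/writes, a flush, and the device geometry. */
struct blockdev_ops {
    int  (*open)(const char *uri, void **handle);
    int  (*read)(void *handle, uint64_t sector, void *buf, uint32_t n_sectors);
    int  (*write)(void *handle, uint64_t sector, const void *buf, uint32_t n_sectors);
    int  (*flush)(void *handle);
    int  (*geometry)(void *handle, uint32_t *sector_size, uint64_t *num_sectors);
    void (*close)(void *handle);
};
```

A plain bdev implementation would just wrap pread/pwrite, while an iSCSI one would issue the corresponding SCSI reads and writes from userspace.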

I can see from my notes that FUSE was the next preferred proposal, but it had (many) drawbacks compared to what we propose here.

If swtpm were an integral part of QEMU, it would probably already support that.

We too had the idea of FUSE, but between some difficulties with our HA stack and the reasons already mentioned, we set it aside as well.

If I may ask: What is the actual reason 'swtpm' isn't part of QEMU itself? Having it as a separate program is certainly a bit of a hassle for VM management.

Worth noting, I have often found that maintaining such interfaces and experimental plugins in the official codebase tends to attract contributions and benefit the wider community.

Case in point: even just the RFC caught my attention :) We'd be glad to help out with developing the implementation related to our use cases.

pksnx commented 3 years ago

Thanks much for chiming in, @PiMaker, and glad to see that you have a similar use case! Good ideas on a slim generic block-device abstraction and on user-defined limits. Happy to hear that you're willing to help out; we're working on the iSCSI-specific backend, and hashing out the details of a generic block-device abstraction would be a good first step!

ifnkhan commented 3 years ago

If I may ask: What is the actual reason 'swtpm' isn't part of QEMU itself? Having it as a separate program is certainly a bit of a hassle for VM management.

It is discussed in https://lists.gnu.org/archive/html/qemu-devel/2013-11/msg02524.html

elmarco commented 3 years ago

Since there are some worries about FUSE and kernel interfaces, have you considered NBD instead?

With a qemu-storage-daemon NBD export, swtpm could use all the available backends.

nicowilliams commented 2 years ago

I wonder if using SQLite3 for the backend wouldn't suffice. There are lots of backend plugins for SQLite3...

Using SQLite3 has various other benefits as well, such as the ability to access the TPM's state with the sqlite3(1) command-line utility.

However, it would be a bit of a rototill.
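Just to make the idea concrete, a single table keyed by blob name would cover swtpm's few, variable-size state blobs; a hypothetical sketch using the SQLite3 C API:

```c
#include <sqlite3.h>

/* Store (or replace) one named state blob. SQLite wraps each statement
 * in a transaction by default, so the update is atomic on its own. */
static int put_blob(sqlite3 *db, const char *name, const void *data, int len)
{
    static const char *ddl =
        "CREATE TABLE IF NOT EXISTS tpm_state ("
        "  name TEXT PRIMARY KEY, data BLOB NOT NULL)";
    static const char *upsert =
        "INSERT OR REPLACE INTO tpm_state (name, data) VALUES (?1, ?2)";
    sqlite3_stmt *stmt;
    int rc;

    if (sqlite3_exec(db, ddl, NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    if (sqlite3_prepare_v2(db, upsert, -1, &stmt, NULL) != SQLITE_OK)
        return -1;
    sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
    sqlite3_bind_blob(stmt, 2, data, len, SQLITE_STATIC);
    rc = sqlite3_step(stmt);
    sqlite3_finalize(stmt);
    return rc == SQLITE_DONE ? 0 : -1;
}
```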

stefanberger commented 2 years ago

I think the possibility to add different types of storage backends is there now. I would expect those who need a different storage backend to contribute one. Closing this issue.