kubevirt / containerized-data-importer

Data Import Service for kubernetes, designed with kubevirt in mind.
Apache License 2.0

Feature Request: allow configuration of qemu-img convert's cache type option (-t) #1414

Closed. kenshaw closed this issue 4 months ago.

kenshaw commented 3 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind enhancement

What happened:

The cdi-importer command (from code in pkg/image/qemu.go) hard-codes qemu-img convert's cache option to -t none. On a bare metal cluster using NFSv4.1, this causes qemu-img to wait for verification of the written data. For large imported images, this can take an extraordinary (and frustrating!) amount of time, wildly disproportionate to the expected speed of the underlying networks/nodes/storage (10Gbit/EPYC 32-core/ZFS raid5, capable of 500+ MB/s throughput).
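For reference, the argument construction in pkg/image/qemu.go is roughly along these lines (a simplified sketch with illustrative names, not the exact code):

package image

import "os/exec"

// Simplified sketch of how the importer assembles the qemu-img invocation;
// note that the cache type is fixed to "none" rather than read from any
// configuration. Function and variable names are illustrative only.
func convertToRawCommand(srcJSON, dest string) *exec.Cmd {
    args := []string{
        "convert",
        "-t", "none", // hardcoded cache mode this issue asks to make configurable
        "-p",         // report progress
        "-O", "raw",  // output format
        srcJSON,      // e.g. a json:{...} source descriptor
        dest,         // e.g. /data/disk.img
    }
    return exec.Command("qemu-img", args...)
}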

As per a discussion with Mike Henriksen on the K8s #virtualization channel, the option was hardcoded via this PR. Per Mike, this was done to prevent imported images from not being fully written before the importer pod terminates.

In my (humble) opinion, this should be configurable via the CDI deployment, as the CDI user should be able to choose the risk profile they want their cluster to operate with.

Using my company's bare metal cluster, I ran a couple of tests to demonstrate the speed difference between the various options. From a running importer instance, I grabbed the generated qemu-img command, which was the following:

qemu-img convert -t none -p -O raw 'json: {"file.driver": "http", "file.url": "http://[REDACTED]/cloud-images/focal-server-cloudimg-amd64.img", "file.timeout": 3600}' /data/disk.img

(note that the [REDACTED] URL above is a local HTTP mirror on the same 10Gbit network)

Then, using this pod definition:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cdi-test
spec:
  storageClassName: nfs-client
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: cdi-test
spec:
  restartPolicy: Never
  containers:
  - name: cdi-test
    image: kubevirt/cdi-importer:v1.23.5
    imagePullPolicy: IfNotPresent
    command: ["tail", "-f", "/dev/null"]
    volumeMounts:
    - name: cdi-test
      mountPath: /cdi-test
  volumes:
    - name: cdi-test
      persistentVolumeClaim:
        claimName: cdi-test

I then manually executed the following (via kubectl exec -it cdi-test -- /bin/bash) to show the speed difference between the cache modes:

bash-5.0# time qemu-img convert -t none -p -O raw http://[REDACTED]/cloud-images/focal-server-cloudimg-amd64.img test1.img
    (100.00/100%)

real    8m10.838s
user    0m9.861s
sys 0m2.482s

bash-5.0# time qemu-img convert -t writeback -p -O raw http://[REDACTED]/cloud-images/focal-server-cloudimg-amd64.img test2.img
    (100.00/100%)

real    0m19.724s
user    0m8.558s
sys 0m3.552s

What you expected to happen:

Allow configuration of qemu-img convert's -t option via the CDI deployment.

Ideally, it should have the following options available (copied from here):

cache mode unspecified
In qemu-kvm versions older than v1.2 (e.g. SLES11 SP2), not specifying a cache mode meant that writethrough would be used as the default. Since that version, the various qemu-kvm guest storage interfaces have been fixed to handle writeback or writethrough semantics more correctly, allowing the default caching mode to be switched to writeback. The guest driver for each of ide, scsi, and virtio has it within its power to disable the writeback cache, causing the caching mode used to revert to writethrough. The typical guest's storage drivers will keep the default caching mode of writeback, however.

cache = writethrough
This mode causes qemu-kvm to interact with the disk image file or block device with O_DSYNC semantics, where writes are reported as completed only when the data has been committed to the storage device. The host page cache is used in what can be termed a writethrough caching mode. The guest's virtual storage adapter is informed that there is no writeback cache, so the guest would not need to send down flush commands to manage data integrity. The storage behaves as if there is a writethrough cache.

cache = writeback
This mode causes qemu-kvm to interact with the disk image file or block device with neither O_DSYNC nor O_DIRECT semantics, so the host page cache is used and writes are reported to the guest as completed when placed in the host page cache, and the normal page cache management will handle commitment to the storage device. Additionally, the guest's virtual storage adapter is informed of the writeback cache, so the guest would be expected to send down flush commands as needed to manage data integrity. Analogous to a raid controller with RAM cache.

cache = none
This mode causes qemu-kvm to interact with the disk image file or block device with O_DIRECT semantics, so the host page cache is bypassed and I/O happens directly between the qemu-kvm userspace buffers and the storage device. Because the actual storage device may report a write as completed when it is merely placed in its write queue, the guest's virtual storage adapter is informed that there is a writeback cache, so the guest would be expected to send down flush commands as needed to manage data integrity. Performance-wise, equivalent to direct access to the host's disk.

cache = unsafe
This mode is similar to the cache=writeback mode discussed above. The key aspect of this "unsafe" mode is that all flush commands from the guest are ignored. Using this mode implies that the user has accepted the trade-off of performance over the risk of data loss in the event of a host failure. Useful, for example, during guest install, but not for production workloads.

cache=directsync
This mode causes qemu-kvm to interact with the disk image file or block device with both O_DSYNC and O_DIRECT semantics, where writes are reported as completed only when the data has been committed to the storage device, and when it is also desirable to bypass the host page cache. Like cache=writethrough, it is helpful to guests that do not send flushes when needed. It was the last cache mode added, completing the possible combinations of caching and direct access semantics.
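To condense the modes above into the open(2) semantics they imply (my own summary for reference, not code taken from qemu or CDI):

package main

import "fmt"

// openSemantics captures the two open(2) flags that distinguish the cache
// modes described above: O_DIRECT (bypass the host page cache) and O_DSYNC
// (writes complete only once committed to the storage device).
type openSemantics struct {
    direct bool // O_DIRECT
    dsync  bool // O_DSYNC
}

var cacheModes = map[string]openSemantics{
    "writethrough": {direct: false, dsync: true},
    "writeback":    {direct: false, dsync: false},
    "none":         {direct: true, dsync: false},
    "unsafe":       {direct: false, dsync: false}, // like writeback, but guest flushes are ignored
    "directsync":   {direct: true, dsync: true},
}

func main() {
    for mode, s := range cacheModes {
        fmt.Printf("%-12s O_DIRECT=%-5v O_DSYNC=%v\n", mode, s.direct, s.dsync)
    }
}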

How to reproduce it (as minimally and precisely as possible):

Do a standard installation of CDI using NFS persistent volumes, and time the import of an image.

Anything else we need to know?:

I plan on writing a PR for this in the next few days.

Environment:

awels commented 3 years ago

Hi,

Would you expect to be able to configure this on several levels, or just at one global level? There are basically 3 options:

  1. Global level. If this is set at a global level, then any operation on any storage will use it.
  2. Per storage class level. If this is set, then only operations on that particular storage class will use it.
  3. Per DataVolume level. It's basically a new flag on the DataVolume, and only that particular DataVolume is affected.

Without giving it too much thought, I would expect 1 to be some default value (maybe what it currently is, but settable by the user), and maybe allow a per-storage-class override in CDIConfig. I suspect that if you have several different storage classes, you would want different settings for them. I would think 3 is overkill.

awels commented 3 years ago

We recently (yesterday) merged a PR that allows one to specify the file system overhead at a global and a per-storage-class level, via a structured storage class map in CDIConfig. I think it would make sense to leverage that and extend the struct with a cache mode field, which could then be passed to the various import/clone/upload pods on a per-storage-class basis, as sketched below.
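Hypothetically, mirroring the shape of that per-storage-class map, the extension could look something like this (field and type names are made up for illustration, not an agreed design):

package cdiapi

// CDIConfigSpec sketch: a hypothetical importerCacheMode field alongside the
// existing global / per-storage-class pattern used for filesystem overhead.
type CDIConfigSpec struct {
    // existing fields elided...

    // ImporterCacheMode would set the qemu-img convert -t value used by
    // import/clone/upload pods ("none", "writeback", "writethrough",
    // "directsync", "unsafe").
    ImporterCacheMode *ImporterCacheMode `json:"importerCacheMode,omitempty"`
}

// ImporterCacheMode holds a global default plus per-storage-class overrides.
type ImporterCacheMode struct {
    Global       string            `json:"global,omitempty"`
    StorageClass map[string]string `json:"storageClass,omitempty"`
}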

maya-r commented 3 years ago

I'd like to consider whether this can be a default rather than a tunable. Syncing once at the end should be safe and faster than doing it multiple times in the process of the write, which is what we're probably doing right now.

We should probably explicitly ping @aglitke to read this thread, since he changed the behaviour, in case he hasn't seen it yet.

maya-r commented 3 years ago

So, in the case of NFS it is always safe not to synchronize the cache at all, not even with a sync command at the end. NFS guarantees close-to-open consistency, so any opens done after dd closes the fd (at exit) will see the new content. I don't see why other remote file systems wouldn't promise at least that.

A regular file system on top of a remote block device (XFS on Ceph RBD, iSCSI, etc.) needs some extra help, because the file system layer doesn't know that it is remote. While a sync only flushes all pending writes, that is sufficient here because we are purely writing.

Since this is purely an RWX+Block problem, I thought about making it conditional, but a single sync at the end is cheap enough that it's probably not worth the hassle of making everything conditional.
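Concretely, the "one sync at the end" idea could look roughly like this (a sketch of the approach, not CDI's actual code):

package importer

import (
    "os"
    "os/exec"
)

// convertWithFinalSync runs qemu-img with the faster writeback cache mode
// and then fsyncs the destination once, so all pending writes are flushed
// before the importer pod terminates.
func convertWithFinalSync(srcJSON, dest string) error {
    cmd := exec.Command("qemu-img", "convert", "-t", "writeback", "-p", "-O", "raw", srcJSON, dest)
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Run(); err != nil {
        return err
    }

    f, err := os.OpenFile(dest, os.O_RDWR, 0)
    if err != nil {
        return err
    }
    defer f.Close()
    return f.Sync() // single flush at the end of the import
}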

kubevirt-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

maya-r commented 3 years ago

/lifecycle frozen

maya-r commented 2 years ago

Do you still have performance issues after #1919?

kenshaw commented 2 years ago

I have not tested this, and am not likely to in the next few days. As per the original issue (and as #1919 demonstrates), I still think this should be a user-configurable option, as different storage systems are going to have significantly different performance characteristics and require separate configuration/tuning on a per-cluster basis.

awels commented 1 year ago

Created JIRA card https://issues.redhat.com/browse/CNV-26195 to handle this issue.

aglitke commented 4 months ago

We don't want to expose the cache mode API to end users because understanding the implications of this setting is not trivial. We have since switched the cache mode to writeback, and have a PR open to fall back to cache=none only if we detect that it is needed. At this time I think the combination of these two things should be enough. I am going to close this issue since it is pretty old. Please reopen if you would like to continue the discussion.
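For context, the fallback detection can be as simple as probing whether the target filesystem accepts O_DIRECT opens, roughly like this (a sketch of the general technique, not necessarily the exact check in the open PR):

package importer

import (
    "os"
    "path/filepath"
    "syscall"
)

// supportsDirectIO probes whether the filesystem holding dir accepts
// O_DIRECT opens; if it does not (e.g. tmpfs), the importer would fall
// back from cache=none to a cached mode.
func supportsDirectIO(dir string) bool {
    probe := filepath.Join(dir, ".directio-probe")
    f, err := os.OpenFile(probe, os.O_CREATE|os.O_WRONLY|syscall.O_DIRECT, 0o600)
    if err != nil {
        return false
    }
    f.Close()
    os.Remove(probe)
    return true
}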