Hey! Thanks for reporting this
This sounds similar to previous issues we had like https://github.com/kubevirt/containerized-data-importer/issues/2358, but the changes that tackle those did not get backported to your CDI version (v1.54.3).
The author of the PR that fixes this (https://github.com/kubevirt/containerized-data-importer/pull/2701) mentions:
On my machine this reduces a Ubuntu cloud image import from 90 minutes to 1 minute.
If you want to try that change, you can install CDI 1.57 RC
Thanks for the info, I'll try that out asap.
If the change doesn't work, can you let me know the Ubuntu image URL that you're trying to download and the exact version of nbdkit?
I'm actually running into a lot of issues attempting to install 1.57 RC; the operator fails to start with this error:
cdi-apiserver-signer-bundle not found
Still troubleshooting.
@akalenyu @rwmjones
After making the requested changes I'm still seeing that importing takes 20+ minutes.
I upgraded to the latest KubeVirt version and the latest CDI, and tried with:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "debian"
  labels:
    app: containerized-data-importer
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.import.endpoint: "https://cloud.debian.org/images/cloud/bookworm/daily/latest/debian-12-generic-amd64-daily.qcow2"
spec:
  storageClassName: longhorn
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 120Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "jammy"
  labels:
    app: containerized-data-importer
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.import.endpoint: "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img"
spec:
  storageClassName: local-path
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 120Gi
Tried on longhorn and local-path, same results.
Logs:
I0721 08:35:51.752839 1 importer.go:103] Starting importer
I0721 08:35:51.753025 1 importer.go:172] begin import process
I0721 08:35:52.123544 1 data-processor.go:356] Calculating available size
I0721 08:35:52.127447 1 data-processor.go:368] Checking out file system volume size.
I0721 08:35:52.127553 1 data-processor.go:376] Request image size not empty.
I0721 08:35:52.127561 1 data-processor.go:381] Target size 126210801664.
I0721 08:35:52.127803 1 nbdkit.go:310] Waiting for nbdkit PID.
I0721 08:35:52.628451 1 nbdkit.go:331] nbdkit ready.
I0721 08:35:52.628475 1 data-processor.go:255] New phase: Convert
I0721 08:35:52.628485 1 data-processor.go:261] Validating image
I0721 08:35:52.878701 1 nbdkit.go:293] Log line from nbdkit: nbdkit: curl[1]: error: readahead: warning: underlying plugin does not support NBD_CMD_CACHE or PARALLEL thread model, so the filter won't do anything
E0721 08:35:53.394136 1 prlimit.go:155] failed to kill the process; os: process already finished
I0721 08:35:53.396582 1 qemu.go:258] 0.00
I0721 08:35:53.396986 1 nbdkit.go:293] Log line from nbdkit: nbdkit: curl[2]: error: readahead: warning: underlying plugin does not support NBD_CMD_CACHE or PARALLEL thread model, so the filter won't do anything
I0721 08:36:08.120708 1 qemu.go:258] 1.00
I0721 08:36:21.590231 1 qemu.go:258] 2.01
I'll keep playing around with things and see if I can get it to work or at least break in a different way
Can you run nbdkit --version?
Sure,
on the host machine:
> nbdkit --version
-bash: nbdkit: command not found
in the container:
> nbdkit --version
nbdkit 1.34.1 (nbdkit-1.34.1-1.el9)
The fundamental problem is still qemu-img convert lacking support for NBD multi-conn, which is not something we can fix in nbdkit.
But anyway, a while back I had this patch to nbdkit to enable the parallel curl plugin. It had some unfortunate effects because of fairness, which I describe in the thread there. I looked at it again and I think we should give it another go, so I pushed this commit:
https://gitlab.com/nbdkit/nbdkit/-/commit/f2163754860d041c4cb12dace90591c280eccae8
With this patch and upping the default number of connections (connections=8) it goes from 20 mins to about 4 mins, which, while still not great (see above about qemu), is probably the best we can do until qemu gets fixed.
I've finally worked out what's going on here. About 3/4 of the time is spent doing 302 redirects from the original URL located on cloud.debian.org to the mirror (which for me happens to be saimei.ftp.acc.umu.se, but I guess that depends on where you are located in the world).
The way curl works is that it doesn't cache 302 redirects, so basically on every single block that we request we're doing a whole new redirect.
If you "manually" resolve the redirect first, then for me it is ~4 times faster, and near the speed of plain wget.
It turns out we've actually hit this bug before in a slightly different context:
https://bugzilla.redhat.com/show_bug.cgi?id=2013000
I proposed a patch for that bug, adding an effective-url=true flag which would do the pre-resolving once inside the curl plugin. There were various problems with that patch, which you can read about here:
https://listman.redhat.com/archives/libguestfs/2021-October/026882.html
I wonder if CDI can do a simple HEAD request to the site that is submitted in order to resolve redirects, and then use the resolved URI when opening nbdkit? It should be fairly trivial if CDI has access to the curl library. I can suggest some code for this if you like. It would also be useful to give better errors for things like 404s.
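As an illustration only (not CDI's actual code, and using Go's net/http rather than libcurl; the function name below is made up), here is a minimal sketch of that pre-resolve step, roughly equivalent to curl -LIs -o /dev/null -w %{url_effective}:

package main

import (
	"fmt"
	"net/http"
)

// resolveEffectiveURL follows redirects once up front and returns the final URL.
func resolveEffectiveURL(rawURL string) (string, error) {
	resp, err := http.Head(rawURL) // net/http follows redirects by default
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		// Fail fast with a clear error for 404s etc. instead of failing later in the import.
		return "", fmt.Errorf("HEAD %s: unexpected status %s", rawURL, resp.Status)
	}
	// resp.Request is the final request after redirects, so its URL is the resolved endpoint.
	return resp.Request.URL.String(), nil
}

func main() {
	resolved, err := resolveEffectiveURL("https://cloud.debian.org/images/cloud/bookworm/daily/latest/debian-12-generic-amd64-daily.qcow2")
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println(resolved) // pass this URL to nbdkit's curl plugin instead of the original
}

One caveat with a sketch like this: some servers answer HEAD differently from GET, so a ranged GET might be a more faithful probe.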
Interesting! I see some old PRs where we pulled the plug on CDI doing the resolves: https://github.com/kubevirt/containerized-data-importer/pull/1981
@awels may have more context
The two issues are overlapping but slightly distinct. In BZ 2013000 / #1981, the problem was with Fedora redirects being flaky, sometimes resolving to a 404 mirror. We decided in the end to solve that by using the nbdkit-retry-request-filter, which basically retries single requests until we get a good mirror. In that case pre-resolving the URL was thought to be bad (because we might pre-resolve it to a 404, and then it wouldn't work at all), and so #1981 was not merged. In this case the Debian mirrors all appear to be good (i.e. they don't resolve to 404s that I've seen), but the issue is that we're doing the resolves on every block (curl internally doesn't cache them) and they happen to be especially slow on the central Debian server, so pre-resolving would be a good thing.
This also affects Debian Live ISOs, Ubuntu cloud and Live ISOs, and Windows ISOs in my testing.
Examples here: https://github.com/small-hack/argocd/tree/main/kubevirt/disks
Source URLs used:
----
Debian:
  12: "https://cloud.debian.org/images/cloud/bookworm/daily/latest/debian-12-generic-amd64-daily.qcow2"
Ubuntu:
  focal: "https://cloud-images.ubuntu.com/focal/current/focal-server-cloudimg-amd64.img"
  jammy: "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img"
  lunar: "https://cloud-images.ubuntu.com/lunar/current/lunar-server-cloudimg-amd64.img"
Windows:
  10: "https://www.itechtics.com/?dl_id=173"
  11: "https://www.itechtics.com/?dl_id=168"
Happy to apply any patches and/or try new URLs to provide more data.
I can also grant access to a test cluster if that would help
IIUC it could be helpful to use a resolved URL to try to verify this indeed helps in these cases, so for the Windows images for example, you could try inputting the result of this into CDI:
curl -LIs -o /dev/null -w %{url_effective} https://www.itechtics.com/?dl_id=173
Yes please stick any URLs you find which are problematic into a comment here.
The Debian one, as discussed. You can just replace the URL in the configuration file with one which has been resolved (curl -I https://cloud.debian.org/images/cloud/bookworm/daily/latest/debian-12-generic-amd64-daily.qcow2), which should mostly fix the problem.
The Ubuntu ones are pretty weird. For wget I get 2m13. For nbdkit the numbers vary wildly from a best case of 2m5 (faster than wget) to 8 minutes or more. It is not redirecting, so the problem is distinct from the Debian URL. I'm still looking at this and the Windows ones.
The problem with the Windows ones is redirecting, same as Debian. @akalenyu's suggestion is right there.
The Ubuntu download issue is now completely unreproducible. nbdkit is about 1-2 seconds slower than wget for me. Edit: nbdkit is faster than wget for me (1m21 for wget vs 1m16 for nbdkit, both numbers fairly stable).
I'm going to say at this point it might be some kind of load balancing issue their end. I'll run the same command tomorrow morning to see if I get a different server and can reproduce it.
Thanks so much for all your hard work on this @rwmjones @akalenyu
I'll be testing the resolved URLs today; I tried last night but ended up hitting the pod limit on our test node.
nbdkit is even faster this morning ...
$ time ./nbdkit -r -U - curl https://cloud-images.ubuntu.com/lunar/current/lunar-server-cloudimg-amd64.img --run 'nbdcopy -p $uri null:'
100% [****************************************]
real 1m10.973s
user 0m1.613s
sys 0m1.976s
I should note that I'm running a slightly modified nbdkit. The curl plugin uses a parallel thread model which seems to make it a bit faster. And I gave in and wrote a qcow2 filter.
I'm able to achieve similar numbers to this if i download and install nbdkit on the host machine and run the same command. However, when the process runs in a container after being triggered by creating a PVC, the creation time is 12 minutes.
This is faster than the 20+ minutes it was prior to using a resolved URL for the CDI, but it's still nearly 10x longer than running the command directly on the host, so the declarative workflow still isn't viable unfortunately
Time: 12:03
url: "https://laotzu.ftp.acc.umu.se/images/cloud/bookworm/latest/debian-12-generic-amd64.qcow2"
Method:
kubectl apply -f debian12-pvc.yaml
debian12-pvc.yaml:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: debian12-pvc
  namespace: vm0
  labels:
    app: containerized-data-importer
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.import.endpoint: "https://laotzu.ftp.acc.umu.se/images/cloud/bookworm/latest/debian-12-generic-amd64.qcow2"
spec:
  storageClassName: local-path
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
Time: 1:18
url: "https://laotzu.ftp.acc.umu.se/images/cloud/bookworm/latest/debian-12-generic-amd64.qcow2"
Version:
nbdkit --version
nbdkit 1.32.5
Method:
time nbdkit -r -U - curl https://laotzu.ftp.acc.umu.se/images/cloud/bookworm/latest/debian-12-generic-amd64.qcow2 --run 'nbdcopy --no-extents -p "$uri" null:'
100% [****************************************]
real 1m18.763s
user 0m0.476s
sys 0m0.515s
Time: 11:52
url: "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img""
Method:
kubectl apply -f jammy-pvc.yaml
jammy-pvc.yaml:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jammy-pvc
  namespace: vm0
  labels:
    app: containerized-data-importer
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.import.endpoint: "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img"
spec:
  storageClassName: local-path
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
Time: 2:21
url: "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img"
Version:
nbdkit --version
nbdkit 1.32.5
Method:
time nbdkit -r -U - curl https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img --run 'nbdcopy --no-extents -p "$uri" null:'
100% [****************************************]
real 2m21.936s
user 0m5.608s
sys 0m1.067s
CDI is using a fairly old version of nbdkit. Looks like 1.34.1-1; we should look into upgrading to a recent version.
Does CDI run nbdkit inside a VM, or is it running straight in a container on the host? I don't think running inside a VM should be a problem (there's no virt involved with nbdkit), but be good to understand exactly where it is running.
nbdkit 1.30's curl plugin lacks at least:
* Connection pool support
* Multi-conn support
* Parallel thread model (not in 1.34)
My tests were run on a bare-metal install of K3s, so the only virtualization layer is the container.
CDI runs inside a container on the host. So I just double checked, it looks like 1.57 is running 1.34 nbdkit, and 1.56 is running 1.30. And I think we need to add a plugin to 1.34 that is not needed in 1.30.
@cloudymax could you try 1.56 instead of 1.57-rc1 to see if it works better?
@cloudymax I think the Ubuntu image case is really different from the Debian & Windows image cases (caused by redirects). Probably we need a new bug for that.
Are you able to reliably reproduce slow download of the Ubuntu image with nbdkit >= 1.32 (ideally 1.34)? I'm still getting faster than wget performance this afternoon.
$ time nbdkit -r -U - curl https://cloud-images.ubuntu.com/lunar/current/lunar-server-cloudimg-amd64.img --run 'nbdcopy -p $uri null:'
As everyone else is in different places around the world you might hit a different Ubuntu mirror and get the super-slow case.
If you get the super-slow case, then add these flags and capture the output (which is likely to be very large):
-f -v -D curl.verbose=1
I just tried with CDI 1.56 and it is still super slow there....
The comparison is not nbdkit vs. wget as a race to download the image fastest at the command line on the host. Even if it was, nbdkit taking +/- 1 minute is still far slower than the 7 seconds it takes for wget to download the image + 25 seconds for virtctl to create the PVC and then re-upload the image.
time nbdkit -r -U - curl https://cloud-images.ubuntu.com/lunar/current/lunar-server-cloudimg-amd64.img --run 'nbdcopy -p $uri null:'
100% [****************************************]
real 2m48.143s
user 0m6.562s
sys 0m1.317s
export VOLUME_NAME=jammy-pvc
export NAMESPACE="default"
export STORAGE_CLASS="local-path"
export ACCESS_MODE="ReadWriteOnce"
export IMAGE_URL="https://cloud-images.ubuntu.com/lunar/current/lunar-server-cloudimg-amd64.img"
export IMAGE_PATH="lunar-server-cloudimg-amd64.img"
export VOLUME_TYPE=pvc
export SIZE=5Gi
export PROXY_ADDRESS=$(kubectl get svc cdi-uploadproxy-loadbalancer -n cdi -o json | jq --raw-output '.spec.clusterIP')
# $(kubectl get svc cdi-uploadproxy -n cdi -o json | jq --raw-output
time wget -O $IMAGE_PATH $IMAGE_URL && \
time virtctl image-upload $VOLUME_TYPE $VOLUME_NAME \
--size=$SIZE \
--image-path=$IMAGE_PATH \
--uploadproxy-url=https://$PROXY_ADDRESS:443 \
--namespace=$NAMESPACE \
--storage-class=$STORAGE_CLASS \
--access-mode=$ACCESS_MODE \
--insecure --force-bind
--2023-07-25 16:16:45-- https://cloud-images.ubuntu.com/lunar/current/lunar-server-cloudimg-amd64.img
Resolving cloud-images.ubuntu.com (cloud-images.ubuntu.com)... 2620:2d:4000:1::17, 2620:2d:4000:1::1a, 185.125.190.40, ...
Connecting to cloud-images.ubuntu.com (cloud-images.ubuntu.com)|2620:2d:4000:1::17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 816316416 (778M) [application/octet-stream]
Saving to: ‘lunar-server-cloudimg-amd64.img’
lunar-server-cloudimg-a 100%[===============================>] 778.50M 103MB/s in 7.9s
2023-07-25 16:16:53 (99.1 MB/s) - ‘lunar-server-cloudimg-amd64.img’ saved [816316416/816316416]
real 0m7.993s
user 0m0.404s
sys 0m0.893
PVC default/jammy-pvc not found
PersistentVolumeClaim default/jammy-pvc created
Waiting for PVC debian12-pvc upload pod to be ready...
Pod now ready
Uploading data to https://10.43.88.176:443
778.50 MiB / 778.50 MiB [==========================================================] 100.00% 1s
Uploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress
Processing completed successfully
Uploading lunar-server-cloudimg-amd64.img completed successfully
real 0m25.071s
user 0m0.203s
sys 0m0.291s
This is all irrelevant though because the actual issue I have is that time-to-creation of a new PVC via the CDI when using a declarative workflow is consistently over 10 minutes regardless of image size, URL, disk speed, download bandwidth, or CPU core count.
Something is wrong within the containerized CDI process that is not present on the command line.
real 2m48.143s
Which version of nbdkit is this?
On the host the version is nbdkit 1.32.5, which takes 1-2 minutes; it is the latest version of nbdkit available through Debian apt.
In the container it is nbdkit 1.34.1 (nbdkit-1.34.1-1.el9), which is the version provided by CDI 1.57-rc1. That one takes 10+ minutes.
Can you please test in a cluster using the actual CDI declarative flow and not on the command line?
This is what is started inside the CDI pod:
Start nbdkit with: ['--foreground' '--readonly' '--exit-with-parent' '-U' '/tmp/nbdkit.sock' '--pidfile' '/tmp/nbdkit.pid' '--filter=readahead' '--filter=retry' '-r' 'curl' 'header=User-Agent: cdi-nbdkit-importer' 'retry-exponential=no' 'url=https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img']
Thanks for that. I believe my local tests are limited by my network, so I grabbed a Beaker machine which I can use instead which has a much better internet connection (and I think is physically located in the US). I can reproduce the issue much more easily now.
I eliminated nbdkit from the equation and ran the Go http client to directly get the data from inside the pod, and it was still slow until I increased the amount of memory available to the pod. So I used this YAML to create the DataVolume:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: jammy-pvc
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
spec:
  source:
    http:
      # certConfigMap with an empty cert is to work around this issue:
      # https://github.com/kubevirt/containerized-data-importer/issues/2358
      certConfigMap: empty-cert
      url: https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: empty-cert
data:
  ca.pem: ""
And I increased the amount of memory allocated to the CDI importer pod by modifying the CDI CR to look like this:
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi
spec:
  config:
    featureGates:
      - HonorWaitForFirstConsumer
    podResourceRequirements:
      limits:
        cpu: "4"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 250Mi
  imagePullPolicy: IfNotPresent
  infra:
    nodeSelector:
      kubernetes.io/os: linux
    tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
  workload:
    nodeSelector:
      kubernetes.io/os: linux
In particular note the podResourceRequirements, which increased the amount of memory and CPU available. I went a little overboard, it doesn't require that much, but apparently the defaults are just not enough. This allowed me to import using the Go http client into the scratch space, and convert from there. It took maybe a minute, if that, on my local machine which has a 1G fiber connection. The emptyCert is to trick CDI into thinking it cannot use nbdkit and fall back to the scratch space method, which is a straight http download and then convert.
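For illustration, a minimal Go sketch of that scratch-space flow under stated assumptions (the paths are invented and this is not the real importer code): a plain HTTP download into scratch space, followed by qemu-img convert to the target volume.

package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"os/exec"
)

func main() {
	const (
		srcURL  = "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img"
		scratch = "/scratch/disk.img" // scratch-space download target (assumed path)
		target  = "/data/disk.img"    // final raw image on the PVC (assumed path)
	)

	// Straight HTTP download into scratch space.
	resp, err := http.Get(srcURL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create(scratch)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	f.Close()

	// Convert the downloaded image to raw on the target volume.
	cmd := exec.Command("qemu-img", "convert", "-O", "raw", scratch, target)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}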
Note that the client CDI is using is qemu-img, which is likely the problem in this scenario... 8.0.0-4 is the current version for 1.57
Quick question: Does the container environment have broken IPv6? The Ubuntu URL resolves to IPv6 then IPv4, and curl tries the IPv6 addresses first. In the Beaker test environment IPv6 is broken and this causes a delay. I added a feature upstream to force IPv4 which improves performance.
Anyway, I'm on holiday today, will take another look at this tomorrow.
I can't provide an authoritative answer here. I can say that the host machine has functional IPv6 (it's the default for Hetzner instances). I'll try to get some more info on this.
I reimplemented the plugin using curl "multi" handle support, and the results are way better: https://listman.redhat.com/archives/libguestfs/2023-July/032167.html (see patch 2 for some test timings)
I was able to replicate this on my side and get 1:02 for the PVC creation, which gets me unblocked on my end for now. Thanks a bunch!
awesome work!
nbdkit changes are upstream in 1.35.8, mainly this commit: https://gitlab.com/nbdkit/nbdkit/-/commit/a74b289ee15a7c75dceb8a96403f1fa8ebd72e88
I did some tests with the URLs under discussion and we are better than wget for the Debian and Ubuntu cases. The Windows URL is weird, more below.
https://cloud-images.ubuntu.com/lunar/current/lunar-server-cloudimg-amd64.img (HTTP/1.1)
nbdkit before change: 180s
nbdkit after change, connections=16: 42.4s
nbdkit after change, connections=32: 21.6s
wget: 20.2s
nbdkit after change, connections=64: 17.2s
https://gemmei.ftp.acc.umu.se/images/cloud/bookworm/daily/latest/debian-12-generic-amd64-daily.qcow2 (HTTP/1.1)
nbdkit before change: 100.6s
nbdkit after change, connections=16: 23.7s
wget: 20.9s
nbdkit after change, connections=32: 14.0s
nbdkit after change, connections=64: 9.1s
https://www.itechtics.com/?dl_id=173
This URL redirects twice, once to windowstan.com which is proxied by Cloudflare, and a second time to software.download.prss.microsoft.com. Unfortunately the second redirect is not an actual HTTP redirect, but a message in the body of the 502 error message that Cloudflare frequently sends if you attempt any automated download. So I did these tests using the microsoft.com URL directly. The Microsoft site uses HTTP/2.
nbdkit before change if you use the windowstan.com URL: over 80 minutes, and very unreliable
nbdkit before change if you use the microsoft.com URL: 3m03
nbdkit after change, connections=1: 3m24
nbdkit after change, connections=2: 3m05
nbdkit after change, connections=16: 3m04
wget: 1m04
Some tips:
* Use ipresolve=v4 to force IPv4, which will save a bit of time.
* The connections parameter controls the number of HTTP connections to the remote server. Reducing it will put less load on / be more friendly to the server; increasing it will make things faster. Default is 16.
Oh and we found and fixed an actual bug in curl: https://github.com/curl/curl/pull/11557
Great, unfortunately the latest version available in CentOS 9 Stream seems to be 1.34.1, and since our container is built using a CentOS 9 Stream image we can't grab 1.35.8 yet.
Totally understood. If you would like to test this, I could do a scratch build in CentOS Stream -- if you have a way to consume that.
For getting the change into RHEL in general we'll need a BZ. I would like to have the change sit in Fedora for a while to get some testing, as it's an almost complete rewrite of the plugin.
We have some automation that essentially just pulls the latest released RPMs and updates our build to use that. So we don't really have a good way to consume scratch builds like that. So for now I think we will use scratch space, and once we can consume the fix, we can re-enable nbdkit for those flows where it makes sense.
Do you (or can you) pull from C9S repos? I could provide an update there very quickly. It's getting it into RHEL which is the hard part.
We normally pull from http://mirror.stream.centos.org/9-stream/
Ok, let me see what I can do today/tomorrow.
There should be something to test here in about an hour: https://kojihub.stream.centos.org/koji/taskinfo?taskID=2603284
I'm not sure exactly if those builds go directly into http://mirror.stream.centos.org/9-stream/ or if they have to go through RHEL OSCI processes first. I guess in the best case there will be at least some kind of compose delay which probably takes a day or two.
Cool, I will check in a few days then. In the meantime switching to scratch space for 1.57, and once that is available will use nbdkit again in main, and likely 1.58
So I grabbed the nbdkit version from there and I am trying it, and I think I am missing some plugins or filters. Which plugins/filters should I use?
I think only nbdkit-retry-request-filter is relevant. It will retry single requests if the server is unavailable.
Note that performance is still going to suck unless you "pre-resolve" the URLs. Suggested above was a command to do this:
curl -LIs -o /dev/null -w %{url_effective} URL
(or equivalent libcurl APIs).
Edit: So re-reading the question, I wonder if you didn't have the separate nbdkit-curl-plugin package installed? Since curl is an extra dependency (i.e. the libcurl library) we don't include the curl plugin in the basic-plugins package.
Edit #2: It looks as if the new version isn't present on https://mirror.stream.centos.org/9-stream/AppStream/x86_64/os/Packages/ yet. The new package is 1.34.2-1.el9
I grabbed the following rpms since we were using them before:
https://kojihub.stream.centos.org/kojifiles/packages/nbdkit/1.34.2/1.el9/x86_64/nbdkit-basic-filters-1.34.2-1.el9.x86_64.rpm
https://kojihub.stream.centos.org/kojifiles/packages/nbdkit/1.34.2/1.el9/x86_64/nbdkit-curl-plugin-1.34.2-1.el9.x86_64.rpm
https://kojihub.stream.centos.org/kojifiles/packages/nbdkit/1.34.2/1.el9/x86_64/nbdkit-gzip-filter-1.34.2-1.el9.x86_64.rpm
https://kojihub.stream.centos.org/kojifiles/packages/nbdkit/1.34.2/1.el9/x86_64/nbdkit-server-1.34.2-1.el9.x86_64.rpm
I don't think we are using the gzip filter anymore, but just in case. Pre-resolving made it a little faster, but not significantly. If I curl the image it takes about 7 seconds locally. I still think the problem is likely in the qemu-img we are using.
This is the nbdkit command running in the pod:
I0804 19:36:36.403006 1 nbdkit.go:251] Start nbdkit with: ['--foreground' '--readonly' '--exit-with-parent' '-U' '/tmp/nbdkit.sock' '--pidfile' '/tmp/nbdkit.pid' '--filter=cache' '--filter=readahead' '--filter=retry' '-r' 'curl' 'header=User-Agent: cdi-nbdkit-importer' 'retry-exponential=no' 'url=https://southfront.mm.fcix.net/fedora/linux/releases/38/Cloud/x86_64/images/Fedora-Cloud-Base-38-1.6.x86_64.qcow2']
What happened: There is a massive performance variance between CDI image import methods, which makes the declarative workflow for PVCs/DataVolumes unusable. I'm REALLY trying to make KubeVirt work as a replacement for QEMU wrapped in bash, but the slowness of drive creation is making it impossible to achieve buy-in from others.
You can find my full setup here: https://github.com/cloudymax/argocd-nvidia-lab/tree/main/kubevirt
Disk configs are specifically here: https://github.com/small-hack/argocd/tree/main/kubevirt/disks
CDI configuration is here: https://github.com/small-hack/argocd/tree/main/kubevirt/cdi
This has been tested across multiple hardware types:
Hetzner AX102 instance with R9 7950X3D + NVMe (RAID0)
Dell XPS with i7-11700 + NVMe (no RAID)
Hetzner instance with i7-7700 + SATA6 SSDs (RAID0)
Has been tested on K3s (bare-metal and in a KVM VPS)
Has been tested with local-path and Longhorn storage classes
The CDI has been tested with nodePort and LoadBalancer type
I have increased the resources available to the CDI importer but it had no effect, the process will not consume more resources.
Method: wget + virtctl, Storage Class: local-path, Time: 22 seconds
Method: wget + virtctl, Storage Class: longhorn, Time: 56 seconds
Command used:
When using the declarative method the process takes over 20 minutes
Logs attached: Explore-logs-2023-07-15 10 36 08.txt
What you expected to happen:
I would expect this process to be MUCH faster. For example, when performing this task using just plain QEMU + Bash it takes only a couple seconds, and the VM is booted and ready to login in ~30 seconds total on the same hardware mentioned above.
Source: https://github.com/cloudymax/Scrap-Metal/blob/main/virtual-machines/vm.sh
Environment:
CDI version (kubectl get deployments cdi-deployment -o yaml): v1.54.3
Kubernetes version (kubectl version): v1.27.3+k3s1
Kernel version (uname -a): 6.1 and 5.15