Is it related to https://github.com/kubevirt/containerized-data-importer/issues/2836?
I can't say, because this error happens during the first import of the disk from quay into the cluster. #2836 happens during the clone of the already-imported disk (the one this bug is about) to a new DataVolume.
Did anything change in the push process of the common template images? Usually the disk image file is 107:107, but in this case I see otherwise. BTW, why do you need the other files in the container image (boot/, dev/, etc.)?
The images are created the same way as in the past:
FROM kubevirt/container-disk-v1alpha
ADD centos-stream-9.qcow2 /disk
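For context, the containerDisk is then typically built and pushed with something like this (a sketch; the registry path is only a placeholder):
podman build -t quay.io/<org>/centos-stream9-container-disk:latest .
podman push quay.io/<org>/centos-stream9-container-disk:latest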
Are you able to reproduce this with the standard containerDisk images (i.e. quay.io/containerdisks/centos-stream:9)? This definitely seems like an issue with this specific containerDisk.
Yes: CentOS Stream 8, CentOS Stream 9, Ubuntu, RHCOS (all four tested from quay.io/containerdisks), and Windows (our custom image) are not working (same error), while Fedora (both our custom image and the one from quay.io/containerdisks) and openSUSE are working. The same is happening on Azure clusters.
@aglitke, @akalenyu I just tested this on OCP 4.15 and it is happening there too.
I noticed that the download speed inside the importer pod is reduced:
Check that the cluster is fine:
oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 96.4M 0 0:00:09 0:00:09 --:--:-- 101M
Run the same download at the same time inside an importer pod:
oc exec -i -n openshift-virtualization-os-images importer-prime-9d4f16d6-6d2b-48e4-9f9c-2d1f33159edc -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 3920k 0 0:03:56 0:03:56 --:--:-- 94.0M
On OCP 4.14.0-0.nightly-2023-08-11-055332, CNV 4.14.0-1744 provided by Red Hat.
@akalenyu Can you reproduce the numbers and do you have an idea why the download speed might be degraded?
I think I am hitting this trying to import a Fedora CoreOS kubevirt image from a registry. I tried it in a GCP 4.14 cluster and it fails (let me know if you want the logs). On a Bare Metal 4.13 cluster it succeeds. I know this is apples and oranges, but that last datapoint at least lets me know my containerdisk in the registry is good.
Here's the VM definition I'm using:
I'll follow along in this issue to see what the resolution is.
Thanks for jumping into this issue! It would be great to see if this is indeed the same by following the importer pod logs, similarly to how the author did:
[ksimon:12:53:22~/Stažené]$ oc logs -f importer-centos-stream9-datavolume-original
...
I0810 10:53:18.868180 1 transport.go:152] File 'disk/centos-stream-9.qcow2' found in the layer
I0810 10:53:18.868397 1 util.go:191] Writing data...
E0810 11:41:35.344127 1 util.go:193] Unable to write file from dataReader: unexpected EOF
E0810 11:41:35.409811 1 transport.go:161] Error copying file: unable to write to file: unexpected EOF
The PVC yamls corresponding to the import operation would also be helpful:
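For example, something like this should capture them (the namespace here is an assumption based on the importer pod name above):
oc get dv,pvc -n openshift-virtualization-os-images -o yaml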
Can't reproduce these numbers on 4.14.0-1763:
$ oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 255M 0 0:00:03 0:00:03 --:--:-- 255M
$ oc exec -i -n default importer-prime-9614b3bd-7e71-4306-96b2-c042efc26929 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 98.3M 0 0:00:09 0:00:09 --:--:-- 102M
In general, it would be surprising to see a slowdown with mirrors, as we have recently merged and released https://github.com/kubevirt/containerized-data-importer/pull/2841 which makes us download the entire image and only then convert it.
Is it possible the pods you picked are scheduled to different nodes?
Thanks for jumping into this issue! It would be great to see if this is indeed the same by following the importer pod logs,
The logs are here: importer-fcos-data-volume-importer.txt
Unfortunately, I no longer have the PVC yamls, as the cluster got taken down on Friday.
Can't reproduce these numbers on 4.14.0-1763:
@akalenyu Don't your curl download speeds show a performance degradation by a factor of 2.5?
@akalenyu The same behaviour Dominik is observing happens when the DV has pullMethod: node.
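For reference, a minimal sketch of such a DV (name, URL, and storage size are placeholders, not the exact manifest used here):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-node-pull
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream:9
      pullMethod: node
  storage:
    resources:
      requests:
        storage: 30Gi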
Is it possible the pods you picked are scheduled to different nodes?
I tried it, and the importer pod is slower than the prometheus pod, even though the pods are on the same node:
[ksimon:10:14:35~]$ oc exec -i -n openshift-monitoring prometheus-k8s-1 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
10 906M 10 91.2M 0 0 70.2M 0 0:00:12 0:00:01 0:00:11 70.1M
20 906M 20 185M 0 0 80.4M 0 0:00:11 0:00:02 0:00:09 80.4M
vs
[ksimon:10:15:29~]$ oc exec -i importer-prime-eda481be-064c-4a9f-967d-752cb4b3c2fb -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
Defaulted container "importer" out of: importer, server, init (init)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 906M 0 1366k 0 0 503k 0 0:30:44 0:00:02 0:30:42 503k
Thanks for jumping into this issue! It would be great to see if this is indeed the same by following the importer pod logs,
The logs are here: importer-fcos-data-volume-importer.txt
Unfortunately, I no longer have the PVC yamls, as the cluster got taken down on Friday.
Yep that's the same issue
Can't reproduce these numbers on 4.14.0-1763:
@akalenyu Don't your curl download speeds show a performance degradation by a factor of 2.5?
Maybe this depends on which mirror centos.org redirects to:
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 63.9M 0 0:00:14 0:00:14 --:--:-- 36.2M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 156M 0 0:00:05 0:00:05 --:--:-- 162M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 284M 0 0:00:03 0:00:03 --:--:-- 284M
@akalenyu The same behaviour Dominik is observing happens when the DV has pullMethod: node.
pullMethod: node uses the container runtime on the node to pull images, similarly to how it would do that for regular pod images. Are those super slow too? You can try a simple crictl pull on the node.
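For example (a sketch; the node name is a placeholder):
oc debug node/<node-name>
chroot /host
crictl pull quay.io/containerdisks/centos-stream:9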
A couple notes from the SIG-storage discussion of this issue:
- Could there be a network connectivity issue to the registry server specific to this environment?
So, following our discussion in the meeting, I went digging in containers/image (the library we use to pull from the registry) and noticed a similar issue there, an RFE for retrying on unexpected EOF / connection reset by peer errors: https://github.com/containers/image/issues/1145#issuecomment-1437564599. It seems to specifically impact large images like our use case.
I created a PR to bump this library in CDI to hopefully get this extra resiliency: https://github.com/kubevirt/containerized-data-importer/pull/2874
So I just verified this PR on a cluster-bot gcp cluster:
I0828 15:03:32.885930 1 importer.go:103] Starting importer
...
I0828 15:03:33.380544 1 util.go:194] Writing data...
time="2023-08-28T15:57:29Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 766851345 bytes…"
time="2023-08-28T16:34:40Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 157415654 bytes…"
...
I0828 17:06:46.600304 1 data-processor.go:255] New phase: Complete
I0828 17:06:46.600386 1 importer.go:216] Import Complete
You can see the retry mechanism had to kick in, and the pull is still very slow; it took more than two hours.
maybe there is some issue when pulling from quay.io when in GCP?
maybe there is some issue when pulling from quay.io when in GCP?
Thought so too, but the slowness reproduces with other HTTP sources like:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
...
spec:
  source:
    http:
      url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2
(Both with curl and the CDI importer process)
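For completeness, a full DV for such a test would look roughly like this (a sketch; the name and storage size are assumptions):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-http
spec:
  source:
    http:
      url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2
  storage:
    resources:
      requests:
        storage: 30Gi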
maybe there is some issue when pulling from quay.io when in GCP?
This is not only a GCP issue; Azure is affected as well.
/cc
@maya-r suggested it may have to do with resource usage of the CDI importer process, and it seems like that is the case here.
We bumped (2x) the CDI default requests & limits and the import completed quickly. I suggest you also give this a try; basically, just edit the CDI resource with:
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
...
spec:
  config:
    featureGates:
      - HonorWaitForFirstConsumer
    podResourceRequirements:
      limits:
        cpu: 1500m
        memory: 1200M
      requests:
        cpu: 100m
        memory: 60M
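A possible way to apply this non-interactively (a sketch; it assumes the CDI CR is named cdi, and on an operator-managed cluster the change may be reconciled away):
oc patch cdi cdi --type merge -p '{"spec":{"config":{"podResourceRequirements":{"limits":{"cpu":"1500m","memory":"1200M"},"requests":{"cpu":"100m","memory":"60M"}}}}}'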
I have no idea why we get throttled for a simple image pull, so we will have to figure that out. One difference between 4.13 and 4.14 is cgroups v2 being the default, so throttles happen instead of OOMs.
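One way to check whether the importer is actually being CPU-throttled under cgroup v2 (a sketch; the pod name and namespace are placeholders):
oc exec -n openshift-virtualization-os-images <importer-pod> -c importer -- cat /sys/fs/cgroup/cpu.stat
# nr_throttled / throttled_usec growing over time would indicate CPU throttling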
As you can see in this PR, https://github.com/kubevirt/common-templates/pull/542, many tests failed on importing a DV into the cluster even with the change.
/cc
@akalenyu thanks for your impressive investigation! @ksimon1 can you confirm that the issue is fixed?
Yes, the issue is fixed.
What happened: During a run of the common templates e2e tests, the import of a DV fails on a GCP environment.
What you expected to happen: DV is imported without error
How to reproduce it (as minimally and precisely as possible): Run the common templates e2e tests (I can help set up the env),
OR request a new cluster via cluster-bot with the command:
launch 4.14 gcp,virtualization-support
deploy KubeVirt and CDI, and create the DataVolume.
Environment:
CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.56.1
[ksimon:13:03:16~/go/src/kubevirt.io/common-templates]$ oc version
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-08-10-021647
Kubernetes Version: v1.27.4+54fa6e1
DV definition:
Log from importer pod: