Is it related to https://github.com/kubevirt/containerized-data-importer/issues/2836?
I can't say, because this error happens during the first import of the disk from quay into the cluster. #2836 happens during the clone of the already-imported disk (the one this bug is about) to a new DataVolume.
Did anything change in the push process of the common template images? Usually the disk image file is 107:107, but in this case I see otherwise. BTW, why do you need the other files in the container image (boot/, dev/, etc.)?
The images are created the same way as in the past:
FROM kubevirt/container-disk-v1alpha
ADD centos-stream-9.qcow2 /disk
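For context, the containerDisk is then typically built and pushed with something like this (a sketch; the registry path is only a placeholder):
podman build -t quay.io/<org>/centos-stream9-container-disk:latest .
podman push quay.io/<org>/centos-stream9-container-disk:latest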
Are you able to reproduce this with the standard containerDisk images (i.e. quay.io/containerdisks/centos-stream:9)? This definitely seems like an issue with this specific containerDisk.
Yes: CentOS Stream 8, CentOS Stream 9, Ubuntu, RHCOS (all four tested from quay.io/containerdisks), and Windows (our custom image) are not working (same error), while Fedora (both our custom image and the one from quay.io/containerdisks) and openSUSE are working. The same is happening on Azure clusters.
@aglitke, @akalenyu I just tested this on OCP 4.15 and it is happening there too.
I noticed that the download speed inside the importer pod is reduced:
Check that the cluster is fine:
oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 96.4M 0 0:00:09 0:00:09 --:--:-- 101M
Run the same download at the same time inside an importer pod:
oc exec -i -n openshift-virtualization-os-images importer-prime-9d4f16d6-6d2b-48e4-9f9c-2d1f33159edc -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 3920k 0 0:03:56 0:03:56 --:--:-- 94.0M
On OCP 4.14.0-0.nightly-2023-08-11-055332, CNV 4.14.0-1744 provided by Red Hat.
@akalenyu Can you reproduce the numbers and do you have an idea why the download speed might be degraded?
I think I am hitting this trying to import a Fedora CoreOS kubevirt image from a registry. I tried it in a GCP 4.14 cluster and it fails (let me know if you want the logs). On a Bare Metal 4.13 cluster it succeeds. I know this is apples and oranges, but that last datapoint at least lets me know my containerdisk in the registry is good.
Here's the VM definition I'm using:
I'll follow along in this issue to see what the resolution is.
Thanks for jumping into this issue! It would be great to see if this is indeed the same by following the importer pod logs, similarly to how the author did:
[ksimon:12:53:22~/Stažené]$ oc logs -f importer-centos-stream9-datavolume-original
...
I0810 10:53:18.868180 1 transport.go:152] File 'disk/centos-stream-9.qcow2' found in the layer
I0810 10:53:18.868397 1 util.go:191] Writing data...
E0810 11:41:35.344127 1 util.go:193] Unable to write file from dataReader: unexpected EOF
E0810 11:41:35.409811 1 transport.go:161] Error copying file: unable to write to file: unexpected EOF
The PVC yamls corresponding to the import operation would also be helpful:
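For example, something like this should capture them (the namespace here is an assumption based on the importer pod name above):
oc get dv,pvc -n openshift-virtualization-os-images -o yaml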
Can't reproduce these numbers on 4.14.0-1763:
$ oc exec -i -n openshift-monitoring prometheus-k8s-0 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 255M 0 0:00:03 0:00:03 --:--:-- 255M
$ oc exec -i -n default importer-prime-9614b3bd-7e71-4306-96b2-c042efc26929 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 98.3M 0 0:00:09 0:00:09 --:--:-- 102M
In general, it would be surprising to see a slowdown with mirrors, as we have recently merged and released https://github.com/kubevirt/containerized-data-importer/pull/2841 which makes us download the entire image and only then convert it.
Is it possible the pods you picked are scheduled to different nodes?
Thanks for jumping into this issue! It would be great to see if this is indeed the same by following the importer pod logs,
The logs are here: importer-fcos-data-volume-importer.txt
Unfortunately, I no longer have the PVC yamls, as the cluster got taken down on Friday.
Can't reproduce these numbers on 4.14.0-1763:
@akalenyu Don't your curl download speeds show a performance degradation by a factor of 2.5?
@akalenyu The same behaviour Dominik is observing happens when the DV has pullMethod: node.
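For reference, a minimal sketch of such a DV (name, URL, and storage size are placeholders, not the exact manifest used here):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-node-pull
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/centos-stream:9
      pullMethod: node
  storage:
    resources:
      requests:
        storage: 30Gi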
Is it possible the pods you picked are scheduled to different nodes?
I tried it, and the importer pod is slower than the prometheus pod, even though the pods are on the same node:
[ksimon:10:14:35~]$ oc exec -i -n openshift-monitoring prometheus-k8s-1 -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
10 906M 10 91.2M 0 0 70.2M 0 0:00:12 0:00:01 0:00:11 70.1M
20 906M 20 185M 0 0 80.4M 0 0:00:11 0:00:02 0:00:09 80.4M
vs
[ksimon:10:15:29~]$ oc exec -i importer-prime-eda481be-064c-4a9f-967d-752cb4b3c2fb -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
Defaulted container "importer" out of: importer, server, init (init)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 906M 0 1366k 0 0 503k 0 0:30:44 0:00:02 0:30:42 503k
Thanks for jumping into this issue! It would be great to see if this is indeed the same by following the importer pod logs,
The logs are here: importer-fcos-data-volume-importer.txt
Unfortunately, I no longer have the PVC yamls, as the cluster got taken down on Friday.
Yep that's the same issue
Can't reproduce these numbers on 4.14.0-1763:
@akalenyu Don't your curl download speeds show a performance degradation by a factor of 2.5?
Maybe this depends on which mirror centos.org redirects to:
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 63.9M 0 0:00:14 0:00:14 --:--:-- 36.2M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 156M 0 0:00:05 0:00:05 --:--:-- 162M
$ oc exec -i -n default importer-test -- curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2 -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 906M 100 906M 0 0 284M 0 0:00:03 0:00:03 --:--:-- 284M
@akalenyu The same behaviour Dominik is observing happens when the DV has pullMethod: node.
pullMethod: node uses the container runtime on the node to pull images, similarly to how it would do that for regular pod images. Are those super slow too? You can try a simple crictl pull on the node.
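For example (a sketch; the node name is a placeholder):
oc debug node/<node-name>
chroot /host
crictl pull quay.io/containerdisks/centos-stream:9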
A couple notes from the SIG-storage discussion of this issue:
- Could there be a network connectivity issue to the registry server specific to this environment?
So, following our discussion in the meeting, I went digging in containers/image (the library we use to pull from the registry) and noticed a similar issue there, an RFE for retrying on unexpected EOF / connection reset by peer errors: https://github.com/containers/image/issues/1145#issuecomment-1437564599. It seems to specifically impact large images like our use case.
I created a PR to bump this library in CDI to hopefully get this extra resiliency: https://github.com/kubevirt/containerized-data-importer/pull/2874
So I just verified this PR on a cluster-bot gcp cluster:
I0828 15:03:32.885930 1 importer.go:103] Starting importer
...
I0828 15:03:33.380544 1 util.go:194] Writing data...
time="2023-08-28T15:57:29Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 766851345 bytes…"
time="2023-08-28T16:34:40Z" level=info msg="Reading blob body from https://quay.io/v2/containerdisks/centos-stream/blobs/sha256:ad685da39a47681aff950792a52c35c44b35d1d6e610f21cdbc9cc7494e24720 failed (unexpected EOF), reconnecting after 157415654 bytes…"
...
I0828 17:06:46.600304 1 data-processor.go:255] New phase: Complete
I0828 17:06:46.600386 1 importer.go:216] Import Complete
You can see the retry mechanism had to kick in, and the pull is still very slow; it took more than two hours.
maybe there is some issue when pulling from quay.io when in GCP?
maybe there is some issue when pulling from quay.io when in GCP?
Thought so too, but the slowness reproduces with other HTTP sources like:
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
...
spec:
  source:
    http:
      url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2
(Both with curl and the CDI importer process)
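For completeness, a full DV for such a test would look roughly like this (a sketch; the name and storage size are assumptions):
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: centos-stream9-http
spec:
  source:
    http:
      url: https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-9-20220829.0.x86_64.qcow2
  storage:
    resources:
      requests:
        storage: 30Gi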
maybe there is some issue when pulling from quay.io when in GCP?
This is not only a GCP issue; Azure is affected as well.
/cc
@maya-r suggested it may have to do with resource usage of the CDI importer process, and it seems like that is the case here.
We bumped (2x) the CDI default requests & limits and the import completed quickly. I suggest you also give this a try; basically, just edit the CDI resource with:
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
...
spec:
  config:
    featureGates:
      - HonorWaitForFirstConsumer
    podResourceRequirements:
      limits:
        cpu: 1500m
        memory: 1200M
      requests:
        cpu: 100m
        memory: 60M
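A possible way to apply this non-interactively (a sketch; it assumes the CDI CR is named cdi, and on an operator-managed cluster the change may be reconciled away):
oc patch cdi cdi --type merge -p '{"spec":{"config":{"podResourceRequirements":{"limits":{"cpu":"1500m","memory":"1200M"},"requests":{"cpu":"100m","memory":"60M"}}}}}'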
I have no idea why we get throttled for a simple image pull, so we will have to figure that out. One difference between 4.13 and 4.14 is cgroups v2 being the default, so throttles happen instead of OOMs.
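One way to check whether the importer is actually being CPU-throttled under cgroup v2 (a sketch; the pod name and namespace are placeholders):
oc exec -n openshift-virtualization-os-images <importer-pod> -c importer -- cat /sys/fs/cgroup/cpu.stat
# nr_throttled / throttled_usec growing over time would indicate CPU throttling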
As you can see in this PR, https://github.com/kubevirt/common-templates/pull/542, many tests failed on importing a DV into the cluster even with the change.
/cc
@akalenyu thanks for your impressive investigation! @ksimon1 can you confirm that the issue is fixed?
Yes, the issue is fixed.
What happened: During a run of the common templates e2e tests, the import of a DV fails on a GCP environment.
What you expected to happen: DV is imported without error
How to reproduce it (as minimally and precisely as possible): Run the common templates e2e tests (I can help set up the env),
OR request a new cluster via cluster-bot with the command:
launch 4.14 gcp,virtualization-support
deploy KubeVirt and CDI, and create the DataVolume.
Environment:
CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.56.1
[ksimon:13:03:16~/go/src/kubevirt.io/common-templates]$ oc version
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-08-10-021647
Kubernetes Version: v1.27.4+54fa6e1
DV definition:
Log from importer pod: