kubevirt / kubevirtci

Contains cluster definitions and client tools to quickly spin up and destroy ephemeral and scalable k8s and ocp clusters for testing
Apache License 2.0

K3D on Arm64 server Verification #984

Closed zhlhahaha closed 1 year ago

zhlhahaha commented 1 year ago

Hi @oshoval, I have put the tasks here and will do a basic verification first. As neither k3s nor k3d has been verified on an Arm64 server, I am not sure how much work is needed. Personally, I want to finish KubeVirt feature verification and e2e test enablement on Arm first; then I can put more effort into this.

oshoval commented 1 year ago

Hi @zhlhahaha, sure, no rush.

Thanks

oshoval commented 1 year ago

You know there is k3d-1.25-sriov, right? You can use it as a baseline with export DEPLOY_SRIOV=false. Just making sure, so you won't need to duplicate work.
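A minimal sketch of that baseline setup, assuming the usual kubevirtci entry points (the KUBEVIRT_PROVIDER variable and the `make cluster-up` target). Only DEPLOY_SRIOV=false comes from this thread; the helper names are hypothetical.

```python
import os
import subprocess


def provider_env(provider="k3d-1.25-sriov", deploy_sriov=False, base_env=None):
    """Return a copy of the environment with the provider settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: kubevirtci selects the provider through KUBEVIRT_PROVIDER.
    env["KUBEVIRT_PROVIDER"] = provider
    # From the thread: DEPLOY_SRIOV=false, presumably skipping the SR-IOV setup.
    env["DEPLOY_SRIOV"] = "true" if deploy_sriov else "false"
    return env


def cluster_up(repo_dir, env):
    """Run `make cluster-up` in a kubevirtci checkout (assumed target name)."""
    subprocess.run(["make", "cluster-up"], cwd=repo_dir, env=env, check=True)
```

Calling `cluster_up("/path/to/kubevirtci", provider_env())` would then start the provider without SR-IOV, if the assumptions above hold for the checkout in use.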

zhlhahaha commented 1 year ago

You know there is k3d-1.25-sriov, right? You can use it as a baseline with export DEPLOY_SRIOV=false. Just making sure, so you won't need to duplicate work.

Yes, I gave it a try yesterday, and it failed to start on the Arm64 server because of the CNI. I am not sure why it does not work. I will take a look tomorrow.

oshoval commented 1 year ago

Thanks

Btw, why don't you use VM-based providers instead? What about adapting them to support Arm64? That might be better. k3d / kind as part of kubevirtci are more experimental than the VM-based providers and are dedicated to SR-IOV / vGPU. VM-based providers are also more robust. Atm we don't maintain kind-1.23 etc. I feel we should have one path that is e2e covered, whatever it is.

brianmcarey commented 1 year ago

Btw, why don't you use VM-based providers instead? What about adapting them to support Arm64? That might be better.

As far as I remember Arm64 doesn't support nested virtualization.

oshoval commented 1 year ago

As far as I remember Arm64 doesn't support nested virtualization.

https://lwn.net/Articles/921783/ wdyt ?

zhlhahaha commented 1 year ago

As far as I remember Arm64 doesn't support nested virtualization.

https://lwn.net/Articles/921783/ wdyt ?

It is a hardware limitation. Currently, most Arm64 CPUs on the market (including the Arm64 server used for KubeVirt CI/CD) do not support nested virtualization.
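One practical consequence: inside a VM on a host without nested virtualization, the guest normally has no usable KVM device, so a VM-based provider could not run hardware-accelerated VMIs there. A minimal sketch of such a check, assuming the standard Linux /dev/kvm device node (the function name is hypothetical):

```python
import os


def kvm_usable(device="/dev/kvm"):
    """Return True if the KVM device node exists and is readable and writable."""
    return os.path.exists(device) and os.access(device, os.R_OK | os.W_OK)
```

This only checks the device node. It does not prove that the CPU supports nested virtualization, but a missing /dev/kvm inside a guest is the usual symptom.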

zhlhahaha commented 1 year ago

Hi @oshoval, I have verified k3d on an Arm64 server and in a nested container environment (bootstrap image). The good news is that k3d starts successfully on the Arm64 server and in the nested container environment; 81 E2E tests passed and 1 failed. I made the following modifications:

  1. use the default CNI rather than Calico (any specific reason to use the Calico CNI?)
  2. make it run as a one-node cluster (only the k3d-k3d-server-0 node)

Here are some issues:

  1. vmi-killer does not seem to work in the cluster
  2. tests/reporter does not seem to work well
  3. E2E tests fail occasionally in a multi-node k3d cluster

I am still checking these issues. I also need to verify the stability of the k3d provider.

oshoval commented 1 year ago

Hi @oshoval, I have verified k3d on an Arm64 server and in a nested container environment (bootstrap image). The good news is that k3d starts successfully on the Arm64 server and in the nested container environment; 81 E2E tests passed and 1 failed. I made the following modifications:

  1. use the default CNI rather than Calico (any specific reason to use the Calico CNI?)

Thank you

We must install the CNI manually, whether it is Calico or Flannel, because without it Multus doesn't work. Please see https://github.com/kubevirt/kubevirtci/pull/972#issuecomment-1455822191 and the comments below it on the cons (it also mentions an unmerged commit that uses it).

  2. make it run as a one-node cluster (only the k3d-k3d-server-0 node)

We need at least 2 nodes (atm we use 3) because we test migration.
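The node-count requirement can be checked before running migration tests. This is a sketch under assumptions: it parses the JSON printed by `kubectl get nodes -o json`, the function names are hypothetical, and the sample data (including the agent node name) is illustrative only.

```python
import json


def ready_node_names(nodes_json):
    """Return names of nodes whose Ready condition is "True".

    `nodes_json` is the text printed by `kubectl get nodes -o json`.
    """
    ready = []
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            ready.append(node["metadata"]["name"])
    return ready


def can_test_migration(nodes_json, minimum=2):
    """Migration needs a source and a target node, hence at least two."""
    return len(ready_node_names(nodes_json)) >= minimum


# Hypothetical sample shaped like kubectl output; the agent name is illustrative.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "k3d-k3d-server-0"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "k3d-k3d-agent-0"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]})
```

With the sample above only one node is Ready, so `can_test_migration(SAMPLE)` is False, which matches the one-node setup being unsuitable for migration tests.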

Due to 1 and 2, let's have a different folder for k3d-1.25? It seems SR-IOV needs a different config. I prefer to keep the providers untangled, like all the others.

Note that once we move to creating the cluster from a manifest, it might be easier to maintain two different configurations. Atm there is a bug there with podman support, so it needs a workaround (the network field is broken).

Here are some issues:

  1. vmi-killer does not seem to work in the cluster
  2. tests/reporter does not seem to work well
  3. E2E tests fail occasionally in a multi-node k3d cluster

I am still checking these issues. I also need to verify the stability of the k3d provider.

Thank you

zhlhahaha commented 1 year ago

We must install the CNI manually, whether it is Calico or Flannel, because without it Multus doesn't work.

I see. I found out why Calico does not work on Arm64. It seems that the images from quay.io/calico/ are x86_64 only. We need to pull multi-arch container images for Calico.

Let's have a different folder for k3d-1.25?

OK. As k3d-1.25 is currently used only by Arm64 CI/CD, I want to make it as simple as possible.

Atm there is a bug there with podman support, so it needs a workaround (the network field is broken).
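Whether an image is multi-arch can be checked from its manifest list. The sketch below parses the JSON printed by `docker manifest inspect <image>`; the function names are hypothetical and the sample is made-up data with digests omitted, not real Calico output.

```python
import json


def image_architectures(manifest_json):
    """Return the set of CPU architectures listed in a manifest list / OCI index.

    A single-platform manifest has no "manifests" array, so the result is empty.
    """
    doc = json.loads(manifest_json)
    return {
        entry["platform"]["architecture"]
        for entry in doc.get("manifests", [])
        if "platform" in entry
    }


def supports_arch(manifest_json, arch="arm64"):
    """Return True if the manifest list advertises the given architecture."""
    return arch in image_architectures(manifest_json)


# Hypothetical manifest list; not real Calico data.
SAMPLE = json.dumps({
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
        {"platform": {"architecture": "arm64", "os": "linux"}},
    ],
})
```

Note that manifests use Go-style architecture names, so x86_64 appears as amd64 and Arm64 as arm64.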

Do you have more information on this?

Thanks, @oshoval

oshoval commented 1 year ago

Atm there is a bug there with podman support, so it needs a workaround (the network field is broken).

Do you have more information on this?

https://rancher-users.slack.com/archives/CHM1EB3A7/p1679999162750929?thread_ts=1678090269.551049&cid=CHM1EB3A7

You can look here (WIP): https://github.com/oshoval/kubevirtci/commit/f0aa327efc391c60c76e24de8f93128091ed735b

This is a hack that fixes it locally (not on CI, because on CI we can't create this podman network): https://github.com/oshoval/kubevirtci/commit/4b281debc2fa6eadf7ebfa520f5446a7c990d8e9

I think the solution will be to remove the network field (which has the bug) and configure CI to have a default network named bridge when using podman. But it will take time. I might open an issue about it for k3d.

oshoval commented 1 year ago

Note that since we don't support podman on CI yet, the bug can actually be disregarded atm (podman is supported only locally), so we can use manifests to get a robust provider more easily, assuming you are using docker.

zhlhahaha commented 1 year ago

Note that since we don't support podman on CI yet, the bug can actually be disregarded atm (podman is supported only locally), so we can use manifests to get a robust provider more easily, assuming you are using docker.

podman is now used in the bootstrap image, so this is a problem.

oshoval commented 1 year ago

Do you think you can adjust it to use a default network named "bridge", so that it works for you? For us podman doesn't work at all because we have multiple nodes, and on CI with netavark it doesn't work atm.

Anyhow, no rush about it, we can discuss when time comes.

zhlhahaha commented 1 year ago

Hi @oshoval, I have almost finished the verification of k3d on Arm64.

I still have one uncertainty. If the e2e tests pass in a k3d cluster, does that mean KubeVirt works well on a standard k8s cluster? On x86_64 servers we have comprehensive tests that verify it works well on a standard k8s cluster, whereas on Arm64 we only have the E2E tests in a nested containerized environment. If we migrate the Arm64 E2E test pipeline from the kind provider to the k3d provider, are there potential risks or uncertainties?

cc: @rmohr @dhiller @qinqon @xpivarc @brianmcarey

oshoval commented 1 year ago

Hi @zhlhahaha, thanks for the effort. Please see the comments on https://github.com/kubevirt/kubevirtci/pull/994. I think we should wait for proper manifest usage; meanwhile, imo, you can duplicate the files you need under a different name.

Well, k3s is k8s compatible. It has a slightly different architecture than k8s, but it is considered compatible, and it is hosted by CNCF at the sandbox stage. For sig-network we are fine with it; tbh I can't tell what is needed for ARM.

zhlhahaha commented 1 year ago

For sig-network we are fine with it; tbh I can't tell what is needed for ARM.

@rmohr @dhiller @qinqon @xpivarc @brianmcarey Do you have any suggestions?

xpivarc commented 1 year ago

What is the reason to migrate to K3D? I did not check it out, is it certified?

oshoval commented 1 year ago

The reason: https://github.com/kubernetes-sigs/kind/issues/2999, unless you don't need the CPU manager for ARM e2e testing. The kind provider currently in use is out of date (1.23).

It is a CNCF-certified, k8s-compatible distribution in the Sandbox phase. CNCF maintains both k8s and k3s: https://k3s.io/

k3d is a wrapper around k3s that allows running k3s in containers.

xpivarc commented 1 year ago

Just a note, the https://github.com/kubernetes-sigs/kind/issues/2999 is now resolved.

oshoval commented 1 year ago

Just a note, the kubernetes-sigs/kind#2999 is now resolved.

Thanks. Please try the SR-IOV e2e (if you want and have a provider), since the /dev/null issue is not necessarily the same problem as "open /dev/ptmx: operation not permitted: unknown" (though possibly it is; maybe you already tried a POC, but anyhow we need e2e). I am also not sure that kind is better than k3d for us (sig-network specifically). It seems k3d is developing faster atm and is pretty stable and lighter. If needed we can go back to kind; it is good to have an alternative. Note that atm it isn't official yet and also needs work; personally I have other priorities.

Btw, I think we should consider a "kubevirt - easy to start" repo where we can have both kind and k3d with the basic recommended configuration that allows running a simple KubeVirt machine. But that is another story.

Of course, if Howard prefers to keep kind, he / we can update the non-SR-IOV kind provider. Note that we don't have e2e for ARM, so we can't maintain it (we as sig-network maintain only k3d-sriov).

zhlhahaha commented 1 year ago

Btw, I think we should consider a "kubevirt - easy to start" repo where we can have both kind and k3d with the basic recommended configuration that allows running a simple KubeVirt machine. But that is another story.

It is a good idea.

Of course, if Howard prefers to keep kind, he / we can update the non-SR-IOV kind provider. Note that we don't have e2e for ARM, so we can't maintain it (we as sig-network maintain only k3d-sriov).

As kind is used in the Kubernetes CI/CD pipeline, I think it is more reliable to verify KubeVirt in the kind k8s environment on the Arm platform, so I prefer to keep the kind provider and use it in the E2E tests for Arm. I can maintain the provider.