openshift / microshift

A small form factor OpenShift/Kubernetes optimized for edge computing
https://microshift.io
Apache License 2.0

NP-646: MicroShift should not cause the host IP to change on startup #1061

Closed. adelton closed this issue 1 year ago

adelton commented 1 year ago

What happened?

I ran the steps at https://microshift.io/docs/getting-started/.

What did you expect to happen?

I expected oc get pods -A and oc get nodes to show some pods and nodes. Instead they both report No resources found.

How to reproduce it (as minimally and precisely as possible)?

  1. Have a fresh Fedora 36 machine with just @core group installed (I used one in beaker).
  2. # dnf module enable -y cri-o:1.21 ; dnf install -y cri-o cri-tools
  3. # systemctl enable crio --now
  4. # dnf copr enable -y @redhat-et/microshift
  5. # dnf install -y microshift
  6. I skipped the firewalld steps here because firewalld was not running on my system (a rough sketch of those steps is included after this list).
  7. # systemctl enable microshift --now
  8. # curl -O https://mirror.openshift.com/pub/openshift-v4/$(uname -m)/clients/ocp/stable/openshift-client-linux.tar.gz
  9. # tar -xf openshift-client-linux.tar.gz -C /usr/local/bin oc kubectl
  10. # mkdir ~/.kube ; ln -s /var/lib/microshift/resources/kubeadmin/kubeconfig ~/.kube/config
  11. # oc get pods -A
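
For reference, the firewalld steps skipped in item 6 are, from memory of the getting-started page, roughly the following; the exact subnet and port list should be taken from that page rather than from here:

# firewall-cmd --zone=trusted --add-source=10.42.0.0/16 --permanent
# firewall-cmd --zone=public --add-port=80/tcp --permanent
# firewall-cmd --zone=public --add-port=443/tcp --permanent
# firewall-cmd --zone=public --add-port=6443/tcp --permanent
# firewall-cmd --zone=public --add-port=5353/udp --permanent
# firewall-cmd --reload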

Anything else we need to know?

systemctl status microshift shows

● microshift.service - MicroShift
     Loaded: loaded (/usr/lib/systemd/system/microshift.service; enabled; vendor preset: disabled)
     Active: active (running) since Fri 2022-10-28 16:45:18 CEST; 36s ago
   Main PID: 3708 (microshift)
      Tasks: 9 (limit: 3451)
     Memory: 428.9M
        CPU: 9.181s
     CGroup: /system.slice/microshift.service
             └─ 3708 microshift run

Oct 28 16:45:51 machine.example.com microshift[3708]: E1028 16:45:51.958733    3708 available_controller.go:508] v1.apps.openshift.io failed with: failing or missing resp>
Oct 28 16:45:51 machine.example.com microshift[3708]: E1028 16:45:51.961777    3708 available_controller.go:508] v1.project.openshift.io failed with: failing or missing r>
Oct 28 16:45:51 machine.example.com microshift[3708]: E1028 16:45:51.961780    3708 available_controller.go:508] v1.build.openshift.io failed with: failing or missing res>
Oct 28 16:45:51 machine.example.com microshift[3708]: E1028 16:45:51.962059    3708 available_controller.go:508] v1.template.openshift.io failed with: failing or missing >
Oct 28 16:45:51 machine.example.com microshift[3708]: E1028 16:45:51.962105    3708 available_controller.go:508] v1.route.openshift.io failed with: failing or missing res>
Oct 28 16:45:51 machine.example.com microshift[3708]: E1028 16:45:51.962173    3708 available_controller.go:508] v1.image.openshift.io failed with: failing or missing res>
Oct 28 16:45:52 machine.example.com microshift[3708]: E1028 16:45:52.074561    3708 available_controller.go:508] v1.user.openshift.io failed with: failing or missing resp>
Oct 28 16:45:53 machine.example.com microshift[3708]: E1028 16:45:53.823184    3708 reflector.go:138] github.com/openshift/client-go/image/informers/externalversions/fact>
Oct 28 16:45:54 machine.example.com microshift[3708]: I1028 16:45:54.214773    3708 crd.go:164] Applied openshift CRD assets/crd/0000_10_config-operator_01_image.crd.yaml
Oct 28 16:45:54 machine.example.com microshift[3708]: I1028 16:45:54.214785    3708 crd.go:153] Applying openshift CRD assets/crd/0000_03_config-operator_01_proxy.crd.yaml

Assuming the clues are in some earlier journal entries with the "E*" (error) designation, the first microshift one is

Oct 28 16:31:25 machine.example.com microshift[2632]: E1028 16:31:25.046613    2632 controller.go:152] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/10.43.140.11, ResourceVersion: 0, AdditionalErrorMsg:

and then

Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.239775    2632 reflector.go:138] github.com/openshift/openshift-controller-manager/pkg/unidling/controller/unidling_controller.go:221: Failed to watch *v1.Event: failed to list *v1.Event: events is forbidden: User "system:serviceaccount:openshift-infra:unidling-controller" cannot list resource "events" in API group "" at the cluster scope

and then a stream of

Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.288753    2632 reflector.go:138] github.com/openshift/client-go/operator/informers/externalversions/factory.go:101: Failed to watch *v1alpha1.ImageContentSourcePolicy: failed to list *v1alpha1.ImageContentSourcePolicy: the server could not find the requested resource (get imagecontentsourcepolicies.operator.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309168    2632 reflector.go:138] github.com/openshift/client-go/apps/informers/externalversions/factory.go:101: Failed to watch *v1.DeploymentConfig: failed to list *v1.DeploymentConfig: the server could not find the requested resource (get deploymentconfigs.apps.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309193    2632 reflector.go:138] github.com/openshift/client-go/build/informers/externalversions/factory.go:101: Failed to watch *v1.Build: failed to list *v1.Build: the server could not find the requested resource (get builds.build.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309211    2632 reflector.go:138] github.com/openshift/client-go/build/informers/externalversions/factory.go:101: Failed to watch *v1.BuildConfig: failed to list *v1.BuildConfig: the server could not find the requested resource (get buildconfigs.build.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309228    2632 reflector.go:138] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: Failed to watch *v1.Build: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309244    2632 reflector.go:138] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: Failed to watch *v1.Proxy: failed to list *v1.Proxy: the server could not find the requested resource (get proxies.config.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309260    2632 reflector.go:138] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: Failed to watch *v1.Image: failed to list *v1.Image: the server could not find the requested resource (get images.config.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309275    2632 reflector.go:138] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: Failed to watch *v1.ImageStream: failed to list *v1.ImageStream: the server could not find the requested resource (get imagestreams.image.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309290    2632 reflector.go:138] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: Failed to watch *v1.Image: failed to list *v1.Image: the server could not find the requested resource (get images.image.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309307    2632 reflector.go:138] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: Failed to watch *v1.TemplateInstance: failed to list *v1.TemplateInstance: the server could not find the requested resource (get templateinstances.template.openshift.io)
Oct 28 16:31:27 machine.example.com microshift[2632]: E1028 16:31:27.309322    2632 reflector.go:138] github.com/openshift/client-go/route/informers/externalversions/factory.go:101: Failed to watch *v1.Route: failed to list *v1.Route: the server could not find the requested resource (get routes.route.openshift.io)
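
These klog "E..." lines can be pulled out of the journal with plain journalctl and grep, for example:

# journalctl -u microshift --no-pager | grep -E ' E[0-9]{4} '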


ggiguash commented 1 year ago

@adelton, the microshift.io site contains references to the old code. Is there a reason you cannot try the instructions at https://github.com/openshift/microshift?

adelton commented 1 year ago

@ggiguash Do you have https://github.com/openshift/microshift/blob/main/docs/getting_started.md in mind? That seems to focus on running MicroShift as a VM via virt-install and a kickstart, rather than deploying on an existing RHEL or Fedora machine via rpm/dnf package installation. I don't like being forced into these types of VM installations, one reason being that they are hard to automate with beaker because I won't have the harness on that VM.

Is there a getting-started document at https://github.com/openshift/microshift which describes installation and configuration of MicroShift using the standard "have a machine + enable repo(s) + install packages + do some configuration and run services" workflow, similar to https://microshift.io/docs/getting-started/?

ggiguash commented 1 year ago

Is there a getting-started document at https://github.com/openshift/microshift which describes installation and configuration of MicroShift using the standard "have a machine + enable repo(s) + install packages + do some configuration and run services" workflow, similar to https://microshift.io/docs/getting-started/?

Yes, see this page for a detailed description of how to configure a devenv

adelton commented 1 year ago

My goal is to consume an rpm-built MicroShift on a given RHEL, CentOS, or Fedora machine, not to build it from sources.

So I tried the steps from https://raw.githubusercontent.com/openshift/microshift/main/docs/config/microshift-starter.ks, basically using RHEL 8.6 and running

# CENTOS8BASE=http://mirror.centos.org/centos/8-stream/BaseOS/x86_64/os/Packages
# curl -LO -s $CENTOS8BASE/selinux-policy-3.14.3-96.el8.noarch.rpm
# curl -LO -s $CENTOS8BASE/selinux-policy-devel-3.14.3-96.el8.noarch.rpm
# curl -LO -s $CENTOS8BASE/selinux-policy-targeted-3.14.3-96.el8.noarch.rpm
# dnf localinstall -y selinux-policy*.rpm
# dnf copr enable -y @redhat-et/microshift-testing
# dnf install -y microshift
# systemctl enable microshift --now

The terminal (ssh) eventually gets stuck. The journalctl -fl output ends with

Oct 31 14:04:51 machine.example.com microshift[17919]: kubelet E1031 14:04:51.873013   17919 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"dns-node-resolver\" with ErrImagePull: \"rpc error: code = Unknown desc = reading manifest sha256:4d182d11a30e6c3c1420502bec5b1192c43c32977060c4def96ea160172f71e7 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized\"" pod="openshift-dns/node-resolver-45796" podUID=63e75c1a-9689-45db-b646-6eea0a58ed25
Oct 31 14:04:51 machine.example.com microshift[17919]: kubelet E1031 14:04:51.874355   17919 pod_workers.go:951] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/cni/net.d/. Has your network provider started?" pod="openshift-dns/dns-default-cnq7k" podUID=e66a2aa8-c940-46ce-8ab7-ddbb92310491
Oct 31 14:04:51 machine.example.com ovs-vsctl[18532]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . "external_ids:ovn-remote=\"unix:/var/run/ovn/ovnsb_db.sock\""
Oct 31 14:04:51 machine.example.com ovs-vsctl[18533]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=10.43.140.16 external_ids:ovn-remote-probe-interval=180000 external_ids:ovn-openflow-probe-interval=180 "external_ids:hostname=\"machine.example.com\"" external_ids:ovn-monitor-all=true external_ids:ovn-ofctrl-wait-before-clear=0 external_ids:ovn-enable-lflow-cache=false external_ids:ovn-memlimit-lflow-cache-kb=870
Oct 31 14:04:51 machine.example.com ovs-vsctl[18534]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 -- clear bridge br-int netflow -- clear bridge br-int sflow -- clear bridge br-int ipfix
Oct 31 14:04:51 machine.example.com ovs-vsctl[18536]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 -- --if-exists del-port br-int k8s-machine.example -- --may-exist add-port br-int ovn-k8s-mp0 -- set interface ovn-k8s-mp0 type=internal mtu_request=1400 external-ids:iface-id=k8s-machine.example.com
Oct 31 14:04:51 machine.example.com NetworkManager[15509]: <info>  [1667221491.9829] manager: (ovn-k8s-mp0): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/10)
Oct 31 14:04:51 machine.example.com NetworkManager[15509]: <info>  [1667221491.9832] device (ovn-k8s-mp0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Oct 31 14:04:51 machine.example.com NetworkManager[15509]: <info>  [1667221491.9835] manager: (ovn-k8s-mp0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/11)
Oct 31 14:04:51 machine.example.com NetworkManager[15509]: <info>  [1667221491.9837] device (ovn-k8s-mp0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Oct 31 14:04:51 machine.example.com kernel: device ovn-k8s-mp0 entered promiscuous mode
Oct 31 14:04:51 machine.example.com systemd-udevd[18539]: Using default interface naming scheme 'rhel-8.0'.
Oct 31 14:04:51 machine.example.com systemd-udevd[18539]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Oct 31 14:04:51 machine.example.com systemd-udevd[18539]: Could not generate persistent MAC address for ovn-k8s-mp0: No such file or directory
Oct 31 14:04:51 machine.example.com ovs-vsctl[18543]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 "mac=6e\\:29\\:33\\:8c\\:01\\:d4"
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.0000] device (ovn-k8s-mp0): carrier: link connected
Oct 31 14:04:52 machine.example.com ovs-vsctl[18565]: ovs|00001|db_ctl_base|ERR|no port named br-ex
Oct 31 14:04:52 machine.example.com ovs-vsctl[18573]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-bridge-mappings=physnet:br-ex
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.4837] manager: (patch-br-int-to-br-ex_machine.example.com): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/12)
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.4839] device (patch-br-int-to-br-ex_machine.example.com): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.4841] manager: (patch-br-ex_machine.example.com-to-br-int): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/13)
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.4842] device (patch-br-ex_machine.example.com-to-br-int): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.4845] manager: (patch-br-int-to-br-ex_machine.example.com): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/14)
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.4846] manager: (patch-br-ex_machine.example.com-to-br-int): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/15)
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.4848] device (patch-br-int-to-br-ex_machine.example.com): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Oct 31 14:04:52 machine.example.com NetworkManager[15509]: <info>  [1667221492.4849] device (patch-br-ex_machine.example.com-to-br-int): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')

So it seems like starting the microshift service from @redhat-et/microshift-testing messes up the networking on the machine.

dhellmann commented 1 year ago

The very first message in the output:

Oct 31 14:04:51 machine.example.com microshift[17919]: kubelet E1031 14:04:51.873013   17919 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"dns-node-resolver\" with ErrImagePull: \"rpc error: code = Unknown desc = reading manifest sha256:4d182d11a30e6c3c1420502bec5b1192c43c32977060c4def96ea160172f71e7 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized\"" pod="openshift-dns/node-resolver-45796" podUID=63e75c1a-9689-45db-b646-6eea0a58ed25

Looks like a problem with the pull secret.

adelton commented 1 year ago

Putting the pull secret both into ~/.pull-secret.json and /etc/crio/openshift-pull-secret seems to make that specific error message go away, but the networking still gets reconfigured in such a way that the machine is no longer accessible via ssh on its original IP address.
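
Concretely, that amounts to something like the following (the ~/pull-secret.json source path is just an example, and restarting microshift afterwards is an assumption on my part rather than a documented requirement):

# install -m 600 ~/pull-secret.json /etc/crio/openshift-pull-secret
# cp ~/pull-secret.json ~/.pull-secret.json
# systemctl restart microshift

After that, the specific ErrImagePull error quoted above no longer appears, but the networking problem remains.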

ggiguash commented 1 year ago

@adelton, can you post the latest logs, having fixed the pull secret issue?

zshi-redhat commented 1 year ago

@adelton Could you also share the log from the microshift-ovs-init systemd service and the output of the following commands on the MicroShift node:

ip link show
ip addr show
ovs-vsctl show

The microshift-ovs-init service sets up an OVS bridge br-ex on the node interface. It flushes the IP of the node interface and regains it on the br-ex bridge. The ssh disconnection might be caused by this network change.
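
Conceptually, with eth0 standing in for the node uplink interface, the effect is similar to the following (the real service goes through its own scripts and NetworkManager, so this is only an illustration of why the original IP momentarily disappears):

ovs-vsctl --may-exist add-br br-ex
ovs-vsctl --may-exist add-port br-ex eth0
ip addr flush dev eth0
dhclient br-ex

If br-ex never completes that last step, the host is left with no usable address on its uplink, which would match the ssh disconnection you see.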

adelton commented 1 year ago

The logs end with

Nov 02 15:47:20 machine.example.com systemd[1]: Started crio-conmon-5a98ec5d1c8971314fdd8e48cc9c6e240f43214be60655ccd9e6d679d30e03ee.scope.
Nov 02 15:47:20 machine.example.com systemd[1]: Started libcontainer container 5a98ec5d1c8971314fdd8e48cc9c6e240f43214be60655ccd9e6d679d30e03ee.
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.270924829+01:00" level=info msg="Created container 5a98ec5d1c8971314fdd8e48cc9c6e240f43214be60655ccd9e6d679d30e03ee: openshift-ovn-kubernetes/ovnkube-master-hbwrh/ovnkube-master" id=0bdb829e-1e18-41e9-8d24-fe5adb956706 name=/runtime.v1.RuntimeService/CreateContainer
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.271211225+01:00" level=info msg="Starting container: 5a98ec5d1c8971314fdd8e48cc9c6e240f43214be60655ccd9e6d679d30e03ee" id=da177da8-c72b-4306-87df-5822489a736f name=/runtime.v1.RuntimeService/StartContainer
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.277347279+01:00" level=info msg="Started container" PID=18543 containerID=5a98ec5d1c8971314fdd8e48cc9c6e240f43214be60655ccd9e6d679d30e03ee description=openshift-ovn-kubernetes/ovnkube-master-hbwrh/ovnkube-master id=da177da8-c72b-4306-87df-5822489a736f name=/runtime.v1.RuntimeService/StartContainer sandboxID=8ccd5dc1892194101430579909aa96b88c656f338217a3f037b7caa39596d08f
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.282084230+01:00" level=info msg="CNI monitoring event \"/opt/cni/bin/ovn-k8s-cni-overlay\": CREATE"
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.287254605+01:00" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.290184357+01:00" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.290200823+01:00" level=info msg="CNI monitoring event \"/opt/cni/bin/ovn-k8s-cni-overlay\": WRITE"
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.291713541+01:00" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.292704807+01:00" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.292716311+01:00" level=info msg="CNI monitoring event \"/opt/cni/bin/ovn-k8s-cni-overlay\": WRITE"
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.294226316+01:00" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Nov 02 15:47:20 machine.example.com crio[17749]: time="2022-11-02 15:47:20.295363804+01:00" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Nov 02 15:47:20 machine.example.com microshift[17970]: kubelet E1102 15:47:20.458263   17970 pod_workers.go:951] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/cni/net.d/. Has your network provider started?" pod="openshift-dns/dns-default-27mhp" podUID=e18e74e8-a5fb-485d-9cb6-a22b87049af2
Nov 02 15:47:20 machine.example.com ovs-vsctl[18654]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . "external_ids:ovn-remote=\"unix:/var/run/ovn/ovnsb_db.sock\""
Nov 02 15:47:20 machine.example.com ovs-vsctl[18655]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=10.43.140.21 external_ids:ovn-remote-probe-interval=180000 external_ids:ovn-openflow-probe-interval=180 "external_ids:hostname=\"machine.example.com\"" external_ids:ovn-monitor-all=true external_ids:ovn-ofctrl-wait-before-clear=0 external_ids:ovn-enable-lflow-cache=false external_ids:ovn-memlimit-lflow-cache-kb=870
Nov 02 15:47:20 machine.example.com ovs-vsctl[18656]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 -- clear bridge br-int netflow -- clear bridge br-int sflow -- clear bridge br-int ipfix
Nov 02 15:47:20 machine.example.com ovs-vsctl[18658]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 -- --if-exists del-port br-int k8s-machine.example -- --may-exist add-port br-int ovn-k8s-mp0 -- set interface ovn-k8s-mp0 type=internal mtu_request=1400 external-ids:iface-id=k8s-machine.example.com
Nov 02 15:47:20 machine.example.com NetworkManager[15603]: <info>  [1667400440.5332] manager: (ovn-k8s-mp0): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/10)
Nov 02 15:47:20 machine.example.com NetworkManager[15603]: <info>  [1667400440.5334] device (ovn-k8s-mp0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Nov 02 15:47:20 machine.example.com NetworkManager[15603]: <info>  [1667400440.5337] manager: (ovn-k8s-mp0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/11)
Nov 02 15:47:20 machine.example.com NetworkManager[15603]: <info>  [1667400440.5338] device (ovn-k8s-mp0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 02 15:47:20 machine.example.com kernel: device ovn-k8s-mp0 entered promiscuous mode
Nov 02 15:47:20 machine.example.com systemd-udevd[18661]: Using default interface naming scheme 'rhel-8.0'.
Nov 02 15:47:20 machine.example.com systemd-udevd[18661]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Nov 02 15:47:20 machine.example.com systemd-udevd[18661]: Could not generate persistent MAC address for ovn-k8s-mp0: No such file or directory
Nov 02 15:47:20 machine.example.com ovs-vsctl[18665]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 set interface ovn-k8s-mp0 "mac=d2\\:e6\\:78\\:0c\\:8d\\:2d"
Nov 02 15:47:20 machine.example.com NetworkManager[15603]: <info>  [1667400440.5514] device (ovn-k8s-mp0): carrier: link connected
Nov 02 15:47:20 machine.example.com ovs-vsctl[18687]: ovs|00001|db_ctl_base|ERR|no port named br-ex
Nov 02 15:47:20 machine.example.com ovs-vsctl[18695]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --timeout=15 set Open_vSwitch . external_ids:ovn-bridge-mappings=physnet:br-ex
Nov 02 15:47:21 machine.example.com NetworkManager[15603]: <info>  [1667400441.0768] manager: (patch-br-ex_machine.example.com-to-br-int): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/12)
Nov 02 15:47:21 machine.example.com NetworkManager[15603]: <info>  [1667400441.0770] device (patch-br-ex_machine.example.com-to-br-int): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Nov 02 15:47:21 machine.example.com NetworkManager[15603]: <info>  [1667400441.0773] manager: (patch-br-int-to-br-ex_machine.example.com): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/13)
Nov 02 15:47:21 machine.example.com NetworkManager[15603]: <info>  [1667400441.0773] device (patch-br-int-to-br-ex_machine.example.com): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Nov 02 15:47:21 machine.example.com NetworkManager[15603]: <info>  [1667400441.0776] manager: (patch-br-ex_machine.example.com-to-br-int): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/14)
Nov 02 15:47:21 machine.example.com NetworkManager[15603]: <info>  [1667400441.0785] manager: (patch-br-int-to-br-ex_machine.example.com): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/15)
Nov 02 15:47:21 machine.example.com NetworkManager[15603]: <info>  [1667400441.0786] device (patch-br-ex_machine.example.com-to-br-int): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 02 15:47:21 machine.example.com NetworkManager[15603]: <info>  [1667400441.0787] device (patch-br-int-to-br-ex_machine.example.com): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')

adelton commented 1 year ago

The microshift-ovs-init service sets up an OVS bridge br-ex on the node interface. It flushes the IP of the node interface and regains it on the br-ex bridge. The ssh disconnection might be caused by this network change.

So what is the way to get connected back to the host machine to actually inspect the microshift-ovs-init service? The machine no longer responds on the original IP address, even when I try a new ssh connection ...

zshi-redhat commented 1 year ago

The microshift-ovs-init service sets up an OVS bridge br-ex on the node interface. It flushes the IP of the node interface and regains it on the br-ex bridge. The ssh disconnection might be caused by this network change.

So what is the way to get connected back to the host machine to actually inspect the microshift-ovs-init service? The machine no longer responds on the original IP address, even when I try a new ssh connection ...

Unfortunately you cannot reconnect via the original IP address if br-ex cannot regain the IP address. Is there any additional host interface or virtual console that can be used to reconnect?

adelton commented 1 year ago

There is no other physical host interface and for automation purposes, using the console is not possible.

If this behaviour of the microshift service and the other services it starts (microshift-ovs-init) is expected, shouldn't the installation / setup instructions describe how to preserve access to the machine? For example, you say br-ex cannot regain the IP address. Does that mean it cannot redo the DHCP request? What IP address does it end up with, anyway? If we captured the DHCP-provided address before installing, configuring, and starting the microshift service, is there a way to "force" the same address (as a static one) onto the post-br-ex setup?
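
To make the question concrete, I have something like the following in mind, run before enabling microshift (eth0 and the "System eth0" connection name are placeholders, and the angle-bracket values would come from the current lease):

# ip -4 addr show dev eth0
# ip route show default
# nmcli -g IP4.ADDRESS,IP4.GATEWAY,IP4.DNS device show eth0
# nmcli connection modify "System eth0" ipv4.method manual ipv4.addresses <address>/<prefix> ipv4.gateway <gateway> ipv4.dns <dns>
# nmcli connection up "System eth0"

Whether the br-ex setup would then actually carry that static address over is exactly the part I'd like to see documented.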

ggiguash commented 1 year ago

@adelton, could you explain whether the machine gets a different IP address or whether all connectivity is lost?

adelton commented 1 year ago

I have no way of knowing. The machine is a remote one so my only option to figure out what is going on is to try to connect to it via ssh.

That's why I believe we need very solid documentation on preserving that initial IP address and keeping the ssh connectivity.

ggiguash commented 1 year ago

/retitle NP-646: MicroShift should not cause the host IP to change on startup

pliurh commented 1 year ago

@adelton do you know how we can reproduce the issue?

adelton commented 1 year ago

My latest tests with microshift from the @redhat-et/microshift-testing copr repo on RHEL 8.6 no longer have the problem -- I'm able to ssh to that host just fine even after the node is reported as Ready and pods are (mostly) running:

# oc get pods -A
NAMESPACE                  NAME                                  READY   STATUS             RESTARTS       AGE
openshift-dns              dns-default-xc5dw                     2/2     Running            0              12m
openshift-dns              node-resolver-5j8bj                   1/1     Running            0              12m
openshift-ingress          router-default-7c9c47d97f-ld7mc       1/1     Running            0              12m
openshift-ovn-kubernetes   ovnkube-master-fhstc                  4/4     Running            0              12m
openshift-ovn-kubernetes   ovnkube-node-kspq4                    1/1     Running            0              12m
openshift-service-ca       service-ca-66b8869cf9-n48cv           1/1     Running            0              12m
openshift-storage          topolvm-controller-78876c5fcd-kcqj9   4/4     Running            0              12m
openshift-storage          topolvm-node-lj4m4                    2/4     CrashLoopBackOff   14 (56s ago)   12m

zshi-redhat commented 1 year ago

@adelton do you agree to close this issue given it works with your latest tests?

adelton commented 1 year ago

Sure, if it is clear where/how the change of behaviour happened in the code.

zshi-redhat commented 1 year ago

Sure, if it is clear where/how the change of behaviour happened in the code.

Thanks!

zshi-redhat commented 1 year ago

/close

openshift-ci[bot] commented 1 year ago

@zshi-redhat: Closing this issue.

In response to [this](https://github.com/openshift/microshift/issues/1061#issuecomment-1419034703):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.