openshift / assisted-installer-agent


assisted installer service getting error - chronyc: error while loading shared libraries: libnettle.so.8 #385

Open pdfruth opened 2 years ago

pdfruth commented 2 years ago

I'm using the self-hosted assisted installer service to install Single Node OKD. The assisted installer service is running in podman containers, as documented here

This method of doing a single-node install of OKD used to work, but it has started to fail recently (within the last 30 days or so).

The host registers with the installer service, but gets stuck on an NTP synchronization failure, as seen in the attached screenshot.

[Screenshot: Screen Shot 2022-06-25 at 5 41 06 PM]

Looking into the pod logs of the assisted installer service, I see this message:

level=error msg="Received step reply <ntp-synchronizer-392f0f02> from infra-env <ff4ce4b9-a3cd-4c50-b258-24cfbba8d1e3> host <68b15b04-5cb1-429f-9778-3c8727d0235d> exit-code <-1> stderr <chronyc exited with non-zero exit code 127: \nchronyc: error while loading shared libraries: libnettle.so.8: cannot open shared object file: No such file or directory\n> stdout <>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2992" go-id=9762 host_id=68b15b04-5cb1-429f-9778-3c8727d0235d infra_env_id=ff4ce4b9-a3cd-4c50-b258-24cfbba8d1e3 pkg=Inventory request_id=6a4edac8-f290-4cb2-813e-f6a67ef9c50b

The relevant part of the message is: chronyc: error while loading shared libraries: libnettle.so.8: cannot open shared object file: No such file or directory

I believe the root cause is the change introduced by this commit.

The code change introduced by that commit mounts the chronyc command binary of the underlying OS (on which the assisted-installer-agent container runs) into the /usr/bin directory inside the container. In my particular instance, that host OS is Fedora CoreOS 35.20220327.3.0. The problem in this case is that the chronyc command is a dynamically linked ELF that depends on the libnettle.so.8 shared library, which isn't present in the container. The container does contain libnettle.so.6, though.
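For anyone who wants to verify this on their own host, a small throwaway program like the one below (a sketch, not part of the agent code) lists the shared libraries the bind-mounted binary requires using Go's debug/elf package; on an FCOS 35 host's chronyc the output should include libnettle.so.8, which the agent image cannot satisfy.

```go
// list-libs.go: print the shared libraries a dynamically linked ELF binary
// requires, using only the Go standard library. Illustrative only; not part
// of the assisted-installer-agent code base.
package main

import (
	"debug/elf"
	"fmt"
	"log"
	"os"
)

func main() {
	// Default to the path the agent bind-mounts the host binary to.
	path := "/usr/bin/chronyc"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}

	f, err := elf.Open(path)
	if err != nil {
		log.Fatalf("failed to open %s as an ELF file: %v", path, err)
	}
	defer f.Close()

	// ImportedLibraries returns the DT_NEEDED entries, i.e. the shared
	// libraries the dynamic linker must resolve at load time.
	libs, err := f.ImportedLibraries()
	if err != nil {
		log.Fatalf("failed to read dynamic section: %v", err)
	}
	for _, lib := range libs {
		fmt.Println(lib)
	}
}
```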

Anyway, IMO this [bind-mounting the chronyc command from the underlying OS] is a container anti-pattern.

Wouldn't it be a better approach to use the chronyc installed by dnf install chrony in the Dockerfile here, which is used to build the assisted-installer-agent container image?

@tsorya, could you have a look at the change introduced in that commit? It introduces a significant prerequisite: the shared libraries that the chronyc binary is dynamically linked against must also be present in the assisted-installer-agent container image. Is there a different approach?

pdfruth commented 2 years ago

In the meantime, I've been able to work around the error by explicitly setting AGENT_DOCKER_IMAGE: quay.io/edge-infrastructure/assisted-installer-agent:v2.4.1 when customizing the sample okd-configmap.yml file here. Note: v2.4.1 is the version of the image just prior to the commit that introduced the problem mentioned above.

For example, here is an okd-configmap.yml that works for me today:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config
data:
  ASSISTED_SERVICE_HOST: 192.168.10.2:8090
  ASSISTED_SERVICE_SCHEME: http
  AUTH_TYPE: none
  DB_HOST: 127.0.0.1
  DB_NAME: installer
  DB_PASS: admin
  DB_PORT: "5432"
  DB_USER: admin
  DEPLOY_TARGET: onprem
  DISK_ENCRYPTION_SUPPORT: "false"
  DUMMY_IGNITION: "false"
  ENABLE_SINGLE_NODE_DNSMASQ: "false"
  HW_VALIDATOR_REQUIREMENTS: '[{"version":"default","master":{"cpu_cores":4,"ram_mib":16384,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":100,"packet_loss_percentage":0},"worker":{"cpu_cores":2,"ram_mib":8192,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":1000,"packet_loss_percentage":10},"sno":{"cpu_cores":8,"ram_mib":16384,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10}}]'
  IMAGE_SERVICE_BASE_URL: http://192.168.10.2:8888
  IPV6_SUPPORT: "true"
  LISTEN_PORT: "8888"
  NTP_DEFAULT_SERVER: ""
  POSTGRESQL_DATABASE: installer
  POSTGRESQL_PASSWORD: admin
  POSTGRESQL_USER: admin
  PUBLIC_CONTAINER_REGISTRIES: 'quay.io'
  SERVICE_BASE_URL: http://192.168.10.2:8090
  STORAGE: filesystem
  OS_IMAGES: '[{"openshift_version":"4.10","cpu_architecture":"x86_64","url":"https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/35.20220327.3.0/x86_64/fedora-coreos-35.20220327.3.0-live.x86_64.iso","rootfs_url":"https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/35.20220327.3.0/x86_64/fedora-coreos-35.20220327.3.0-live-rootfs.x86_64.img","version":"35.20220327.3.0"}]'
  RELEASE_IMAGES: '[{"openshift_version":"4.10","cpu_architecture":"x86_64","url":"quay.io/openshift/okd:4.10.0-0.okd-2022-06-10-131327","version":"4.10.0-0.okd-2022-06-10-131327","default":true}]'
  OKD_RPMS_IMAGE: quay.io/vrutkovs/okd-rpms:4.10
  AGENT_DOCKER_IMAGE: quay.io/edge-infrastructure/assisted-installer-agent:v2.4.1

tsorya commented 2 years ago

Hi, the commit that you mentioned actually fixes a problem we introduced in 2.4.0, where this mount was deleted by mistake. From https://github.com/openshift/assisted-installer-agent/blob/v2.3.1/src/commands/actions/ntp_sync_cmd.go#L44 you can see that we had this mount before; it was removed by mistake in 2.4.0 and restored in 2.4.1. We have mounted chronyc since this commit https://github.com/openshift/assisted-service/commit/7ec84480c31c16cc0e34a379dc8c5a08a6311b09, which took place in Nov '21.


omertuc commented 2 years ago

> Anyway, IMO this [bind-mounting the chronyc command from the underlying OS] is a container anti-pattern. Wouldn't it be a better approach to use the chronyc installed by dnf install chrony in the Dockerfile here, which is used to build the assisted-installer-agent container image?

It's not that simple: chronyc inside the agent container communicates over a UDS socket mount with the host operating system's non-containerized chronyd daemon, so we would just be moving the problem from "host<->container shared-library incompatibilities" to "chronyc<->chronyd socket API incompatibilities across versions". Sadly, the former affects OKD users, while the latter affects (or at least used to affect; maybe it has been solved with recent RHCOS versions) upstream OCP Assisted Installer agent users. I think there is no "right" answer between those two options; they're both bound to break (and have in the past). We simply chose to solve the latter due to a user complaint a while ago, but we did so in a problematic manner (the mount), creating this issue for OKD users.
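To make the trade-off concrete, here is a rough sketch of the two options (illustrative only; the socket path, flags, and the exact podman invocation below are assumptions, not the agent's real code). Either way, the step has to mount something host-specific, because chronyc must reach the host's chronyd over its Unix domain socket:

```go
// Illustrative sketch of the two options discussed above. The socket path,
// image, and chronyc flags are assumptions for illustration only.
package main

import (
	"fmt"
	"strings"
)

// chronycStepArgs builds hypothetical podman arguments for the
// ntp-synchronizer step.
func chronycStepArgs(image string, mountHostBinary bool) []string {
	args := []string{"run", "--rm"}

	// In both options, chronyc must talk to the host's non-containerized
	// chronyd, so the daemon's Unix domain socket directory comes from the host.
	args = append(args, "-v", "/var/run/chrony:/var/run/chrony")

	if mountHostBinary {
		// Current behaviour: reuse the host's chronyc so it always matches the
		// host chronyd's socket API, at the cost of dragging in the host's
		// shared-library requirements (e.g. libnettle.so.8 on FCOS 35).
		args = append(args, "-v", "/usr/bin/chronyc:/usr/bin/chronyc")
	}
	// Otherwise the chronyc baked into the image via `dnf install chrony` is
	// used, which may instead disagree with the host daemon's socket API.

	return append(args, image, "chronyc", "-n", "sources")
}

func main() {
	cmd := chronycStepArgs("quay.io/edge-infrastructure/assisted-installer-agent:latest", true)
	fmt.Println("podman", strings.Join(cmd, " "))
}
```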

But we can do something else: ideally, the solution here would be to disable the host's chronyd systemd service and run an equivalent, containerized chronyd service, but that's a big change. We should probably consider it.

omertuc commented 2 years ago

Temporarily, as a workaround, we can solve it by not doing the bind mount when running on top of FCOS.
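Something along these lines, as a minimal sketch (assuming the check keys off ID and VARIANT_ID in the host's /etc/os-release; the actual change may look different):

```go
// Sketch only: skip the host chronyc bind mount on Fedora CoreOS hosts.
// Assumes /etc/os-release is readable where the agent runs; this is not the
// actual implementation of the workaround.
package main

import (
	"fmt"
	"os"
	"strings"
)

// isFedoraCoreOS reports whether the host identifies itself as Fedora CoreOS
// (ID=fedora, VARIANT_ID=coreos) in /etc/os-release.
func isFedoraCoreOS() (bool, error) {
	data, err := os.ReadFile("/etc/os-release")
	if err != nil {
		return false, err
	}
	osRelease := string(data)
	return strings.Contains(osRelease, "ID=fedora") &&
		strings.Contains(osRelease, "VARIANT_ID=coreos"), nil
}

func main() {
	fcos, err := isFedoraCoreOS()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot detect host OS:", err)
		os.Exit(1)
	}
	if fcos {
		fmt.Println("FCOS host: skip the /usr/bin/chronyc bind mount and use the image's chronyc")
	} else {
		fmt.Println("non-FCOS host: keep bind-mounting the host's chronyc")
	}
}
```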

omertuc commented 2 years ago

Created https://issues.redhat.com/browse/MGMT-10937 to track the workaround / solution

omertuc commented 2 years ago

cc @vrutkovs

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

omertuc commented 1 year ago

/lifecycle frozen