openstack-k8s-operators / edpm-ansible

External Dataplane Management Ansible Playbooks
https://openstack-k8s-operators.github.io/edpm-ansible/
Apache License 2.0

Don't use log files on EDPM nodes for networking related services #726

Closed cubeek closed 2 months ago

cubeek commented 2 months ago

Neutron services were changed only on the OCP side as part of [1], but it was forgotten to change the EDPM side too; that is tracked by [2]. This patch series removes logging to files and relies only on journald.

[1] https://issues.redhat.com/browse/OSPRH-154
[2] https://issues.redhat.com/browse/OSPRH-9248
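
For illustration only (the container name is an assumption; the authoritative change is in the patch series itself), one way to confirm on an EDPM node that an agent has no file logging and that its output reaches the journal via podman's journald log driver:

```bash
# Hedged sketch: verify the container logs via journald rather than a file.
# "ovn_metadata_agent" is an assumed container name; adjust per agent.
sudo podman inspect ovn_metadata_agent --format '{{ .HostConfig.LogConfig.Type }}'   # expect: journald
sudo podman logs --since 10m ovn_metadata_agent | tail -n 5                          # the same records live in the journal
```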

openshift-ci[bot] commented 2 months ago

Hi @cubeek. Thanks for your PR.

I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
cubeek commented 2 months ago

recheck

softwarefactory-project-zuul[bot] commented 2 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/f42cc8a5a4d44941b38fb11fddc95a46

:heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 2h 33m 01s
:x: podified-multinode-edpm-deployment-crc RETRY_LIMIT in 17m 29s
:x: cifmw-crc-podified-edpm-baremetal FAILURE in 2h 15m 49s
:heavy_check_mark: edpm-ansible-molecule-edpm_bootstrap SUCCESS in 7m 13s
:heavy_check_mark: edpm-ansible-molecule-edpm_podman SUCCESS in 6m 49s
:heavy_check_mark: edpm-ansible-molecule-edpm_module_load SUCCESS in 5m 09s
:heavy_check_mark: edpm-ansible-molecule-edpm_kernel SUCCESS in 12m 14s
:heavy_check_mark: edpm-ansible-molecule-edpm_libvirt SUCCESS in 10m 02s
:heavy_check_mark: edpm-ansible-molecule-edpm_nova SUCCESS in 11m 24s
:heavy_check_mark: edpm-ansible-molecule-edpm_frr SUCCESS in 7m 47s
:x: edpm-ansible-molecule-edpm_iscsid FAILURE in 4m 41s
:heavy_check_mark: edpm-ansible-molecule-edpm_ovn_bgp_agent SUCCESS in 7m 55s
:heavy_check_mark: edpm-ansible-molecule-edpm_ovs SUCCESS in 11m 46s
:heavy_check_mark: edpm-ansible-molecule-edpm_tripleo_cleanup SUCCESS in 3m 41s
:x: adoption-standalone-to-crc-ceph-provider FAILURE in 1h 42m 59s

booxter commented 2 months ago

/ok-to-test

booxter commented 2 months ago

recheck

booxter commented 2 months ago

/lgtm

rlandy commented 2 months ago

recheck

raukadah commented 2 months ago

recheck after dequeuing stuck zuul jobs

softwarefactory-project-zuul[bot] commented 2 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a718d628679c437487b48302eca7c7a4

:heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 3h 27m 27s
:heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 42m 40s
:heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 51m 00s
:heavy_check_mark: edpm-ansible-molecule-edpm_bootstrap SUCCESS in 6m 18s
:heavy_check_mark: edpm-ansible-molecule-edpm_podman SUCCESS in 6m 10s
:heavy_check_mark: edpm-ansible-molecule-edpm_module_load SUCCESS in 5m 13s
:heavy_check_mark: edpm-ansible-molecule-edpm_kernel SUCCESS in 11m 20s
:heavy_check_mark: edpm-ansible-molecule-edpm_libvirt SUCCESS in 9m 39s
:heavy_check_mark: edpm-ansible-molecule-edpm_nova SUCCESS in 10m 53s
:heavy_check_mark: edpm-ansible-molecule-edpm_frr SUCCESS in 7m 08s
:heavy_check_mark: edpm-ansible-molecule-edpm_iscsid SUCCESS in 5m 06s
:heavy_check_mark: edpm-ansible-molecule-edpm_ovn_bgp_agent SUCCESS in 7m 39s
:heavy_check_mark: edpm-ansible-molecule-edpm_ovs SUCCESS in 12m 11s
:heavy_check_mark: edpm-ansible-molecule-edpm_tripleo_cleanup SUCCESS in 4m 22s
:x: adoption-standalone-to-crc-ceph-provider FAILURE in 3h 12m 02s

booxter commented 2 months ago

The job failed with

  - lastTransitionTime: "2024-08-16T10:07:37Z"
    message: 'DataPlaneNodeSet error occurred spec.tlsEnabled: Forbidden: TLS settings
      on Data Plane node set and Control Plane openstack do not match, Node set: false
      Control Plane Ingress: true Control Plane PodLevel: false'
    reason: Error
    severity: Error
    status: "False"
    type: SetupReady

This doesn't strike me as related to this PR.
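
For reference, the mismatch in that message is between the node set's spec.tlsEnabled and the control plane's TLS settings; a purely illustrative way to align them (the node set name openstack-edpm is a placeholder, and this is unrelated to the present PR) could be:

```bash
# Hypothetical sketch: enable TLS on the data plane node set to match the control plane.
oc patch openstackdataplanenodeset openstack-edpm \
  --type merge -p '{"spec": {"tlsEnabled": true}}'
```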

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cubeek, slawqo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openstack-k8s-operators/edpm-ansible/blob/main/OWNERS)~~ [slawqo]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
booxter commented 2 months ago
  [WARNING]: ERROR: Can't pull image quay.io/podified-antelope-centos9/openstack-ceilometer-compute:current-podified
  stderr: Trying to pull quay.io/podified-antelope-centos9/openstack-ceilometer-compute:current-podified...
  Getting image source signatures
  Copying blob sha256:c5c3accfc3e7c941c61b471205eefeb8067ec0ee44f7c399be64cd698b8e60e0
  Copying blob sha256:c90e33e6a3a48417a06fbf7cae27a9c755e64d19ca2ebd207141a589a8d411b0
  Copying blob sha256:f14dad47a5d0864a97205e0b14878d920d04593a065a9ef212473bea3fb4cef1
  Copying blob sha256:c143e920b7d46a7c824661f29329eca9a37bffc12ea1cfcad2aab5538545eeed
  Copying blob sha256:e8b54c863393b7a8216a71737771b73b16a8e43ec7acf927c7faa175c255e551
  Error: reading blob sha256:f14dad47a5d0864a97205e0b14878d920d04593a065a9ef212473bea3fb4cef1: fetching blob: received unexpected HTTP status: 502 Bad Gateway
booxter commented 2 months ago

recheck

softwarefactory-project-zuul[bot] commented 2 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c3ac5313e63a4551995fbad69664917e

:heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 4h 46m 01s
:heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 43m 17s
:heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 58m 54s
:heavy_check_mark: edpm-ansible-molecule-edpm_bootstrap SUCCESS in 6m 03s
:heavy_check_mark: edpm-ansible-molecule-edpm_podman SUCCESS in 5m 11s
:heavy_check_mark: edpm-ansible-molecule-edpm_module_load SUCCESS in 4m 26s
:heavy_check_mark: edpm-ansible-molecule-edpm_kernel SUCCESS in 8m 13s
:heavy_check_mark: edpm-ansible-molecule-edpm_libvirt SUCCESS in 7m 44s
:heavy_check_mark: edpm-ansible-molecule-edpm_nova SUCCESS in 8m 33s
:heavy_check_mark: edpm-ansible-molecule-edpm_frr SUCCESS in 6m 05s
:heavy_check_mark: edpm-ansible-molecule-edpm_iscsid SUCCESS in 4m 14s
:heavy_check_mark: edpm-ansible-molecule-edpm_ovn_bgp_agent SUCCESS in 6m 34s
:heavy_check_mark: edpm-ansible-molecule-edpm_ovs SUCCESS in 11m 45s
:heavy_check_mark: edpm-ansible-molecule-edpm_tripleo_cleanup SUCCESS in 3m 49s
:x: adoption-standalone-to-crc-ceph-provider FAILURE in 2h 48m 06s

fao89 commented 2 months ago

recheck

karelyatin commented 2 months ago

Needs to be rebased on main to get rid of the edpm_tuned failures.

Also, no logs are collected after the move (https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/service_logs/), so that should also be identified/fixed before merging this.

@SeanMooney @jpodivin maybe you know what is causing this (logs not collected with journalctl -u) and if something is still missing in the patch?

SeanMooney commented 2 months ago

which logs are missing

the service_logs directory is some additional custom logging i added for the ci jobs in the ci framework to grab the logs for systemd services related to openstack

systemctl list-units | awk '/virt|edpm/ {print $1}' | grep -v sys | xargs -I {} sudo bash -c 'journalctl -u {} > /tmp/{{ host_ip }}/service_logs/{}.log'

https://github.com/openstack-k8s-operators/ci-framework/blob/a191b190b56bd28c47359d34feb51b9907dcf0fb/roles/artifacts/tasks/edpm.yml#L88

so as long as the neutron systemd service units are correctly prefixed with edpm, this should work for neutron

since ovs is host-installed like libvirt you will need to add ovs to the regex like I did for libvirt.
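
As an illustration, the adjusted collection command with ovs added to that filter might look like this (same {{ host_ip }} templating as the linked ci-framework task):

```bash
# Hedged sketch: extend the awk filter so host-installed ovs units are collected too.
systemctl list-units | awk '/virt|ovs|edpm/ {print $1}' | grep -v sys \
  | xargs -I {} sudo bash -c 'journalctl -u {} > /tmp/{{ host_ip }}/service_logs/{}.log'
```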

this is not how the logs will be gathered for customers, however; there the logs will be gathered via must_gather and sos reports

the ovn logs do have some output

https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/service_logs/edpm_ovn_controller.service.log

https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/service_logs/edpm_ovn_metadata_agent.service.log

but it's mostly empty

if i look at https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/log/messages

we do see ovn_controller log messages

the ci job is not configured to unpack the sosreports but normally you would find those logs in https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/sos-reports/_all_nodes/sosreport-edpm-compute-0.tar.xz
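
A hedged aside for anyone reading along: the captured journal can be read from the sos archive locally, assuming the usual sos layout (paths may differ between sos versions):

```bash
# Unpack the must-gather sosreport and read the captured journal output.
tar -xf sosreport-edpm-compute-0.tar.xz
less sosreport-edpm-compute-0*/sos_commands/logs/journalctl_--no-pager
```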

ci-framework-data/logs/192.168.122.100/log/ is a secondary way to look at that, which predates must gather.

my guess is ovn is currently configured to log directly to the journal instead of standard out and as a result it is not logging with the edpm prefix and not being gathered in service_logs

I do not see ovs logs in var/log/messages but it looks like it's still configured to log to a file based on the first line

2024-08-20T15:36:39.179Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log

https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/log/openvswitch/ovs-vswitchd.log
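
If/when the ovs side is tackled, one possible approach (purely illustrative, not part of this PR) is to turn off vswitchd's file destination in favour of syslog, which lands in the journal, via ovs-appctl:

```bash
# Hedged sketch: disable file logging for ovs-vswitchd and keep syslog.
sudo ovs-appctl vlog/set file:off
sudo ovs-appctl vlog/set syslog:info
sudo ovs-appctl vlog/list | head   # confirm per-module destinations
```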

karelyatin commented 2 months ago

which logs are missing

Sorry, I should have been more explicit with the log links, but I was mainly looking at the empty logs in edpm_ovn_controller.service.log, edpm_ovn_metadata_agent.service.log etc., which you spotted below.

the service_logs directory is some additional custom logging i added for the ci jobs in the ci framework to grab the logs for systemd services related to openstack

systemctl list-units | awk '/virt|edpm/ {print $1}' | grep -v sys | xargs -I {} sudo bash -c 'journalctl -u {} > /tmp/{{ host_ip }}/service_logs/{}.log'

Isn't that also the expected way to look for logs when checking live on the system? I normally do journalctl -u edpm_nova_compute and expected the same would work for the neutron agents as well with this patch. journalctl -u edpm_ovn_controller returns just service startup entries with or without this patch.

https://github.com/openstack-k8s-operators/ci-framework/blob/a191b190b56bd28c47359d34feb51b9907dcf0fb/roles/artifacts/tasks/edpm.yml#L88

so as long as the neutron systemd service units are correctly prefixed with edpm, this should work for neutron

systemd services are prefixed with edpm_ already and that's why we see those collected in service_logs, but the issue is that those are mostly empty.

since ovs is host-installed like libvirt you will need to add ovs to the regex like I did for libvirt.

Ok thx, that could be checked separately. This patch is not touching any ovs bits.

this is not how the logs will be gathered for customers, however; there the logs will be gathered via must_gather and sos reports

the ovn logs do have some output

https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/service_logs/edpm_ovn_controller.service.log

https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/service_logs/edpm_ovn_metadata_agent.service.log

but it's mostly empty

Yes, trying to understand this part: why are the logs not visible here, and what additional changes are required for this?

if i look at https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/log/messages

we do see ovn_controller log messages

the ci job is not configured to unpack the sosreports but normally you would find those logs in https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/sos-reports/_all_nodes/sosreport-edpm-compute-0.tar.xz

ci-framework-data/logs/192.168.122.100/log/ is a secondary way to look at that, which predates must gather.

my guess is ovn is currently configured to log directly to the journal instead of standard out and as a result it is not logging with the edpm prefix and not being gathered in service_logs

Need to see what needs to be changed to log it properly. Any idea?

I do not see ovs logs in var/log/messages but it looks like it's still configured to log to a file based on the first line

2024-08-20T15:36:39.179Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log

Yes right, @cubeek you planned to handle the ovs bits in a separate patch?

https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/log/openvswitch/ovs-vswitchd.log

karelyatin commented 2 months ago

Ok, I was able to reproduce it locally with the ansibleee image from this PR: quay.rdoproject.org/openstack-k8s-operators/openstack-ansibleee-runner:734a871f2663ecd078f917212b17c5e84b16b191

What I observed is that after restarting the service, the logs are visible with journalctl -u.

[zuul@networker-0 ~]$ sudo journalctl -u edpm_ovn_metadata_agent
Aug 27 07:23:40 networker-0 systemd[1]: Starting ovn_metadata_agent container...
Aug 27 07:23:40 networker-0 edpm-start-podman-container[20550]: Creating additional drop-in dependency for "ovn_metadata_agent" (2662e5>
Aug 27 07:23:41 networker-0 systemd[1]: Started ovn_metadata_agent container.

[zuul@networker-0 ~]$ ps -eaf|grep metadata root 19938 1 0 07:23 ? 00:00:00 /usr/bin/conmon --api-version 1 -c 2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4 -u 2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4/userdata -p /run/containers/storage/overlay-containers/2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4/userdata/pidfile -n ovn_metadata_agent --exit-dir /run/libpod/exits --full-attach -s -l journald --log-level info --syslog --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4/userdata/oci-log --conmon-pidfile /run/ovn_metadata_agent.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg info --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg sqlite --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4 root 19942 19940 0 07:23 ? 00:00:04 neutron-ovn-metadata-agent (/usr/bin/python3 /usr/bin/neutron-ovn-metadata-agent) root 20417 19942 0 07:23 ? 00:00:00 neutron-ovn-metadata-agent (/usr/bin/python3 /usr/bin/neutron-ovn-metadata-agent) root 20418 19942 0 07:23 ? 00:00:00 neutron-ovn-metadata-agent (/usr/bin/python3 /usr/bin/neutron-ovn-metadata-agent) zuul 21137 20841 0 07:51 pts/0 00:00:00 grep --color=auto metadata [zuul@networker-0 ~]$ [zuul@networker-0 ~]$ sudo systemctl restart edpm_ovn_metadata_agent

[zuul@networker-0 ~]$ ps -eaf|grep metadata root 21199 1 0 07:51 ? 00:00:00 /usr/bin/conmon --api-version 1 -c 2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4 -u 2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4/userdata -p /run/containers/storage/overlay-containers/2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4/userdata/pidfile -n ovn_metadata_agent --exit-dir /run/libpod/exits --full-attach -s -l journald --log-level warning --syslog --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4/userdata/oci-log --conmon-pidfile /run/ovn_metadata_agent.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg sqlite --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 2662e5467da4310d27edb246057a84bca83216019ce35b1353abb44ea59267a4 root 21203 21201 31 07:51 ? 00:00:03 neutron-ovn-metadata-agent (/usr/bin/python3 /usr/bin/neutron-ovn-metadata-agent) root 21261 21203 0 07:51 ? 00:00:00 neutron-ovn-metadata-agent (/usr/bin/python3 /usr/bin/neutron-ovn-metadata-agent) root 21262 21203 0 07:51 ? 00:00:00 neutron-ovn-metadata-agent (/usr/bin/python3 /usr/bin/neutron-ovn-metadata-agent) zuul 21271 20841 0 07:51 pts/0 00:00:00 grep --color=auto metadata [zuul@networker-0 ~]$ sudo journalctl -u edpm_ovn_metadata_agent Aug 27 07:23:40 networker-0 systemd[1]: Starting ovn_metadata_agent container... Aug 27 07:23:40 networker-0 edpm-start-podman-container[20550]: Creating additional drop-in dependency for "ovn_metadata_agent" (2662e5> Aug 27 07:23:41 networker-0 systemd[1]: Started ovn_metadata_agent container. Aug 27 07:51:43 networker-0 systemd[1]: Stopping ovn_metadata_agent container... Aug 27 07:51:44 networker-0 podman[21142]: 2024-08-27 07:51:44.142577336 +0000 UTC m=+0.873332196 container died 2662e5467da4310d27edb2> Aug 27 07:51:44 networker-0 podman[21142]: 2024-08-27 07:51:44.18973634 +0000 UTC m=+0.920491170 container cleanup 2662e5467da4310d27ed> Aug 27 07:51:44 networker-0 podman[21142]: ovn_metadata_agent Aug 27 07:51:44 networker-0 podman[21170]: ovn_metadata_agent Aug 27 07:51:44 networker-0 systemd[1]: edpm_ovn_metadata_agent.service: Deactivated successfully. Aug 27 07:51:44 networker-0 systemd[1]: Stopped ovn_metadata_agent container. Aug 27 07:51:44 networker-0 systemd[1]: Starting ovn_metadata_agent container... 
Aug 27 07:51:44 networker-0 podman[21184]: 2024-08-27 07:51:44.39492599 +0000 UTC m=+0.089644466 container init 2662e5467da4310d27edb24> Aug 27 07:51:44 networker-0 podman[21184]: 2024-08-27 07:51:44.401012455 +0000 UTC m=+0.095730914 container start 2662e5467da4310d27edb> Aug 27 07:51:44 networker-0 edpm-start-podman-container[21184]: ovn_metadata_agent Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + sudo -E kolla_set_configs Aug 27 07:51:44 networker-0 edpm-start-podman-container[21183]: Creating additional drop-in dependency for "ovn_metadata_agent" (2662e5> Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Loading config file at /var/lib/kolla/config_files/config.json Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Validating config file Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Kolla config strategy set to: COPY_ALWAYS Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Copying service configuration files Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Deleting /etc/neutron/rootwrap.conf Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Copying /etc/neutron.conf.d/01-rootwrap.conf to /etc/neutron/rootw> Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /etc/neutron/rootwrap.conf Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Writing out command to execute Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/kill_scripts Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/ovn-metadata-proxy Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/external Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/ovn_metadata_haproxy_wrapp> Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/metadata_proxy Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/kill_scripts/haproxy-kill Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/external/pids Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: ++ cat /run_command Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + CMD=neutron-ovn-metadata-agent [zuul@networker-0 ~]$ sudo journalctl -u edpm_ovn_metadata_agent Aug 27 07:23:40 networker-0 systemd[1]: Starting ovn_metadata_agent container... Aug 27 07:23:40 networker-0 edpm-start-podman-container[20550]: Creating additional drop-in dependency for "ovn_metadata_agent" (2662e5> Aug 27 07:23:41 networker-0 systemd[1]: Started ovn_metadata_agent container. Aug 27 07:51:43 networker-0 systemd[1]: Stopping ovn_metadata_agent container... Aug 27 07:51:44 networker-0 podman[21142]: 2024-08-27 07:51:44.142577336 +0000 UTC m=+0.873332196 container died 2662e5467da4310d27edb2> Aug 27 07:51:44 networker-0 podman[21142]: 2024-08-27 07:51:44.18973634 +0000 UTC m=+0.920491170 container cleanup 2662e5467da4310d27ed> Aug 27 07:51:44 networker-0 podman[21142]: ovn_metadata_agent Aug 27 07:51:44 networker-0 podman[21170]: ovn_metadata_agent Aug 27 07:51:44 networker-0 systemd[1]: edpm_ovn_metadata_agent.service: Deactivated successfully. 
Aug 27 07:51:44 networker-0 systemd[1]: Stopped ovn_metadata_agent container. Aug 27 07:51:44 networker-0 systemd[1]: Starting ovn_metadata_agent container... Aug 27 07:51:44 networker-0 podman[21184]: 2024-08-27 07:51:44.39492599 +0000 UTC m=+0.089644466 container init 2662e5467da4310d27edb24> Aug 27 07:51:44 networker-0 podman[21184]: 2024-08-27 07:51:44.401012455 +0000 UTC m=+0.095730914 container start 2662e5467da4310d27edb> Aug 27 07:51:44 networker-0 edpm-start-podman-container[21184]: ovn_metadata_agent Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + sudo -E kolla_set_configs Aug 27 07:51:44 networker-0 edpm-start-podman-container[21183]: Creating additional drop-in dependency for "ovn_metadata_agent" (2662e5> Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Loading config file at /var/lib/kolla/config_files/config.json Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Validating config file Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Kolla config strategy set to: COPY_ALWAYS Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Copying service configuration files Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Deleting /etc/neutron/rootwrap.conf Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Copying /etc/neutron.conf.d/01-rootwrap.conf to /etc/neutron/rootw> Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /etc/neutron/rootwrap.conf Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Writing out command to execute Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/kill_scripts Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/ovn-metadata-proxy Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/external Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/ovn_metadata_haproxy_wrapp> Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/metadata_proxy Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/kill_scripts/haproxy-kill Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: INFO:main:Setting permission for /var/lib/neutron/external/pids Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: ++ cat /run_command Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + CMD=neutron-ovn-metadata-agent Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + ARGS= Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + sudo kolla_copy_cacerts Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + [[ ! -n '' ]] Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + . kolla_extend_start Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: Running command: 'neutron-ovn-metadata-agent' Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + echo 'Running command: '\''neutron-ovn-metadata-agent'\''' Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + umask 0022 Aug 27 07:51:44 networker-0 ovn_metadata_agent[21199]: + exec neutron-ovn-metadata-agent Aug 27 07:51:44 networker-0 systemd[1]: Started ovn_metadata_agent container. 
Aug 27 07:51:47 networker-0 ovn_metadata_agent[21199]: 2024-08-27 07:51:47.425 21203 INFO neutron.common.config [-] Logging enabled!
Aug 27 07:51:47 networker-0 ovn_metadata_agent[21199]: 2024-08-27 07:51:47.425 21203 INFO neutron.common.config [-] /usr/bin/neutron-ov>
Aug 27 07:51:47 networker-0 ovn_metadata_agent[21199]: 2024-08-27 07:51:47.425 21203 DEBUG neutron.common.config [-] command line: /usr>
Aug 27 07:51:47 networker-0 ovn_metadata_agent[21199]: 2024-08-27 07:51:47.426 21203 DEBUG neutron.agent.ovn.metadata_agent [-] ***>
Aug 27 07:51:47 networker-0 ovn_metadata_agent[21199]: 2024-08-27 07:51:47.426 21203 DEBUG neutron.agent.ovn.metadata_agent [-] Configu>
Aug 27 07:51:47 networker-0 ovn_metadata_agent[21199]: 2024-08-27 07:51:47.426 21203 DEBUG neutron.agent.ovn.metadata_agent [-] command>

Spotted a diff after the restart of the service for the conmon process: --log-level info --exit-command-arg info vs --log-level warning --exit-command-arg warning.

nova_compute is also impacted, but the issue didn't become visible as it got restarted during the deployment itself with handlers [1]. Other services running on EDPM nodes might also be impacted if not restarted during deployment; I see similar behavior for edpm_iscsid/edpm_logrotate_crond. Still don't know why the behavior is seen on first start and vanishes after a service restart. Locally logrotate_crond is not restarted but logs correctly [3]; in the CI logs [2] it does not, so it looks like some kind of race.

[1] https://github.com/search?q=repo%3Aopenstack-k8s-operators%2Fedpm-ansible%20%22Restart%20Nova%22&type=code
[2] https://logserver.rdoproject.org/26/726/aef38dab485fc18092f303d9a76c4b99f0b7af2a/github-check/cifmw-crc-podified-edpm-baremetal/88a8a02/ci-framework-data/logs/192.168.122.100/service_logs/edpm_logrotate_crond.service.log
[3]
  [zuul@networker-0 ~]$ sudo journalctl -u edpm_logrotate_crond
  Aug 27 07:20:12 networker-0 systemd[1]: Starting logrotate_crond container...
  Aug 27 07:20:12 networker-0 podman[1532]: 2024-08-27 07:20:12.818086548 +0000 UTC m=+0.185700262 system refresh
  Aug 27 07:20:12 networker-0 podman[1532]: 2024-08-27 07:20:12.916732967 +0000 UTC m=+0.284346635 container init 708b0af590a3ff9c5dea0a0>
  Aug 27 07:20:12 networker-0 podman[1532]: 2024-08-27 07:20:12.923712705 +0000 UTC m=+0.291326367 container start 708b0af590a3ff9c5dea0a>
  Aug 27 07:20:12 networker-0 podman[1532]: logrotate_crond
  Aug 27 07:20:12 networker-0 systemd[1]: Started logrotate_crond container.
  Aug 27 07:20:12 networker-0 logrotate_crond[1650]: + sudo -E kolla_set_configs
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Validating config file
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Copying service configuration files
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Deleting /etc/logrotate-crond.conf
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/logrotate-crond.conf to /e>
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Deleting /var/spool/cron/root
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Copying /var/lib/kolla/config_files/src/var/spool/cron/root to /var/sp>
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: INFO:__main__:Writing out command to execute
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: ++ cat /run_command
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: + CMD='/usr/sbin/crond -s -n'
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: + ARGS=
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: + sudo kolla_copy_cacerts
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: + [[ ! -n '' ]]
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: + . kolla_extend_start
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: Running command: '/usr/sbin/crond -s -n'
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: + echo 'Running command: '\''/usr/sbin/crond -s -n'\'''
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: + umask 0022
  Aug 27 07:20:13 networker-0 logrotate_crond[1650]: + exec /usr/sbin/crond -s -n

SeanMooney commented 2 months ago

Glad you found the issue. Yes, I intentionally used a handler to restart the nova compute container anytime there is a config change or definition change for the container, to ensure we are always running with the correct content.

journalctl -u is the expected way to view live logs (i normally do journalctl -f -n 1000 -u )

but it makes sense that if the container was not restarted then the change did not take effect yet.

cubeek commented 2 months ago

To my understanding the deploying tool doesn't respect the settings, is that correct? Does it affect all services, not just Neutron? I will report a bug for it

cubeek commented 2 months ago

Reported https://issues.redhat.com/browse/OSPRH-9723 for the starter

karelyatin commented 2 months ago

Glad you found the issue. Yes, I intentionally used a handler to restart the nova compute container anytime there is a config change or definition change for the container, to ensure we are always running with the correct content.

@SeanMooney ok so that restart here is basically hiding this behavior for nova_compute. As with a fresh deployment nova_compute is started with up-to-date config and immediately it also restarts, which shouldn't be required with a fresh setup. Also observed nova_compute is restarted on rerunning the deployment without any change due to these current handlers, but that's a separate issue and can be checked separately.

journalctl -u is the expected way to view live logs (i normally do journalctl -f -n 1000 -u )

Ok so this interface will not work for network agents with the current state of this PR unless and until we have some fix or workaround (podman stop or restart or something else) applied. Is it ok to get this merged without any workaround applied, or add a workaround similar to nova and later clean up as part of the issue reported by Jakub? I'm not sure if something needs to be done on the systemd or podman or conmon side to get this behavior resolved.

but it makes sense that if the container was not restarted then the change did not take effect yet.

Config-wise everything is in place; it's happening due to how we're setting things up.

So ovn_controller and metadata agents are correctly logging to stdout and the journal, but as the correct fields are not set, these will not be visible with journalctl -u or systemctl status. Also, I don't know how the centralized logging Epic will actually collect the logs from EDPM nodes; if it filters based on the systemd unit name, these logs will get missed.

SeanMooney commented 2 months ago

the handler should only run if specific tasks were marked as changed.

on first run it will restart them because we wrote the configs, but I don't want to remove that restart because (1) it's fast and (2) the handlers approach is more consistent in general.

I previously proposed adopting that pattern for all services. I'm not sure if others took that advice or not.

cubeek commented 2 months ago

the handler should only run if specific tasks were marked as changed.

on first run it will restart them because we wrote the configs, but I don't want to remove that restart because (1) it's fast and (2) the handlers approach is more consistent in general.

I previously proposed adopting that pattern for all services. I'm not sure if others took that advice or not.

This is probably not the best place to discuss this, but since we're talking about it, what is the reason again for restarting services when we deploy them? It doesn't sound optimal, as we already know the config settings prior to creating the containers running the services, so why restart the services? Is it because some config changes are figured out after the service was started?

SeanMooney commented 2 months ago

the handler should only run if specific tasks were marked as changed. on first run it will restart them because we wrote the configs, but I don't want to remove that restart because (1) it's fast and (2) the handlers approach is more consistent in general. I previously proposed adopting that pattern for all services. I'm not sure if others took that advice or not.

This is probably not the best place to discuss this, but since we're talking about it, what is the reason again for restarting services when we deploy them? It doesn't sound optimal, as we already know the config settings prior to creating the containers running the services, so why restart the services? Is it because some config changes are figured out after the service was started?

we only restart the service if the config or container definition changes. we do not want to have a different code path for the first deploy and any subsequent execution of the role, to reduce bugs like the one found for the neutron/ovn containers. as a result it's a known expectation that for greenfield we will have one additional restart, because on the very first deployment we update the config by creating it for the first time.

this is a very specific edge case that I intentionally did not want to optimise for, as it pessimizes every other deployment after the very first one.

perhaps a little more context is helpful

https://github.com/openstack-k8s-operators/edpm-ansible/blob/06f28cf2f0b74292431b617b9599cc210d4e5470/roles/edpm_nova/tasks/configure.yml#L139-L177

we only restart if the tasks that are explicitly marked with

notify:
    - Restart nova

actually have the changed status.

so if the content of the config is unchanged it will return OK and no restart is done.

it would be suboptimal to complicate the code to cater for the first install just to avoid a trivial systemd service restart

https://github.com/openstack-k8s-operators/edpm-ansible/blob/06f28cf2f0b74292431b617b9599cc210d4e5470/roles/edpm_nova/handlers/main.yml#L16-L21
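
For readers unfamiliar with the pattern, a minimal self-contained sketch of what is being described (file names, paths and the template are illustrative, not the literal edpm_nova tasks; the unit name follows the edpm_nova_compute service discussed above):

```yaml
# tasks (sketch): only a template run that reports "changed" notifies the handler
- name: Render nova compute config
  ansible.builtin.template:
    src: 01-nova.conf.j2
    dest: /var/lib/openstack/config/nova/01-nova.conf
    mode: "0640"
  notify:
    - Restart nova

# handlers (sketch): runs once at the end of the play, and only if notified
- name: Restart nova
  become: true
  ansible.builtin.systemd:
    name: edpm_nova_compute.service
    state: restarted
```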

softwarefactory-project-zuul[bot] commented 2 months ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/300d32c39a464164ab3f32ec3c3f173b

:heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 1h 42m 29s
:heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 15m 29s
:heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 29m 12s
:heavy_check_mark: edpm-ansible-molecule-edpm_bootstrap SUCCESS in 7m 08s
:heavy_check_mark: edpm-ansible-molecule-edpm_podman SUCCESS in 5m 53s
:heavy_check_mark: edpm-ansible-molecule-edpm_module_load SUCCESS in 5m 11s
:heavy_check_mark: edpm-ansible-molecule-edpm_kernel SUCCESS in 10m 08s
:heavy_check_mark: edpm-ansible-molecule-edpm_libvirt SUCCESS in 10m 09s
:heavy_check_mark: edpm-ansible-molecule-edpm_nova SUCCESS in 10m 20s
:heavy_check_mark: edpm-ansible-molecule-edpm_frr SUCCESS in 6m 56s
:heavy_check_mark: edpm-ansible-molecule-edpm_iscsid SUCCESS in 4m 44s
:heavy_check_mark: edpm-ansible-molecule-edpm_ovn_bgp_agent SUCCESS in 7m 48s
:heavy_check_mark: edpm-ansible-molecule-edpm_ovs SUCCESS in 12m 23s
:heavy_check_mark: edpm-ansible-molecule-edpm_tripleo_cleanup SUCCESS in 4m 00s
:heavy_check_mark: edpm-ansible-molecule-edpm_tuned SUCCESS in 6m 08s
:x: adoption-standalone-to-crc-ceph-provider RETRY_LIMIT in 22m 35s

cubeek commented 2 months ago

recheck

booxter commented 2 months ago

/lgtm

karelyatin commented 2 months ago

/cherry-pick 18.0.0-proposed

openshift-cherrypick-robot commented 2 months ago

@karelyatin: new pull request created: #742

In response to [this](https://github.com/openstack-k8s-operators/edpm-ansible/pull/726#issuecomment-2319988622):

> /cherry-pick 18.0.0-proposed

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
karelyatin commented 2 months ago

Glad you found the issue. Yes, I intentionally used a handler to restart the nova compute container anytime there is a config change or definition change for the container, to ensure we are always running with the correct content.

@SeanMooney ok so that restart here is basically hiding this behavior for nova_compute. As with a fresh deployment nova_compute is started with up-to-date config and immediately it also restarts, which shouldn't be required with a fresh setup. Also observed nova_compute is restarted on rerunning the deployment without any change due to these current handlers, but that's a separate issue and can be checked separately.

journalctl -u is the expected way to view live logs (i normally do journalctl -f -n 1000 -u )

Ok so this interface will not work for network agents with the current state of this PR unless and until we have some fix or workaround (podman stop or restart or something else) applied. Is it ok to get this merged without any workaround applied, or add a workaround similar to nova and later clean up as part of the issue reported by Jakub? I'm not sure if something needs to be done on the systemd or podman or conmon side to get this behavior resolved.

but it makes sense that if the container was not restarted then the change did not take effect yet.

Config-wise everything is in place; it's happening due to how we're setting things up:

* Create/start the container with podman run -n <container name>
* Create a systemd unit with ExecStart=podman start <container name> and enable/start it
* Since the container is already started, podman start is a no-op
* With a podman/conmon restart or a systemd restart the logs are visible (fields like _SYSTEMD_SLICE, _SYSTEMD_CGROUP, _SYSTEMD_UNIT etc. are updated properly); it seems the restart retriggers recreation of the conmon process, and that takes care of linking those logs to the systemd unit, iiuc done as part of https://github.com/containers/conmon/blob/3bc422cd8aaec542d85d1a80f2d38e6e69046b5b/src/ctr_logging.c#L254 (a quick way to inspect those journal fields is sketched below)
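
A hedged sketch (unit and container names as in the paste above) of comparing what journalctl -u sees with a query on the CONTAINER_NAME field that podman's journald log driver sets:

```bash
# Records written before the fix may lack _SYSTEMD_UNIT, so the unit filter
# misses them while the container-level field still matches.
sudo journalctl -u edpm_ovn_metadata_agent -n 5
sudo journalctl CONTAINER_NAME=ovn_metadata_agent -n 5 -o verbose \
  | grep -E '_SYSTEMD_UNIT=|CONTAINER_NAME=|SYSLOG_IDENTIFIER='
```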

To fix this particular part (to not start the container with podman run and instead let systemd start it), @cubeek pushed https://github.com/openstack-k8s-operators/edpm-ansible/pull/743
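
Roughly, the idea in that follow-up (names and image are illustrative; see the PR itself for the real change) is to only create the container and let the systemd unit perform the first start, so conmon is spawned inside the unit's cgroup and the journal records pick up _SYSTEMD_UNIT:

```bash
# Hedged sketch, not the actual role code; the image reference is an assumption.
IMAGE=quay.io/podified-antelope-centos9/openstack-neutron-metadata-agent-ovn:current-podified
sudo podman create --name ovn_metadata_agent --log-driver journald "$IMAGE"
sudo systemctl enable --now edpm_ovn_metadata_agent.service   # the unit's ExecStart runs "podman start ovn_metadata_agent"
```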

Logs are also collected fine now:
https://logserver.rdoproject.org/43/743/80e26ca1535bd248dfa5ddcc5d603dca62c86b0c/github-check/podified-multinode-edpm-deployment-crc/03bfe31/controller/ci-framework-data/logs/38.102.83.224/service_logs/edpm_ovn_controller.service.log
https://logserver.rdoproject.org/43/743/80e26ca1535bd248dfa5ddcc5d603dca62c86b0c/github-check/cifmw-crc-podified-edpm-baremetal/db6be93/ci-framework-data/logs/192.168.122.100/service_logs/edpm_ovn_controller.service.log

So ovn_controller and metadata agents are correctly logging to stdout and the journal, but as the correct fields are not set, these will not be visible with journalctl -u or systemctl status. Also, I don't know how the centralized logging Epic will actually collect the logs from EDPM nodes; if it filters based on the systemd unit name, these logs will get missed.