grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

alloy service not restarted on package upgrade #1007

Open · defanator opened this issue 3 months ago

defanator commented 3 months ago

What's wrong?

We are trying to migrate from grafana-agent to alloy on a number of long-running VMs. It turned out that on some distros the alloy service is not restarted on package upgrade; the details below are from an Ubuntu 20.04 host after upgrading alloy from 1.1.0 to 1.1.1.

root@staging2:/home/defan# grep alloy /var/log/dpkg.log
2024-06-10 03:57:55 upgrade alloy:arm64 1.1.0-1 1.1.1-1
2024-06-10 03:57:55 status half-configured alloy:arm64 1.1.0-1
2024-06-10 03:57:56 status unpacked alloy:arm64 1.1.0-1
2024-06-10 03:57:56 status half-installed alloy:arm64 1.1.0-1
2024-06-10 03:57:58 status unpacked alloy:arm64 1.1.1-1
2024-06-10 03:57:58 configure alloy:arm64 1.1.1-1 <none>
2024-06-10 03:57:58 status unpacked alloy:arm64 1.1.1-1
2024-06-10 03:57:58 status half-configured alloy:arm64 1.1.1-1
2024-06-10 03:57:58 status installed alloy:arm64 1.1.1-1

root@staging2:/home/defan# date
Mon Jun 10 04:01:13 UTC 2024

root@staging2:/home/defan# ps waux | grep alloy
alloy     245586  0.8  1.6 2321024 64288 ?       Ssl  Apr26 542:50 /usr/bin/alloy run --storage.path=/var/lib/alloy/data /etc/alloy/config.alloy

root@staging2:/home/defan# lsof -p 245586 | grep -i del
alloy   245586 alloy  txt       REG              259,1 217147728    3102 /usr/bin/alloy (deleted)
alloy   245586 alloy  DEL       REG              259,1              4105 /usr/lib/aarch64-linux-gnu/libnss_files-2.31.so
alloy   245586 alloy  DEL       REG              259,1              4098 /usr/lib/aarch64-linux-gnu/libc-2.31.so
alloy   245586 alloy  DEL       REG              259,1              4099 /usr/lib/aarch64-linux-gnu/libdl-2.31.so
alloy   245586 alloy  DEL       REG              259,1              4110 /usr/lib/aarch64-linux-gnu/libpthread-2.31.so
alloy   245586 alloy  DEL       REG              259,1              4092 /usr/lib/aarch64-linux-gnu/ld-2.31.so

root@staging2:/home/defan# cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Steps to reproduce

I guess this should work (a quick post-upgrade check is sketched after the steps):

  1. Install alloy 1.1.0 on Ubuntu 20.04.
  2. Configure and start the service.
  3. Upgrade alloy to 1.1.1 using apt-get install / apt-get upgrade.
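To confirm whether the upgrade actually restarted the service, something like this should do (just a sketch, assuming a systemd-based host and the stock alloy unit; not part of the original steps):

# Hypothetical check: if the main PID predates the upgrade and its executable
# still resolves to a "(deleted)" file, the package upgrade did not restart alloy.
systemctl show alloy -p MainPID -p ExecMainStartTimestamp
ls -l /proc/"$(systemctl show alloy -p MainPID --value)"/exe

An old ExecMainStartTimestamp together with "/usr/bin/alloy (deleted)" behind /proc/<pid>/exe matches what the lsof output above shows.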

At least the distros listed in the system information below were affected in our case.

System information

Ubuntu 20.04 5.15.0-1058-aws #64~20.04.1-Ubuntu SMP Tue Apr 9 11:11:55 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux (+ others)

Software version

Alloy v1.1.0, v1.1.1

Configuration

/* logging - base settings */

local.file_match "logs_integrations_integrations_node_exporter_direct_scrape" {
    path_targets = [{
        __address__ = "localhost",
        __path__    = "/var/log/{auth,dpkg,kern,mail}.log",
        instance    = "staging2",
        job         = "integrations/node_exporter",
    }]
}

loki.source.file "logs_integrations_integrations_node_exporter_direct_scrape" {
    targets               = local.file_match.logs_integrations_integrations_node_exporter_direct_scrape.targets
    forward_to            = [loki.write.logs_grafana_cloud.receiver]
    legacy_positions_file = "/tmp/positions.yaml"
}

loki.write "logs_grafana_cloud" {
    endpoint {
        url = "https://logs-prod3.grafana.net/loki/api/v1/push"
        basic_auth {
            username = "XXXX"
            password = "XXXX"
        }
    }
    external_labels = {}
}

/* metrics - base settings */

discovery.relabel "integrations_node_exporter" {
    targets = prometheus.exporter.unix.integrations_node_exporter.targets
    rule {
        target_label = "agent_hostname"
        replacement  = constants.hostname
    }
    rule {
        target_label = "instance"
        replacement  = "staging2"
    }
    rule {
        target_label = "job"
        replacement  = "integrations/node_exporter"
    }
}

prometheus.exporter.unix "integrations_node_exporter" {
    disable_collectors = ["ipvs", "btrfs", "infiniband", "xfs", "zfs"]
    filesystem {
        fs_types_exclude     = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
        mount_points_exclude = "^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)"
        mount_timeout        = "5s"
    }
    netclass {
        ignored_devices = "^(veth.*|cali.*|[a-f0-9]{15})$"
    }
    netdev {
        device_exclude = "^(veth.*|cali.*|[a-f0-9]{15})$"
    }
}

prometheus.scrape "integrations_node_exporter" {
    targets    = discovery.relabel.integrations_node_exporter.output
    forward_to = [prometheus.relabel.integrations_node_exporter.receiver]
    job_name   = "integrations/node_exporter"
    scrape_interval = "60s"
}

prometheus.relabel "integrations_node_exporter" {
    forward_to = [prometheus.remote_write.metrics_grafana_cloud.receiver]
    rule {
        source_labels = ["__name__"]
        regex         = "up|node_arp_entries|node_boot_time_seconds|node_context_switches_total|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_io_time_weighted_seconds_total|node_disk_read_bytes_total|node_disk_read_time_seconds_total|node_disk_reads_completed_total|node_disk_write_time_seconds_total|node_disk_writes_completed_total|node_disk_written_bytes_total|node_filefd_allocated|node_filefd_maximum|node_filesystem_avail_bytes|node_filesystem_device_error|node_filesystem_files|node_filesystem_files_free|node_filesystem_readonly|node_filesystem_size_bytes|node_intr_total|node_load1|node_load15|node_load5|node_md_disks|node_md_disks_required|node_memory_Active_anon_bytes|node_memory_Active_bytes|node_memory_Active_file_bytes|node_memory_AnonHugePages_bytes|node_memory_AnonPages_bytes|node_memory_Bounce_bytes|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_CommitLimit_bytes|node_memory_Committed_AS_bytes|node_memory_DirectMap1G_bytes|node_memory_DirectMap2M_bytes|node_memory_DirectMap4k_bytes|node_memory_Dirty_bytes|node_memory_HugePages_Free|node_memory_HugePages_Rsvd|node_memory_HugePages_Surp|node_memory_HugePages_Total|node_memory_Hugepagesize_bytes|node_memory_Inactive_anon_bytes|node_memory_Inactive_bytes|node_memory_Inactive_file_bytes|node_memory_Mapped_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_memory_SReclaimable_bytes|node_memory_SUnreclaim_bytes|node_memory_ShmemHugePages_bytes|node_memory_ShmemPmdMapped_bytes|node_memory_Shmem_bytes|node_memory_Slab_bytes|node_memory_SwapTotal_bytes|node_memory_VmallocChunk_bytes|node_memory_VmallocTotal_bytes|node_memory_VmallocUsed_bytes|node_memory_WritebackTmp_bytes|node_memory_Writeback_bytes|node_netstat_Icmp6_InErrors|node_netstat_Icmp6_InMsgs|node_netstat_Icmp6_OutMsgs|node_netstat_Icmp_InErrors|node_netstat_Icmp_InMsgs|node_netstat_Icmp_OutMsgs|node_netstat_IpExt_InOctets|node_netstat_IpExt_OutOctets|node_netstat_TcpExt_ListenDrops|node_netstat_TcpExt_ListenOverflows|node_netstat_TcpExt_TCPSynRetrans|node_netstat_Tcp_InErrs|node_netstat_Tcp_InSegs|node_netstat_Tcp_OutRsts|node_netstat_Tcp_OutSegs|node_netstat_Tcp_RetransSegs|node_netstat_Udp6_InDatagrams|node_netstat_Udp6_InErrors|node_netstat_Udp6_NoPorts|node_netstat_Udp6_OutDatagrams|node_netstat_Udp6_RcvbufErrors|node_netstat_Udp6_SndbufErrors|node_netstat_UdpLite_InErrors|node_netstat_Udp_InDatagrams|node_netstat_Udp_InErrors|node_netstat_Udp_NoPorts|node_netstat_Udp_OutDatagrams|node_netstat_Udp_RcvbufErrors|node_netstat_Udp_SndbufErrors|node_network_carrier|node_network_info|node_network_mtu_bytes|node_network_receive_bytes_total|node_network_receive_compressed_total|node_network_receive_drop_total|node_network_receive_errs_total|node_network_receive_fifo_total|node_network_receive_multicast_total|node_network_receive_packets_total|node_network_speed_bytes|node_network_transmit_bytes_total|node_network_transmit_compressed_total|node_network_transmit_drop_total|node_network_transmit_errs_total|node_network_transmit_fifo_total|node_network_transmit_multicast_total|node_network_transmit_packets_total|node_network_transmit_queue_length|node_network_up|node_nf_conntrack_entries|node_nf_conntrack_entries_limit|node_os_info|node_sockstat_FRAG6_inuse|node_sockstat_FRAG_inuse|node_sockstat_RAW6_inuse|node_sockstat_RAW_inuse|node_sockstat_TCP6_inuse|node_sockstat_TCP_alloc|node_sockstat_TCP_inuse|node_sockstat_TCP_mem|node_sockstat_TCP_mem_bytes|node_sockstat_TCP_orphan|node_sockstat_TCP_tw|node_sockstat_UDP6_inuse|node_sockstat_UDPLITE6_inuse|node_sockstat_UDPLITE_inuse|node_sockstat_UDP_inuse|node_sockstat_UDP_mem|node_sockstat_UDP_mem_bytes|node_sockstat_sockets_used|node_softnet_dropped_total|node_softnet_processed_total|node_softnet_times_squeezed_total|node_systemd_unit_state|node_textfile_scrape_error|node_time_zone_offset_seconds|node_timex_estimated_error_seconds|node_timex_maxerror_seconds|node_timex_offset_seconds|node_timex_sync_status|node_uname_info|node_vmstat_oom_kill|node_vmstat_pgfault|node_vmstat_pgmajfault|node_vmstat_pgpgin|node_vmstat_pgpgout|node_vmstat_pswpin|node_vmstat_pswpout|process_max_fds|process_open_fds"
        action        = "keep"
    }
}

prometheus.remote_write "metrics_grafana_cloud" {
    external_labels = {
        role = "staging",
    }
    endpoint {
        url = "https://prometheus-us-central1.grafana.net/api/prom/push"
        basic_auth {
            username = "XXXX"
            password = "XXXX"
        }
        queue_config {}
        metadata_config {}
    }
}

/* metrics - memcached integration */

discovery.relabel "integrations_memcached_exporter" {
    targets = prometheus.exporter.memcached.integrations_memcached_exporter.targets
    rule {
        target_label = "job"
        replacement  = "integrations/memcached_exporter"
    }
    rule {
        target_label = "instance"
        replacement  = "staging2"
    }
    rule {
        target_label = "cluster"
        replacement  = "staging2"
    }
}

prometheus.exporter.memcached "integrations_memcached_exporter" {
    address = "127.0.0.1:11211"
}

prometheus.scrape "integrations_memcached_exporter" {
    targets    = discovery.relabel.integrations_memcached_exporter.output
    forward_to = [prometheus.relabel.integrations_memcached_exporter.receiver]
    job_name   = "integrations/memcached_exporter"
    scrape_interval = "60s"
}

prometheus.relabel "integrations_memcached_exporter" {
    forward_to = [prometheus.remote_write.metrics_grafana_cloud.receiver]
    rule {
        source_labels = ["__name__"]
        regex         = "up|memcached_commands_total|memcached_connections_total|memcached_current_bytes|memcached_current_connections|memcached_current_items|memcached_items_evicted_total|memcached_items_total|memcached_max_connections|memcached_read_bytes_total|memcached_up|memcached_uptime_seconds|memcached_version|memcached_written_bytes_total"
        action        = "keep"
    }
}

/* metrics - naas-agent-converter */

discovery.relabel "integrations_converter_exporter" {
    targets = [{
        __address__ = "localhost:9999",
    }]
    rule {
        target_label = "instance"
        replacement  = "staging2"
    }
}

prometheus.scrape "integrations_converter_exporter" {
    targets    = discovery.relabel.integrations_converter_exporter.output
    forward_to = [prometheus.remote_write.metrics_grafana_cloud.receiver]
    job_name   = "integrations/converter_exporter"
    scrape_interval = "60s"
}

Logs

root@staging2:/# fgrep alloy /var/log/dpkg.log.1
2024-05-17 05:25:19 upgrade alloy:arm64 1.0.0-1 1.1.0-1
2024-05-17 05:25:19 status half-configured alloy:arm64 1.0.0-1
2024-05-17 05:25:19 status unpacked alloy:arm64 1.0.0-1
2024-05-17 05:25:19 status half-installed alloy:arm64 1.0.0-1
2024-05-17 05:25:22 status unpacked alloy:arm64 1.1.0-1
2024-05-17 05:25:22 configure alloy:arm64 1.1.0-1 <none>
2024-05-17 05:25:22 status unpacked alloy:arm64 1.1.0-1
2024-05-17 05:25:22 status half-configured alloy:arm64 1.1.0-1
2024-05-17 05:25:22 status installed alloy:arm64 1.1.0-1

root@staging2:/# fgrep alloy /var/log/dpkg.log
2024-06-10 03:57:55 upgrade alloy:arm64 1.1.0-1 1.1.1-1
2024-06-10 03:57:55 status half-configured alloy:arm64 1.1.0-1
2024-06-10 03:57:56 status unpacked alloy:arm64 1.1.0-1
2024-06-10 03:57:56 status half-installed alloy:arm64 1.1.0-1
2024-06-10 03:57:58 status unpacked alloy:arm64 1.1.1-1
2024-06-10 03:57:58 configure alloy:arm64 1.1.1-1 <none>
2024-06-10 03:57:58 status unpacked alloy:arm64 1.1.1-1
2024-06-10 03:57:58 status half-configured alloy:arm64 1.1.1-1
2024-06-10 03:57:58 status installed alloy:arm64 1.1.1-1

root@staging2:/# journalctl -u alloy | tail -10
Jun 09 22:47:40 staging2.amp.nginx.com alloy[245586]: ts=2024-06-09T22:47:40.699511983Z level=info msg="usage report sent with success"
Jun 09 23:30:45 staging2.amp.nginx.com alloy[245586]: ts=2024-06-09T23:30:45.315310746Z level=info msg="series GC completed" component_path=/ component_id=prometheus.remote_write.metrics_grafana_cloud subcomponent=wal duration=1.858454ms
Jun 09 23:30:45 staging2.amp.nginx.com alloy[245586]: ts=2024-06-09T23:30:45.315894023Z level=info msg="Creating checkpoint" component_path=/ component_id=prometheus.remote_write.metrics_grafana_cloud subcomponent=wal from_segment=530 to_segment=531 mint=1717975524000
Jun 09 23:30:45 staging2.amp.nginx.com alloy[245586]: ts=2024-06-09T23:30:45.332714382Z level=info msg="WAL checkpoint complete" component_path=/ component_id=prometheus.remote_write.metrics_grafana_cloud subcomponent=wal first=530 last=531 duration=19.259738ms
Jun 10 01:30:45 staging2.amp.nginx.com alloy[245586]: ts=2024-06-10T01:30:45.33576315Z level=info msg="series GC completed" component_path=/ component_id=prometheus.remote_write.metrics_grafana_cloud subcomponent=wal duration=1.90548ms
Jun 10 02:47:40 staging2.amp.nginx.com alloy[245586]: ts=2024-06-10T02:47:40.581011479Z level=info msg="reporting Alloy stats" date=2024-06-10T02:47:40.581Z
Jun 10 02:47:40 staging2.amp.nginx.com alloy[245586]: ts=2024-06-10T02:47:40.700093085Z level=info msg="usage report sent with success"
Jun 10 03:30:45 staging2.amp.nginx.com alloy[245586]: ts=2024-06-10T03:30:45.33848674Z level=info msg="series GC completed" component_path=/ component_id=prometheus.remote_write.metrics_grafana_cloud subcomponent=wal duration=1.812959ms
Jun 10 03:30:45 staging2.amp.nginx.com alloy[245586]: ts=2024-06-10T03:30:45.338818482Z level=info msg="Creating checkpoint" component_path=/ component_id=prometheus.remote_write.metrics_grafana_cloud subcomponent=wal from_segment=532 to_segment=533 mint=1717989924000
Jun 10 03:30:45 staging2.amp.nginx.com alloy[245586]: ts=2024-06-10T03:30:45.349381775Z level=info msg="WAL checkpoint complete" component_path=/ component_id=prometheus.remote_write.metrics_grafana_cloud subcomponent=wal first=532 last=533 duration=12.707538ms
github-actions[bot] commented 2 months ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

defanator commented 3 weeks ago

Still happening with 1.3.x:

# apt-get -y install alloy
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages will be upgraded:
  alloy
1 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.
Need to get 64.2 MB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 https://apt.grafana.com stable/main arm64 alloy arm64 1.3.1-1 [64.2 MB]
Fetched 64.2 MB in 4s (17.2 MB/s)
(Reading database ... 59003 files and directories currently installed.)
Preparing to unpack .../alloy_1.3.1-1_arm64.deb ...
Unpacking alloy (1.3.1-1) over (1.3.0-1) ...
Setting up alloy (1.3.1-1) ...

# ps waux | grep alloy
alloy     300892  0.6  2.5 2260612 99188 ?       Ssl  Aug19  89:59 /usr/bin/alloy run --disable-reporting --storage.path=/var/lib/alloy/data /etc/alloy/config.alloy
root      962520  0.0  0.0   6160  1692 pts/1    S+   03:00   0:00 grep alloy

# date
Thu Aug 29 03:00:21 AM UTC 2024
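For reference, restarting the unit manually after the upgrade (systemctl restart alloy) does pick up the new binary, so the missing piece appears to be on the packaging side. A typical Debian postinst handles restart-on-upgrade roughly like this (a sketch of the common convention only, not necessarily what the current alloy package ships):

#!/bin/sh
# Sketch of a typical restart-on-upgrade postinst fragment (hypothetical, not alloy's actual maintainer script).
set -e

case "$1" in
    configure)
        # $2 is the previously configured version; it is non-empty on upgrades only.
        if [ -n "$2" ] && [ -d /run/systemd/system ]; then
            systemctl try-restart alloy.service >/dev/null 2>&1 || true
        fi
        ;;
esac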