grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Alloy panics on armv7hf #1099

Closed: imavroukakis closed this issue 2 months ago

imavroukakis commented 2 months ago

What's wrong?

Alloy fails with a panic shortly after starting up

Steps to reproduce

Building from source and running the agent on an armv7hf device causes it to panic; in contrast, the static agent, also compiled from source, works fine.

System information

Linux 5.x armv7hf

Software version

v1.1.1

Configuration

import.git "grafana_cloud" {
  repository = "https://github.com/grafana/alloy-modules.git"
  revision = "main"
  path = "modules/cloud/grafana/cloud/module.alloy"
  pull_frequency = "15m"
}

Logs

ts=2024-06-20T14:08:02.00501765Z level=info "boringcrypto enabled"=false 
ts=2024-06-20T14:08:02.005184662Z level=info msg="running usage stats reporter" 
ts=2024-06-20T14:08:02.005774369Z level=info msg="starting complete graph evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 
ts=2024-06-20T14:08:02.006133394Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=tracing duration=51.004µs 
ts=2024-06-20T14:08:02.006384412Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=logging duration=1.376763ms 
ts=2024-06-20T14:08:05.943629171Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=import.git.grafana_cloud duration=3.937067413s 
ts=2024-06-20T14:08:05.944028866Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=otel duration=32.002µs 
ts=2024-06-20T14:08:05.944303218Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=labelstore duration=80.672µs 
ts=2024-06-20T14:08:05.944963931Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=remotecfg duration=381.026µs 
ts=2024-06-20T14:08:05.94523395Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=livedebugging duration=81.339µs 
ts=2024-06-20T14:08:05.945497301Z level=info msg="applying non-TLS config to HTTP server" service=http 
ts=2024-06-20T14:08:05.94561431Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=http duration=180.013µs 
ts=2024-06-20T14:08:05.945807323Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=ui duration=14.334µs 
ts=2024-06-20T14:08:05.945978335Z level=info msg="finished node evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 node_id=cluster duration=15.335µs 
ts=2024-06-20T14:08:05.946120678Z level=info msg="finished complete graph evaluation" controller_path=/ controller_id="" trace_id=55f977c4cb7f626c5342ab787f58ea25 duration=3.941790076s 
ts=2024-06-20T14:08:05.946557709Z level=info msg="scheduling loaded components and services" 
ts=2024-06-20T14:08:05.946642048Z level=error msg="failed to start reporter" err="context canceled" 
ts=2024-06-20T14:08:05.947455438Z level=info msg="updating repository pull frequency, next pull attempt will be done according to the pullFrequency" config_path=/ config_id=import.git.grafana_cloud new_frequency=15m0s 
ts=2024-06-20T14:08:05.94777446Z level=info msg="starting cluster node" peers="" advertise_addr=127.0.0.1:12345 
panic: unaligned 64-bit atomic operation
    panic: unaligned 64-bit atomic operation

goroutine 180 [running]:
runtime/internal/atomic.panicUnaligned()
    /usr/local/go/src/runtime/internal/atomic/unaligned.go:8 +0x24
runtime/internal/atomic.Xadd64(0xd372f74, 0x1)
    /usr/local/go/src/runtime/internal/atomic/atomic_arm.s:265 +0x14
github.com/grafana/ckit/internal/lamport.(*Clock).Tick(...)
    /go/pkg/mod/github.com/grafana/ckit@v0.0.0-20230906125525-c046c99a5c04/internal/lamport/lamport.go:35
github.com/grafana/ckit.(*Node).changeState(0xd372f08, 0x2, 0xdf3c098)
    /go/pkg/mod/github.com/grafana/ckit@v0.0.0-20230906125525-c046c99a5c04/node.go:420 +0x124
github.com/grafana/ckit.(*Node).waitChangeState(0xd372f08, {0x80d6c38, 0xdf6e190}, 0x2)
    /go/pkg/mod/github.com/grafana/ckit@v0.0.0-20230906125525-c046c99a5c04/node.go:391 +0x84
github.com/grafana/ckit.(*Node).ChangeState(0xd372f08, {0x80d6c38, 0xdf6e190}, 0x2)
    /go/pkg/mod/github.com/grafana/ckit@v0.0.0-20230906125525-c046c99a5c04/node.go:372 +0x264
github.com/grafana/alloy/internal/service/cluster.(*Service).stop(0xdfd0930)
    /src/internal/service/cluster/cluster.go:330 +0x7c
panic({0x5ef0960, 0x80828e0})
    /usr/local/go/src/runtime/panic.go:770 +0xfc
runtime/internal/atomic.panicUnaligned()
    /usr/local/go/src/runtime/internal/atomic/unaligned.go:8 +0x24
runtime/internal/atomic.Xadd64(0xd372f74, 0x1)
    /usr/local/go/src/runtime/internal/atomic/atomic_arm.s:265 +0x14
github.com/grafana/ckit/internal/lamport.(*Clock).Tick(...)
    /go/pkg/mod/github.com/grafana/ckit@v0.0.0-20230906125525-c046c99a5c04/internal/lamport/lamport.go:35
github.com/grafana/ckit.(*Node).broadcastCurrentState(0xd372f08)
    /go/pkg/mod/github.com/grafana/ckit@v0.0.0-20230906125525-c046c99a5c04/node.go:298 +0xac
github.com/grafana/ckit.(*Node).Start(0xd372f08, {0x0, 0x0, 0x0})
    /go/pkg/mod/github.com/grafana/ckit@v0.0.0-20230906125525-c046c99a5c04/node.go:265 +0x208
github.com/grafana/alloy/internal/service/cluster.(*Service).Run(0xdfd0930, {0x80d6bf8, 0xe5c8240}, {0x80df534, 0xd0ff448})
    /src/internal/service/cluster/cluster.go:255 +0x3e0
github.com/grafana/alloy/internal/runtime/internal/controller.(*ServiceNode).Run(0xdffad20, {0x80d6bf8, 0xe5c8240})
    /src/internal/runtime/internal/controller/node_service.go:124 +0x48
github.com/grafana/alloy/internal/runtime/internal/controller.newTask.func1()
    /src/internal/runtime/internal/controller/scheduler.go:140 +0x8c
created by github.com/grafana/alloy/internal/runtime/internal/controller.newTask in goroutine 151
    /src/internal/runtime/internal/controller/scheduler.go:137 +0x140
mattdurham commented 2 months ago

It's unlikely we will support armv7; we have had a lot of issues with 32-bit ARM. At this point we officially only support 64-bit ARM. Likely related to this: https://github.com/golang/go/issues/67077

mattdurham commented 2 months ago

Flow mode would likely fail with the same issue; static mode doesn't support clustering, so it does not hit this edge case.

imavroukakis commented 2 months ago

Hey @mattdurham, thanks for the response. I understand that 32-bit ARM might be a PITA to support. Having said that, ARM SoCs are out there in abundance, especially in embedded devices, where one might well want to use Grafana Cloud in the day job ;)

In the instance we are trialling, quite frankly, we would not need to cluster anything; can we skip clustering and hopefully never care again? If not, where would you suggest I look to attempt to align whatever it is that's panicking?

mattdurham commented 2 months ago

We have had a lot of build issues over the years, and runtime errors with 32-bit, from the various libraries we import.

Long term, running node_exporter (or another exporter) directly and scraping it from a non-32-bit system is likely the best solution.

imavroukakis commented 2 months ago

We have had a lot of build issues over the years, and runtime errors with 32-bit, from the various libraries we import.

Long term, running node_exporter (or another exporter) directly and scraping it from a non-32-bit system is likely the best solution.

Due to a variety of network restrictions, scraping would not be possible :( but that's an idea to keep in mind anyway.

imavroukakis commented 2 months ago

I noticed a comment that was removed about running fieldalignment; happy to give any and all suggestions a go (no pun intended!)

hairyhenderson commented 2 months ago

I noticed a comment that was removed about running fieldalignment

Yeah, that was me. I looked at the Go bug listed above and was no longer certain this was the cause. I'm actually hunting a similar bug in another project and had been working on field alignment to fix it, but while there are a number of hits from the fieldalignment linter on ckit, I'm not at all convinced that simply fixing those would help.

imavroukakis commented 2 months ago

@hairyhenderson I took inspiration from your post, went ahead and changed the lamport clock implementation in ckit to use atomic.Uint64, and rebuilt:

package lamport

import "sync/atomic"

var globalClock Clock

// Now returns the current Time. The time will not be changed.
func Now() Time { return globalClock.Now() }

// Tick increases the current Time by 1 and returns it.
func Tick() Time { return globalClock.Tick() }

// Observe ensures that the time is at least past t. Must be called when
// receiving a message from a remote machine to roughly synchronize
// clocks between processes.
func Observe(t Time) { globalClock.Observe(t) }

// Clock implements a lamport clock. The current time can be retrieved by
// calling Now. The Clock must be manually incremented either by calling Tick
// or Observe.
type Clock struct {
    time atomic.Uint64
}

// Time is the value of a Clock.
type Time uint64

// Now returns the current Time of c. The time is not changed.
func (c *Clock) Now() Time {
    return Time(c.time.Load())
}

// Tick increases c's Time by 1 and returns the new value.
func (c *Clock) Tick() Time {
    return Time(c.time.Add(1))
}

// Observe ensures that c's time is past t. Observe must be called when
// receiving a message from a remote machine that contains a Time, and is used
// to roughly synchronize clocks between machines.
func (c *Clock) Observe(t Time) {
Retry:
    // If t is behind us, we don't need to do anything.
    now := c.time.Load()
    if uint64(t) < now {
        return
    }

    // Move our clock past t.
    if !c.time.CompareAndSwap(now, uint64(t+1)) {
        // Retry if the CAS failed, which can happen when many observations are
        // happening concurrently. Either this will eventually succeed or another
        // call to Observe will move the current time past t and we'll be able
        // to do the early stop.
        goto Retry
    }
}

To my delight, it no longer panics:

ts=2024-06-22T07:13:48.259514729Z level=info msg="starting cluster node" peers="" advertise_addr=127.0.0.1:12345
ts=2024-06-22T07:13:48.262359388Z level=info msg="peers changed" new_peers=inspect-val1xx-emmc
ts=2024-06-22T07:13:48.263384839Z level=info msg="now listening for http traffic" service=http addr=127.0.0.1:12345

Whether it works in a clustered scenario is a different kettle of fish.

hairyhenderson commented 2 months ago

@imavroukakis great news! Are you going to issue a PR to ckit for that?

imavroukakis commented 2 months ago

@hairyhenderson sure, if it will help!

imavroukakis commented 2 months ago

Now that the fix is merged, is another PR needed to pull in the change, or will a bot take care of it?

mattdurham commented 2 months ago

Another PR to update.