influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

build: Building failing due to high memory usage during make test in release-1.31 branch #15657

Closed LarsStegman closed 1 month ago

LarsStegman commented 1 month ago

Relevant telegraf.conf

-

Logs from Telegraf

-

System info

Azure DevOps Pipelines Ubuntu Latest

Docker

No response

Steps to reproduce

  1. Checkout branch release-1.31 (tested up to commit 4b1b57c51d2d8fa88c17cd7de0f9b38dbd31428c)
  2. Run make test on an Azure DevOps ubuntu-latest agent
  3. Agent will run out of memory and shut down after about 35 minutes
  4. Logs indicate about 6GB of memory is used

Expected behavior

In previous versions of branch release-1.31 (commit fbfaba054e62413b6a0a90372281e687d9ff1238) the tests used around 1.5GB of memory, sometimes going up a bit, but usually going down again.

The image below is based on commit 749d433951e3491337c9d8eca7c2edbb594bc9e2:

image

Actual behavior

make test claims around 6 GB (the maximum available on the build agent) and never releases it. This build is based on commit 7896256f0a63aa0df565f8e433f9170984646039

image

Additional info

Commit at fault

I think I have narrowed the problem down to commit 7896256f0a63aa0df565f8e433f9170984646039. I think the regex changes might have something to do with it, but I am absolutely not sure about that. It would explain the memory usage though...

UPDATE: I suspect it has to do with the new dependency used to parse IPv6 addresses.

image

Unfortunately, I am not able to gather more stats from the build agent because Azure DevOps....

powersj commented 1 month ago

Azure DevOps....

What is the underlying OS and architecture that this is being run on?

on an Azure DevOps ubuntu-latest agent

Since you are using a "latest" image, I immediately suspect that something changed in the image itself. Did the version of Go change between runs? Or did the version of Ubuntu roll to the next LTS and use more memory?

I think the regex changes might have something to do with it, but I am absolutely not sure about that. It would explain the memory usage though...

Can you remove all these tests and see if it makes a difference? Then add them back in one at a time?

I think I have narrowed the problem down to commit https://github.com/influxdata/telegraf/commit/7896256f0a63aa0df565f8e433f9170984646039

Looking at our various CI runs for the commits you call out I don't see an enormous change due to the one commit:

Looking at the original PR for that change I found the CI run on:

From the PR of the commit (https://github.com/influxdata/telegraf/commit/fbfaba054e62413b6a0a90372281e687d9ff1238) that you think was prior:

LarsStegman commented 1 month ago

Azure DevOps....

What is the underlying OS and architecture that this is being run on?


$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:    22.04
Codename:   jammy

$ uname -m
x86_64



Your CI also uses 22.04, right?

> > on an Azure DevOps ubuntu-latest agent
> 
> Since you are using a "latest" image I am immediately suspect that something changed in the image itself. Did the version of Go change between runs? Or did the version of Ubuntu roll to the next LTS and have greater memory usage?

I thought so as well, but I no longer think this is the problem. I ran builds for both commits with the "latest" image. Go version is 1.22.0. Setting it to 1.22.5 does not change anything.

> Can you remove all these tests and see if it makes a difference? Then add them back in one at a time?

Even when I remove all tests, it seems that memory usage stays high. The problem persists until I completely remove the new dependency. The dependency does have an `init()`, but I don't really see how that could use so much memory. https://github.com/seancfoley/ipaddress-go/blob/836e812b566fe76b374de1bf1315979bc7ce7479/ipaddr/addrstrparam/ipaddrstrparams.go#L183

> > I think I have narrowed the problem down to commit [7896256](https://github.com/influxdata/telegraf/commit/7896256f0a63aa0df565f8e433f9170984646039)
> 
> Looking at our various CI runs for the commits you call out I don't see an enormous change due to the one commit:
> 
> Looking at the original PR for that change I found the CI run on:
> [...]

Yeah, that doesn't seem different.

powersj commented 1 month ago

Are you running make test or some other command?

Your CI also uses 22.04, right?

The linux tests are run in a custom Docker container based on the Golang base image, which uses Debian Bookworm.

I thought so as well, but I no longer think this is the problem. I ran builds for both commits with the "latest" image. Go version is 1.22.0. Setting it to 1.22.5 does not change anything.

Ah! Thanks for trying that out to confirm.

Even when I remove all tests, it seems that memory usage stays high. The problem persists until I completely remove the new dependency. The dependency does have an init(), but I don't really see how that could use so much memory.

I went and ran master to get a pprof picture of the heap. I do see the ipaddr library grabbing 2MB.

image

From the output above it doesn't look like you've run many tests yet, as I would expect to see more output like:

?   github.com/influxdata/telegraf  [no test files]
ok  github.com/influxdata/telegraf/agent    1.392s
?   github.com/influxdata/telegraf/internal/choice  [no test files]

And in your screenshot I only see the very first line.

I thought maybe the additional init usage across 100s of tests would cause some sort of issue, but that seems unlikely here.

One option, and we do this for running the linter (not tests), is to use a combination of GOMEMLIMIT and GOGC to cap and reduce memory usage. See https://github.com/influxdata/telegraf/pull/14639 and https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications

LarsStegman commented 1 month ago

Are you running make test or some other command?

Just make test at this step

The linux tests are run in a custom Docker container based on the Golang base image, which uses Debian Bookworm.

Interesting, I might want to look into that more, but changing the build pipeline is not the most fun and rewarding job, haha.

Even when I remove all tests, it seems that memory usage stays high. The problem persists until I completely remove the new dependency. The dependency does have an init(), but I don't really see how that could use so much memory.

From the output above it doesn't look like you've run many tests yet, as I would expect to see more output like:

?   github.com/influxdata/telegraf  [no test files]
ok  github.com/influxdata/telegraf/agent    1.392s
?   github.com/influxdata/telegraf/internal/choice  [no test files]

And in your screenshot I only see the very first line.

I thought maybe the additional init usage across 100s of tests would cause some sort of issue, but that seems unlikely here.

Yes, the testing process already causes the machine to stall before any of the tests are actually run. I suspect a large number of tests are being run in parallel, which causes the memory usage to grow.

One option, and we do this for running the linter (not tests), is to use a combination of GOMEMLIMIT and GOGC to cap and reduce memory usage. See #14639 and https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications

Oh interesting, I didn't know about those env vars. Setting these values fixes the build for me! It takes a while for all the tests to actually start, but at least it now runs!