Closed LarsStegman closed 1 month ago
> Azure DevOps....
What is the underlying OS and architecture that this is being run on?
> on an Azure DevOps ubuntu-latest agent
Since you are using a "latest" image, I immediately suspect that something changed in the image itself. Did the version of Go change between runs? Or did the version of Ubuntu roll to the next LTS, with greater memory usage?
> I think the regex changes might have something to do with it, but I am absolutely not sure about that. It would explain the memory usage though...
Can you remove all these tests and see if it makes a difference? Then add them back in one at a time?
> I think I have narrowed the problem down to commit https://github.com/influxdata/telegraf/commit/7896256f0a63aa0df565f8e433f9170984646039
Looking at our various CI runs for the commits you call out I don't see an enormous change due to the one commit:
Looking at the original PR for that change I found the CI run on:
From the PR of the commit (https://github.com/influxdata/telegraf/commit/fbfaba054e62413b6a0a90372281e687d9ff1238) that you think was prior:
linux/amd64 - 55% of 8GB or ~4.5GB
windows/amd64 - 25% of 30GB or ~7.5GB
macos/arm64 - 58% of 6GB or ~3.5GB
Looking at the current nightly build:
linux/amd64 - 60% of 8GB or ~5GB
windows/amd64 - 25% of 30GB or ~7.5GB
macos/arm64 - 50% of 6GB or ~3GB
> Azure DevOps....
>
> What is the underlying OS and architecture that this is being run on?
```
$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

$ uname -m
x86_64
```
Your CI also uses 22.04, right?
> > on an Azure DevOps ubuntu-latest agent
>
> Since you are using a "latest" image, I immediately suspect that something changed in the image itself. Did the version of Go change between runs? Or did the version of Ubuntu roll to the next LTS, with greater memory usage?
I thought so as well, but I no longer think this is the problem. I ran builds for both commits with the "latest" image. Go version is 1.22.0. Setting it to 1.22.5 does not change anything.
>
> Can you remove all these tests and see if it makes a difference? Then add them back in one at a time?
>
Even when I remove all tests, it seems that memory usage stays high. The problem persists until I completely remove the new dependency. The dependency does have an `init()`, but I don't really see how that could use so much memory. https://github.com/seancfoley/ipaddress-go/blob/836e812b566fe76b374de1bf1315979bc7ce7479/ipaddr/addrstrparam/ipaddrstrparams.go#L183
> > I think I have narrowed the problem down to commit [7896256](https://github.com/influxdata/telegraf/commit/7896256f0a63aa0df565f8e433f9170984646039)
>
> Looking at our various CI runs for the commits you call out I don't see an enormous change due to the one commit:
>
> Looking at the original PR for that change I found the CI run on:
> [...]
Yeah, that doesn't seem different.
Are you running `make test` or some other command?
> Your CI also uses 22.04, right?
The linux tests are run in a custom Docker container based on the Golang base image, which uses Debian Bookworm.
> I thought so as well, but I no longer think this is the problem. I ran builds for both commits with the "latest" image. Go version is 1.22.0. Setting it to 1.22.5 does not change anything.
Ah! Thanks for trying that out to confirm.
> Even when I remove all tests, it seems that memory usage stays high. The problem persists until I completely remove the new dependency. The dependency does have an `init()`, but I don't really see how that could use so much memory.
I went and ran master to get a pprof picture of the heap. I do see the ipaddr library grabbing 2MB.
From the output above it doesn't look like you've run many tests yet, as I would expect to see more output like:
```
?   github.com/influxdata/telegraf  [no test files]
ok  github.com/influxdata/telegraf/agent  1.392s
?   github.com/influxdata/telegraf/internal/choice  [no test files]
```
And in your screenshot I only see the very first line.
I thought maybe the additional init usage across 100s of tests would cause some sort of issue, but that seems unlikely here.
One option, and we do this for running the linter (not tests), is to use a combination of GOMEMLIMIT and GOGC to cap and reduce memory usage. See https://github.com/influxdata/telegraf/pull/14639 and https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications
> Are you running `make test` or some other command?
Just `make test` at this step
> The linux tests are run in a custom Docker container based on the Golang base image, which uses Debian Bookworm.
Interesting; might be worth looking into that more, but changing the build pipeline is not the most fun and rewarding job, haha.
> > Even when I remove all tests, it seems that memory usage stays high. The problem persists until I completely remove the new dependency. The dependency does have an `init()`, but I don't really see how that could use so much memory.
>
> From the output above it doesn't look like you've run many tests yet, as I would expect to see more output like:
>
> ```
> ?   github.com/influxdata/telegraf  [no test files]
> ok  github.com/influxdata/telegraf/agent  1.392s
> ?   github.com/influxdata/telegraf/internal/choice  [no test files]
> ```
>
> And in your screenshot I only see the very first line.
>
> I thought maybe the additional init usage across 100s of tests would cause some sort of issue, but that seems unlikely here.
Yes, the testing process already causes the machine to stall before any of the tests actually run. I feel like a large number of tests are being run in parallel, which causes the memory usage to grow.
> One option, and we do this for running the linter (not tests), is to use a combination of GOMEMLIMIT and GOGC to cap and reduce memory usage. See #14639 and https://weaviate.io/blog/gomemlimit-a-game-changer-for-high-memory-applications
Oh interesting, I didn't know about those env vars. Setting this value fixes the build for me! It takes a bit for all the tests to actually start, but at least it now runs!
**Relevant telegraf.conf**

**Logs from Telegraf**

**System info**

Azure DevOps Pipelines Ubuntu Latest

**Docker**

No response

**Steps to reproduce**

1. Check out commit 4b1b57c51d2d8fa88c17cd7de0f9b38dbd31428c
2. Run `make test` on an Azure DevOps `ubuntu-latest` agent

**Expected behavior**
In previous versions of branch `release-1.31` (commit fbfaba054e62413b6a0a90372281e687d9ff1238) the tests used around 1.5GB of memory, sometimes going up a bit, but usually going down again. The image below is based on commit 749d433951e3491337c9d8eca7c2edbb594bc9e2.

**Actual behavior**

`make test` claims around 6GB (the max on the build agent) and never releases it again. This build is based on commit 7896256f0a63aa0df565f8e433f9170984646039.

**Additional info**

Commit at fault: I think I have narrowed the problem down to commit 7896256f0a63aa0df565f8e433f9170984646039. I think the regex changes might have something to do with it, but I am absolutely not sure about that. It would explain the memory usage though...

UPDATE: I suspect it has to do with the new dependency used to parse IPv6 addresses.

Unfortunately, I am not able to gather more stats from the build agent because Azure DevOps....