Closed kb70 closed 2 years ago
Hi,
fatal error: concurrent map read and map write
This is probably from this recent addition of gnmi subscriptions. Telegraf may need to either sync the lookup map or put some mutexes in place.
When there's only one CPU available (1 socket, 1 core), everything works as expected. As soon as there's more than one CPU available (e.g. 1 socket, 2 cores), a fatal error arises:
Out of curiosity, how are you controlling this? You said the only special setting is the memory_limit.
In a test environment: the Docker container runs on a KVM-virtualized host, where I can arbitrarily configure the number of sockets and cores.
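For context on why the core count matters: the Go runtime sets GOMAXPROCS to the number of logical CPUs it sees by default, so with a single core goroutines are not executed in parallel and an unsynchronized map access is much less likely to be caught (though still not safe). A minimal sketch for checking what the runtime sees inside the container; the helper name cpuInfo is an assumption made for illustration and is not part of Telegraf:

```go
package main

import (
	"fmt"
	"runtime"
)

// cpuInfo is a hypothetical helper. It reports the number of logical CPUs
// usable by this process and the current GOMAXPROCS setting.
// Calling runtime.GOMAXPROCS with 0 only queries the value, it does not change it.
func cpuInfo() (numCPU, maxProcs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	n, p := cpuInfo()
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", n, p)
}
```

Running this inside the container on the 1-core and 2-core VM configurations should show whether the Go runtime actually sees the change in available CPUs.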
@bewing as you are the author FYI - curious if you have seen this or if you are interested in looking to resolve this.
Thanks for the heads up -- My primary application was running telegraf directly on the device we were monitoring with gNMI, so I never ran into this.
I believe I just have to add a sync.Mutex to update or read from gnmi.GNMI.lookup?
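The approach described above can be sketched roughly as follows. This is not the code from the PR: the tagLookup type, its fields, and the Set/Get methods are hypothetical stand-ins for the plugin's lookup map, used only to show a map guarded by a sync.Mutex so that concurrent reads and writes no longer trigger the fatal error.

```go
package main

import (
	"fmt"
	"sync"
)

// tagLookup is a hypothetical stand-in for the plugin's lookup map.
// Every access to tags goes through mu.
type tagLookup struct {
	mu   sync.Mutex
	tags map[string]map[string]string // device address -> tag name -> tag value
}

func newTagLookup() *tagLookup {
	return &tagLookup{tags: make(map[string]map[string]string)}
}

// Set stores a tag value for a device address while holding the lock.
func (l *tagLookup) Set(addr, key, value string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.tags[addr] == nil {
		l.tags[addr] = make(map[string]string)
	}
	l.tags[addr][key] = value
}

// Get reads a tag value while holding the same lock as Set.
func (l *tagLookup) Get(addr, key string) (string, bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	v, ok := l.tags[addr][key] // indexing a nil inner map is safe for reads
	return v, ok
}

func main() {
	l := newTagLookup()
	var wg sync.WaitGroup
	// Several goroutines write and read concurrently, similar in spirit to
	// multiple subscription handlers sharing one lookup map.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			addr := fmt.Sprintf("10.0.0.%d", i)
			for n := 0; n < 1000; n++ {
				l.Set(addr, "name", "eth0")
				l.Get(addr, "name")
			}
		}(i)
	}
	wg.Wait()
	v, ok := l.Get("10.0.0.3", "name")
	fmt.Println(v, ok)
}
```

Removing the Lock/Unlock calls from this sketch and running it with `go run -race` on a multi-core machine is one way to see the kind of data race reported in this issue.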
@kb70 are you easily able to checkout the PR and see if the change resolves this issue?
Or download one of the build artifacts? https://github.com/influxdata/telegraf/pull/11008#issuecomment-1104241007
@bewing Thanks for the quick fix. I tested the PR (via the PR build artifacts and a 'quick & dirty' approach, i.e. replacing the telegraf binary in the Docker container (Telegraf 1.23.0-dfd54ceb)) and as far as I can tell it works now. Regardless of the number of sockets or cores, telegraf starts up without errors.
Thanks again for your work on this very valuable feature.
PS: I'll test it on one of our production monitoring machines, i.e. real hardware, later on, but I won't be able to report the outcome before tomorrow afternoon.
Thank you both for the fix and for testing this! I want to get this in our bugfix release Monday, so please do let us know the outcome, but I may merge anyway before then.
The fixed version has been running without problems for about a day now, on the virtualized test system as well as on bare metal, both with multiple CPU sockets and cores.
Again thanks for the quick response and fix.
@kb70 thank you for testing!
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.22.1, Debian Bullseye, Docker 20.10.14
Docker
Docker container is provisioned with ansible, only 'special' setting is memory_limit 512MB
Steps to reproduce
-) Run the Telegraf Docker container with a dynamic tagging configuration and more than 4 subscribed addresses.
When there's only one CPU available (1 socket, 1 core), everything works as expected.
As soon as there's more than one CPU available (e.g. 1 socket, 2 cores), a fatal error arises: fatal error: concurrent map read and map write
Expected behavior
working telegraf container
Actual behavior
fatal error: concurrent map read and map write
Additional info
No response