automatedops opened this issue 2 years ago (status: Open)
Another note: after fixing the TLS path in my case, the metrics-generator still wasn't able to send metrics properly. I had to delete the WAL files on all my pods to recover. Maybe calling "creating WAL" multiple times caused some kind of corruption of the WAL files.
Nice find. I agree with your thoughts about exiting the process on failure to load the TLS file. We will also look into the issue with WAL corruption.
Okay, I think we are not cleaning up resources properly after running into an error. The remote write config is loaded after the WAL has been created.
If loading the config fails, we just exit the function but do not destroy the newly created WAL. When a second batch is pushed, we try to create the WAL again and this causes the duplicate registration.
I'll check if we can validate the remote write config in advance. If not, we should just make sure we clean up new resources before returning the error.
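A minimal sketch of that cleanup, assuming hypothetical names (createWAL, newRemoteWriteStorage, tenantStorage) rather than the actual Tempo code:

```go
// Sketch only: names and signatures are illustrative, not the real Tempo code.
// The point is that the WAL created in step 1 must be torn down if step 2
// (building the remote-write storage) fails; otherwise a retry re-registers
// the WAL's collectors and panics.
func newTenantStorage(walDir string, reg prometheus.Registerer, cfg RemoteWriteConfig) (*tenantStorage, error) {
	wal, err := createWAL(walDir, reg) // registers WAL metrics on reg
	if err != nil {
		return nil, err
	}

	rw, err := newRemoteWriteStorage(cfg, reg)
	if err != nil {
		// Close the WAL (and unregister its collectors) before returning,
		// so the next attempt starts from a clean slate.
		wal.Close()
		return nil, err
	}

	return &tenantStorage{wal: wal, remoteWrite: rw}, nil
}
```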
About the corrupted WAL: creating the WAL multiple times should be fine. The next invocations will reuse the same directory. I'm thinking the panic might have disrupted some async process, leaving the WAL in an invalid state.
So this is an issue upstream in Prometheus: if the first attempt to create the remote write structure fails, not all resources are cleaned up correctly and the second attempt will panic when registering metrics. We can't fix this from Tempo code. I've created an issue upstream: https://github.com/prometheus/prometheus/issues/10779
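For reference, the panic itself comes from the Prometheus client_golang registry, which rejects a second collector describing the same metric. A standalone reproduction, unrelated to Tempo:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	reg := prometheus.NewRegistry()

	// First registration succeeds.
	reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total",
		Help: "Example counter.",
	}))

	// Registering another collector that describes the same metric panics
	// with "duplicate metrics collector registration attempted", the same
	// panic seen when the per-tenant storage is created a second time
	// against a registerer that still holds the first attempt's collectors.
	reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total",
		Help: "Example counter.",
	}))
}
```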
@kvrhdn in the event that we can't create the WAL or the remote write structures, should we just exit cleanly instead of trying again?
Could this impact us during normal operations, or only on startup?
We create these structures when we first receive data for that tenant. Since we don't know in advance which tenants are active, we can't create them at startup. We can try to validate the config at startup, but maintaining that validation ourselves will be tricky...
We could deliberately exit the process when creating the remote write structures fails, but I don't know if we can do that in a clean way. Btw, an error in the remote write config will most likely impact every tenant, so in practice the first tenant that sends data will trigger this error.
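As a rough sketch of what deliberately exiting could look like (RemoteWriteConfig and newRemoteWriteStorage below are hypothetical stand-ins, not the actual Tempo code):

```go
package main

import (
	"errors"
	"log"
)

// RemoteWriteConfig and newRemoteWriteStorage are hypothetical stand-ins for
// the real per-tenant setup code; only the error-handling pattern matters here.
type RemoteWriteConfig struct {
	Endpoint string
}

func newRemoteWriteStorage(cfg RemoteWriteConfig) (struct{}, error) {
	if cfg.Endpoint == "" {
		return struct{}{}, errors.New("remote write endpoint is not configured")
	}
	return struct{}{}, nil
}

func main() {
	cfg := RemoteWriteConfig{} // deliberately invalid to show the failure path

	if _, err := newRemoteWriteStorage(cfg); err != nil {
		// Fail the whole process instead of retrying on the next push and
		// hitting the duplicate-registration panic. A supervisor such as
		// Kubernetes can restart the pod once the config is fixed.
		log.Fatalf("failed to create remote-write storage: %v", err)
	}
}
```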
I have configured the metrics-generator to write to Prometheus remote-write storage. However, no metrics are received by Prometheus, although some data is written to the Prometheus WAL directory.
Similar to the panic log above, I am seeing the message below when starting the metrics-generator. Could the issue that produces this message cause the metrics-generator not to send metrics to Prometheus?
"level=info ts=2022-05-25T19:24:59.700977765Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=ivapp14... ring=metrics-generator level=info ts=2022-05-25T19:24:59.701209972Z caller=app.go:284 msg="Tempo started""
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply the keepalive label to exempt this issue.
Describe the bug When configuring the metrics-generator on Grafana Tempo, I had a misconfigured TLS path.
The metrics-generator pod boots up but crashes immediately.
In the full crash log, "creating WAL" shows up multiple times, which leads to this crash. I think it would be helpful if we could exit the program on any misconfigured remote write endpoint. Feel free to suggest other error-handling approaches.
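One way such an up-front check might look, as a rough sketch; validateTLSFiles and the file paths are illustrative assumptions, not Tempo's actual config handling:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"os"
)

// validateTLSFiles is a hypothetical helper that checks the TLS files
// referenced by a remote-write config before the generator starts accepting
// data, so a bad path fails the process at startup instead of at first push.
func validateTLSFiles(caFile, certFile, keyFile string) error {
	if caFile != "" {
		if _, err := os.ReadFile(caFile); err != nil {
			return fmt.Errorf("reading TLS CA file %q: %w", caFile, err)
		}
	}
	if certFile != "" || keyFile != "" {
		if _, err := tls.LoadX509KeyPair(certFile, keyFile); err != nil {
			return fmt.Errorf("loading TLS key pair %q/%q: %w", certFile, keyFile, err)
		}
	}
	return nil
}

func main() {
	// Illustrative paths only.
	if err := validateTLSFiles("/etc/tempo/ca.crt", "", ""); err != nil {
		log.Fatalf("invalid remote write TLS configuration: %v", err)
	}
}
```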
To Reproduce Steps to reproduce the behavior:
panic: duplicate metrics collector registration attempted
Expected behavior
Environment:
Additional Context Sample configuration
Full Panic log