google / mtail

extract internal monitoring data from application logs for collection in a timeseries database
Apache License 2.0

mtail fails to scrape with > 70k metrics presented. #903

Open JWThorne opened 4 months ago

JWThorne commented 4 months ago

In our current deployment, an HTTP GET on the /metrics endpoint fails whenever mtail has more than 70k metrics, even though the scrape time is under 2.5 seconds. There are no errors in the logs; the request just gets a failed response and the connection is closed after 2 seconds. Reducing the metric count appears to restore operation.

However, we need more metrics.
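
For anyone trying to pin down the "70k metrics" figure, here is a minimal sketch that counts sample lines in a saved scrape. It assumes the payload is in the Prometheus text exposition format, where every non-empty line that does not start with `#` is one sample. `countSamples` is a hypothetical helper written for this illustration, not part of mtail.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// countSamples returns the number of sample lines in a Prometheus
// text-format payload: non-empty lines that are not "#" comments
// (HELP/TYPE lines). Hypothetical helper, not part of mtail.
func countSamples(body []byte) int {
	n := 0
	for _, line := range bytes.Split(body, []byte("\n")) {
		line = bytes.TrimSpace(line)
		if len(line) == 0 || line[0] == '#' {
			continue
		}
		n++
	}
	return n
}

func main() {
	// Read a scrape from stdin so the same tool works on a saved file
	// or on piped curl output.
	body, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read:", err)
		os.Exit(1)
	}
	fmt.Printf("%d bytes, %d samples\n", len(body), countSamples(body))
}
```

Piping the output of a curl against /metrics into this program gives both the payload size and the sample count, which helps tell "too many series" apart from "too many bytes to write in time".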

jaqx0r commented 4 months ago

Which version please?

https://github.com/google/mtail/blob/main/docs/Troubleshooting.md#reporting-a-problem

Does it look like mtail has also stopped processing lines when a GET is being processed?

terencehonles commented 3 months ago

This may indeed be related to the issue I was seeing and the change in https://github.com/google/mtail/pull/908. I was testing with the /json handler, which emits the headers and then streams the response. I had also noticed that /metrics was returning an empty response when I tested it in the browser.

For /json I was seeing this message in the logs: E0805 16:26:05.490112 435505 json.go:27] write tcp [::1]:3903->[::1]:55250: i/o timeout

From curl with verbose logging I was seeing:

* transfer closed with outstanding read data remaining
* Closing connection
curl: (18) transfer closed with outstanding read data remaining

When rebuilding mtail without #908 (I need #906 for my mtail program) and testing /metrics again, I do see that there's nothing written to the logs, and curl looks like:

* Empty reply from server
* Closing connection
curl: (52) Empty reply from server

@JWThorne you can probably look at one of the other exporters to confirm you're seeing partial output from them, and you can either build from source or wait until #908 is released.

terencehonles commented 3 months ago

> the /metrics endpoint will just fail if mtail has more than 70k metrics

Is this the number of exported metrics, or the number of log lines you're processing?

In my case we had a number of counters with a large number of labels, so mtail was generating a large JSON payload and hitting the timeout.
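
To illustrate how a write timeout alone can produce an empty reply, here is a minimal sketch using Go's standard `net/http` and `httptest` packages. It is not mtail's code: it assumes the server sets something like `http.Server.WriteTimeout`, and `scrape` is a hypothetical helper in which a `time.Sleep` stands in for rendering a very large payload.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// scrape starts a throwaway server whose handler needs `work` to produce
// its response, applies writeTimeout to that server, and performs one GET.
// Hypothetical helper for illustration; it does not use any mtail code.
func scrape(writeTimeout, work time.Duration) (string, error) {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(work) // stand-in for rendering a very large payload
		fmt.Fprintln(w, "example_metric 1")
	})
	srv := httptest.NewUnstartedServer(h)
	srv.Config.WriteTimeout = writeTimeout // assumption: server has a write deadline
	srv.Start()
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Handler finishes well inside the deadline: the body arrives intact.
	body, err := scrape(2*time.Second, 0)
	fmt.Printf("fast handler: body=%q err=%v\n", body, err)

	// Handler outlives the deadline: the buffered response cannot be
	// flushed, the server drops the connection, and the client gets an
	// error instead of a status line.
	body, err = scrape(50*time.Millisecond, 300*time.Millisecond)
	fmt.Printf("slow handler: body=%q err=%v\n", body, err)
}
```

In the slow case the handler's small write lands in a buffer, so the handler itself sees no error. That would be consistent with nothing appearing in the logs, although whether mtail behaves the same way is an assumption here.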