lukemonahan / splunk_modinput_prometheus

A Splunk modular input for ingesting Prometheus metrics
Apache License 2.0

"Client.Timeout exceeded while awaiting headers" #28

Open akira-kurogane opened 2 years ago

akira-kurogane commented 2 years ago

Greetings. I installed v1.0.0 in Splunk Enterprise v9.0.0 today, but I'm afraid I can't get past this issue.

My config in /etc/apps/modinput_prometheus/local/inputs.conf is

[prometheus://kfk-akira-1]
URI = http://hostabc.my.domain:8297/metrics
index = prometheus
sourcetype = prometheus:metric
interval = 10
disabled = 0

The index "prometheus" has already been created, as a "metrics" type index.

Unfortunately nothing is being collected. Instead the following message appears in /var/log/splunkd.log, once per interval period.

06-27-2022 08:08:04.049 +0000 ERROR ExecProcessor [2378 ExecProcessor] - message from "/opt/splunk/etc/apps/modinput_prometheus/linux_x86_64/bin/prometheus"
  2022/06/27 08:08:04 Get http://hostabc.my.domain:8297/metrics: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

If I use curl to fetch http://hostabc.my.domain:8297/metrics from the shell on the same server there is no problem though.

root@splunkserver:/opt/splunk/etc/apps/modinput_prometheus# curl http://hostabc.my.domain:8297/metrics | head
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 387.12
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.656312303304E9
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
...

Would anyone have any clues?

ltmon commented 2 years ago

The obvious problem would be network reachability; however, you have been able to rule that out pretty well with curl.

I'm wondering if there's a possibility SELinux or similar is blocking the connection whilst running under Splunk?
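
If you want to rule that out quickly, checking getenforce or sestatus, and searching for recent denials with ausearch -m avc -ts recent, should show whether SELinux is even in play (exact commands depend on the distro).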

Otherwise, I'll re-test with Splunk 9 today, as the latest testing was on Splunk 8.x, just in case anything unexpected was introduced there.

akira-kurogane commented 2 years ago

Thanks Luke.

I considered the SELinux idea too, but nothing has been configured for that.

Beyond the curl test made at the same time the connection timeout was being reported, I can add that the same Splunk process can fetch JMX monitoring data directly from the same Kafka server process using the "Monitoring of Java Virtual Machines with JMX" app (the one with app id SPLUNK4JMX). It was disabled at the time I tested yesterday, but I've confirmed it still works by re-enabling it via the data inputs config screen, i.e. without a restart of the Splunk server.

akira-kurogane commented 2 years ago

I installed a v8.2.6 Splunk on a fresh server today. I'm afraid it's the same issue. Happy to share logs and try other diagnostic steps if it will help.

ltmon commented 2 years ago

I'm out of any simple ideas. I'll see if I can find time to replicate somewhat, or possibly add some further logging in the connector to see if we can figure it out. Can you let me know what your scraping target (exporter) is and what Linux OS you are running Splunk on?

akira-kurogane commented 2 years ago

Sorry, I shouldn't have just posted that v8 was the same and just left it there. Something was eliminated and that's a bit of progress, but not much.

But once I was less worried that it was a version or security 'feature' of Splunk v9, I thought it might be worthwhile to sniff the traffic just in case a connection was really happening (despite the "client timeout" message). I did, and it's interesting.

The request-response TCP conversation from Splunk with this add-on, dumped in ASCII:

akira@kfk-akira-1:~$ f=a.modinput_splunk_req.pcap
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do
>     echo "==========BEGIN REQUEST=========="
>     tshark -q -r "$f" -z follow,tcp,ascii,$stream;
>     echo "==========END REQUEST=========="
> done;
==========BEGIN REQUEST==========

===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 10.146.0.6:48606
Node 1: 10.146.0.5:8297
156
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip

    164
HTTP/1.1 200 OK
Content-encoding: gzip
Date: Wed, 29 Jun 2022 02:11:14 GMT
Transfer-encoding: chunked
Content-type: text/plain; version=0.0.4; charset=utf-8

===================================================================
==========END REQUEST==========

I.e. the response is HTTP 200 OK, but the content is never returned. Or rather, it moves into chunked mode and then never sends any chunks?

For comparison this is the healthy request-response TCP conversation caused by a curl from the same server:

akira@kfk-akira-1:~$ f=b.curl_req.pcap
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do     echo "==========BEGIN REQUEST==========";     tshark -q -r "$f" -z follow,tcp,ascii,$stream;     echo "==========END REQUEST=========="; done;
==========BEGIN REQUEST==========

===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 10.146.0.7:37812
Node 1: 10.146.0.5:8297
139
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297
User-Agent: curl/7.58.0
Accept: */*

    136
HTTP/1.1 200 OK
Date: Wed, 29 Jun 2022 02:12:19 GMT
Content-type: text/plain; version=0.0.4; charset=utf-8
Content-length: 143328

    7040
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12468.06
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.656312303304E9
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 216.0
...
...
...
...
jvm_memory_pool_allocated_bytes
    224
_created{pool="Compressed Class Space",} 1.656312304834E9
jvm_memory_pool_allocated_bytes_created{pool="Metaspace",} 1.656312304834E9
jvm_memory_pool_allocated_bytes_created{pool="CodeHeap 'non-nmethods'",} 1.656312304834E9

===================================================================
==========END REQUEST==========

I've adjusted the firewall in the test VPC containing these test servers. The test server above can be accessed globally for the moment at http://34.85.5.234:8297/metrics.

For clarification, the Prometheus exporter service is a Kafka broker with the JMX-Prometheus exporter javaagent running in it.

akira-kurogane commented 2 years ago

Here's another comparison observation to add.

We can request the Prometheus metrics using a normal browser, of course. Using Chrome on my laptop to fetch http://34.85.5.234:8297/metrics I get the full result. It sends these HTTP request headers. The response HTTP headers are identical.

GET /metrics HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: en-AU,en;q=0.9,ja-JP;q=0.8,ja;q=0.7,en-GB;q=0.6,en-US;q=0.5
Cache-Control: max-age=0
Connection: keep-alive
DNT: 1
Host: 34.85.5.234:8297
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
sec-gpc: 1

ltmon commented 2 years ago

I've added explicit Accept headers, as that may cause the behaviour of the exporter to change. That's in the master branch now: can you build yourself?

Here's a copy of the executable with this change if not (put in linux_x86_64/bin):

https://drive.google.com/drive/folders/1Q9E9x3L6ReiW0ulutQuCHV8XiFxwsQkK

akira-kurogane commented 2 years ago

Hi Luke. That's really nice of you.

I could use the binary just as it was. Unfortunately it didn't lead to the jmx_exporter http service replying in a non-chunked way.

akira@kfk-akira-1:~$ f=a.modinput_acct_hdr_fix.pcap 
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do     echo "==========BEGIN REQUEST==========";     tshark -q -r "$f" -z follow,tcp,ascii,$stream;     echo "==========END REQUEST=========="; done;
==========BEGIN REQUEST==========

===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 10.146.0.6:37950
Node 1: 10.146.0.5:8297
299
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297
User-Agent: Go-http-client/1.1
Accept: application/openmetrics-text;version=1.0.0,application/openmetrics-text;version=0.0.1;q=0.75,text/plain;version=0.0.4;q=0.5,*/*;q=0.1
Accept-Encoding: gzip

    182
HTTP/1.1 200 OK
Content-encoding: gzip
Date: Wed, 29 Jun 2022 04:31:24 GMT
Transfer-encoding: chunked
Content-type: application/openmetrics-text; version=1.0.0; charset=utf-8

===================================================================
==========END REQUEST==========

Looking at the request headers from the curl case, which got a non-chunked response, I can't see an obvious difference. I need to completely restudy what I once knew about the chunked transfer protocol ...
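
As a refresher for anyone following along: with Transfer-encoding: chunked the body arrives as a series of chunks, each prefixed by its size in hexadecimal on its own line, and terminated by a zero-length chunk. Schematically (not taken from these captures):

HTTP/1.1 200 OK
Transfer-encoding: chunked

1000
<4096 bytes of chunk data>
800
<2048 bytes of chunk data>
0

So the "1000" and "800" lines that show up in the later captures are chunk-size lines, not payload.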

akira-kurogane commented 2 years ago

Or maybe it's just the Connection: keep-alive header that's needed when accepting a chunked response.

This is what the chrome web browser does:

akira@kfk-akira-1:~$ f=c.chrome_page_read.pcap
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do     echo "==========BEGIN REQUEST==========";     tshark -q -r "$f" -z follow,tcp,ascii,$stream;     echo "==========END REQUEST=========="; done;
==========BEGIN REQUEST==========

===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 157.107.9.43:35928
Node 1: 10.146.0.5:8297
519
GET /metrics HTTP/1.1
Host: 34.85.5.234:8297
Connection: keep-alive
Cache-Control: max-age=0
DNT: 1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: en-AU,en;q=0.9,ja-JP;q=0.8,ja;q=0.7,en-GB;q=0.6,en-US;q=0.5
sec-gpc: 1

    164
HTTP/1.1 200 OK
Content-encoding: gzip
Date: Wed, 29 Jun 2022 05:02:22 GMT
Transfer-encoding: chunked
Content-type: text/plain; version=0.0.4; charset=utf-8

    2804
1000
...........][s...~..@y..n.....V.A..D.u   ... 
<GZIPPED BINARY DATA, PRESUMABLY>
<ITS NORMAL PROMETHEUS RESPONSE WHEN UNZIPPED>
....
...
...
ltmon commented 2 years ago

Keep-Alive should already be completely active.

I've tried a new change: explicitly set the Transfer-Encoding header on the request. You can get that version from the same GDrive link.

akira-kurogane commented 2 years ago

Thanks. I didn't see the change in effect unfortunately.

akira@kfk-akira-1:~$ f=b.second_fix_test.pcap 
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do     echo "==========BEGIN REQUEST==========";     tshark -q -r "$f" -z follow,tcp,ascii,$stream;     echo "==========END REQUEST=========="; done;
==========BEGIN REQUEST==========

===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 10.146.0.6:45338
Node 1: 10.146.0.5:8297
299
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297
User-Agent: Go-http-client/1.1
Accept: application/openmetrics-text;version=1.0.0,application/openmetrics-text;version=0.0.1;q=0.75,text/plain;version=0.0.4;q=0.5,*/*;q=0.1
Accept-Encoding: gzip

    182
HTTP/1.1 200 OK
Content-encoding: gzip
Date: Wed, 29 Jun 2022 05:30:01 GMT
Transfer-encoding: chunked
Content-type: application/openmetrics-text; version=1.0.0; charset=utf-8

===================================================================
==========END REQUEST==========

I double-checked I wasn't using the older binary. I'm not: I'm currently using the one with md5sum 7b8d3fdaf14b06c21c51720187130b52.

akira-kurogane commented 2 years ago

> Keep-Alive should already be completely active.

As in TCP keep-alive? I didn't track or examine that yet t.b.h.

I was only noticing the Connection: keep-alive HTTP header in the Chrome web browser request. But this is just me noticing differences in the text in front of me. When I look at docs about chunked transfer encoding, they don't specify that this HTTP/1.x header is required.

akira-kurogane commented 2 years ago

I tried adding the "Connect: keep-alive" header, purely as a hack. It didn't change anything.

For another comparison, this program works fine with the well-known node_exporter http service - capture shown below. (This capture has the "Connect: keep-alive" header, but it's successful without it too.)

GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.hsbc-stretched-cluster.internal:9100
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
Connect: keep-alive

    4096
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: text/plain; version=0.0.4; charset=utf-8
Date: Wed, 29 Jun 2022 12:10:51 GMT
Transfer-Encoding: chunked

800
"YU.[.Te..W....5O.:.r..F].... <THE REST OF GZIP'ED BODY>

The chunk-size line ("800" here) and the body are simply sent, whereas it hangs or exits (not sure which) when the server is jmx_exporter.

So there's something about the interplay between the http server of jmx_exporter and this program. In a matrix of two servers (node_exporter (golang) and jmx_exporter (Java)) and three clients (an ordinary web browser, curl, and this program), this is the only failing combination. I'm out of ideas for today; I need a walk or something to make some more.

akira-kurogane commented 2 years ago

Ahah, looks like it's a timeout. The response was being delivered at ~550ms, exceeding the currently hard-coded limit.

prometheus/prometheus.go:

client := &http.Client{Transport: tr, Timeout: time.Duration(500000 * time.Microsecond)}
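
For example, simply raising that value gets past this particular exporter (the figure below is illustrative; anything comfortably above the ~550ms response latency does it):

client := &http.Client{Transport: tr, Timeout: time.Duration(1500000 * time.Microsecond)}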

After tweaking it I got the full response, and I am filling my Splunk index with the metrics already.

I'll share a wrap-up later, in daylight hours.

ltmon commented 2 years ago

I'm surprised at the time it takes for that exporter to respond -- I thought 500ms would be fine for any. The best option will be to make it configurable, so I'll set that up when I can.

akira-kurogane commented 2 years ago

An alternative I'd like to suggest: keep 500ms (or less) as the initial TCP/connect timeout, and set something higher, say 3s, as the maximum response-receiving window.

The default output of the tshark CLI shows timing, which is handy. Below is a typical response seen with the v1.0.0 modinput_prometheus exe and all the header modifications we tried yesterday. 10.146.0.6 is the Splunk server running modinput_prometheus; 10.146.0.5 is the Kafka server with the jmx_exporter javaagent daemon thread running in it and serving on port 8297.

akira@kfk-akira-1:~$ tshark -r a.monahan.pcap 
    1 0.000000000   10.146.0.6 → 10.146.0.5   TCP 74 48606 → 8297 [SYN] Seq=0 Win=65320 Len=0 MSS=1420 SACK_PERM=1 TSval=2054752694 TSecr=0 WS=128
    2 0.000033258   10.146.0.5 → 10.146.0.6   TCP 74 8297 → 48606 [SYN, ACK] Seq=0 Ack=1 Win=64768 Len=0 MSS=1420 SACK_PERM=1 TSval=3236123067 TSecr=2054752694 WS=128
    3 0.000593839   10.146.0.6 → 10.146.0.5   TCP 66 48606 → 8297 [ACK] Seq=1 Ack=1 Win=65408 Len=0 TSval=2054752696 TSecr=3236123067
    4 0.000860427   10.146.0.6 → 10.146.0.5   HTTP 222 GET /metrics HTTP/1.1 
    5 0.000874242   10.146.0.5 → 10.146.0.6   TCP 66 8297 → 48606 [ACK] Seq=1 Ack=157 Win=64640 Len=0 TSval=3236123068 TSecr=2054752696
    6 0.494948139   10.146.0.6 → 10.146.0.5   TCP 66 48606 → 8297 [FIN, ACK] Seq=157 Ack=1 Win=65408 Len=0 TSval=2054753190 TSecr=3236123068
    7 0.535274953   10.146.0.5 → 10.146.0.6   TCP 66 8297 → 48606 [ACK] Seq=1 Ack=158 Win=64640 Len=0 TSval=3236123603 TSecr=2054753190
    8 0.569527987   10.146.0.5 → 10.146.0.6   TCP 230 HTTP/1.1 200 OK  [TCP segment of a reassembled PDU]
    9 0.569652653   10.146.0.6 → 10.146.0.5   TCP 54 48606 → 8297 [RST] Seq=158 Win=0 Len=0

Note that the TCP back-and-forth is immediate. The TCP handshake is acked and the HTTP GET packet is sent within only 0.860 ms. The HTTP response body is sent back at ~560 ms. The OS receives the first TCP window, which had the headers we saw, but the process (modinput_prometheus' prometheus exe) had apparently already shut the door.

Just for interest's sake, the same exchange after I expanded the timeout looked like this:

akira@kfk-akira-1:~$ tshark -r e.1500ms_timeout.pcap 
    1 0.000000000   10.146.0.6 → 10.146.0.5   TCP 74 37518 → 8297 [SYN] Seq=0 Win=65320 Len=0 MSS=1420 SACK_PERM=1 TSval=2105318544 TSecr=0 WS=128
    2 0.000028538   10.146.0.5 → 10.146.0.6   TCP 74 8297 → 37518 [SYN, ACK] Seq=0 Ack=1 Win=64768 Len=0 MSS=1420 SACK_PERM=1 TSval=3286688917 TSecr=2105318544 WS=128
    3 0.000793135   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=1 Ack=1 Win=65408 Len=0 TSval=2105318546 TSecr=3286688917
    4 0.001064929   10.146.0.6 → 10.146.0.5   HTTP 216 GET /metrics HTTP/1.1 
    5 0.001073132   10.146.0.5 → 10.146.0.6   TCP 66 8297 → 37518 [ACK] Seq=1 Ack=151 Win=64640 Len=0 TSval=3286688918 TSecr=2105318546
    6 0.554189872   10.146.0.5 → 10.146.0.6   TCP 202 HTTP/1.1 200 OK  [TCP segment of a reassembled PDU]
    7 0.554339979   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=137 Win=65280 Len=0 TSval=2105319099 TSecr=3286689472
    8 0.554357411   10.146.0.5 → 10.146.0.6   TCP 7106 8297 → 37518 [PSH, ACK] Seq=137 Ack=151 Win=64640 Len=7040 TSval=3286689472 TSecr=2105318546 [TCP segment of a reassembled PDU]
    9 0.554368649   10.146.0.5 → 10.146.0.6   TCP 5698 8297 → 37518 [PSH, ACK] Seq=7177 Ack=151 Win=64640 Len=5632 TSval=3286689472 TSecr=2105318546 [TCP segment of a reassembled PDU]
   10 0.554394504   10.146.0.5 → 10.146.0.6   TCP 2882 8297 → 37518 [PSH, ACK] Seq=12809 Ack=151 Win=64640 Len=2816 TSval=3286689472 TSecr=2105319099 [TCP segment of a reassembled PDU]
   11 0.554445624   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=7177 Win=61056 Len=0 TSval=2105319100 TSecr=3286689472
   12 0.554445678   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=12809 Win=56960 Len=0 TSval=2105319100 TSecr=3286689472
   13 0.554455747   10.146.0.5 → 10.146.0.6   TCP 11330 8297 → 37518 [PSH, ACK] Seq=15625 Ack=151 Win=64640 Len=11264 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   14 0.554461476   10.146.0.5 → 10.146.0.6   TCP 2882 8297 → 37518 [PSH, ACK] Seq=26889 Ack=151 Win=64640 Len=2816 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   15 0.554466410   10.146.0.5 → 10.146.0.6   TCP 11330 8297 → 37518 [PSH, ACK] Seq=29705 Ack=151 Win=64640 Len=11264 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   16 0.554472932   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=15625 Win=54912 Len=0 TSval=2105319100 TSecr=3286689472
   17 0.554475502   10.146.0.5 → 10.146.0.6   TCP 4290 8297 → 37518 [PSH, ACK] Seq=40969 Ack=151 Win=64640 Len=4224 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   18 0.554478319   10.146.0.5 → 10.146.0.6   TCP 1474 8297 → 37518 [ACK] Seq=45193 Ack=151 Win=64640 Len=1408 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   19 0.554594156   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=26889 Win=48768 Len=0 TSval=2105319100 TSecr=3286689472
   20 0.554594208   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=29705 Win=46720 Len=0 TSval=2105319100 TSecr=3286689472
   21 0.554594248   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=40969 Win=40576 Len=0 TSval=2105319100 TSecr=3286689472
   22 0.554594287   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=45193 Win=36480 Len=0 TSval=2105319100 TSecr=3286689472
   23 0.554622793   10.146.0.5 → 10.146.0.6   TCP 21186 8297 → 37518 [PSH, ACK] Seq=46601 Ack=151 Win=64640 Len=21120 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   24 0.554633533   10.146.0.5 → 10.146.0.6   TCP 1474 8297 → 37518 [ACK] Seq=67721 Ack=151 Win=64640 Len=1408 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   25 0.554638452   10.146.0.5 → 10.146.0.6   TCP 5698 8297 → 37518 [PSH, ACK] Seq=69129 Ack=151 Win=64640 Len=5632 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   26 0.554646269   10.146.0.5 → 10.146.0.6   TCP 2882 8297 → 37518 [PSH, ACK] Seq=74761 Ack=151 Win=64640 Len=2816 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   27 0.554649300   10.146.0.5 → 10.146.0.6   TCP 4034 8297 → 37518 [PSH, ACK] Seq=77577 Ack=151 Win=64640 Len=3968 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   28 0.554847428   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=67721 Win=52864 Len=0 TSval=2105319100 TSecr=3286689472
   29 0.554854105   10.146.0.5 → 10.146.0.6   TCP 32450 8297 → 37518 [PSH, ACK] Seq=81545 Ack=151 Win=64640 Len=32384 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   30 0.554865972   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=74761 Win=48768 Len=0 TSval=2105319100 TSecr=3286689472
   31 0.554866025   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=77577 Win=46720 Len=0 TSval=2105319100 TSecr=3286689472
   32 0.554866074   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=81545 Win=44672 Len=0 TSval=2105319100 TSecr=3286689472
   33 0.554872211   10.146.0.5 → 10.146.0.6   TCP 12354 [TCP Window Full] 8297 → 37518 [PSH, ACK] Seq=113929 Ack=151 Win=64640 Len=12288 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
   34 0.555064483   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=126217 Win=64128 Len=0 TSval=2105319100 TSecr=3286689472
   35 0.555072286   10.146.0.5 → 10.146.0.6   HTTP 17491 HTTP/1.1 200 OK  (text/plain)
   36 0.555182453   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=151 Ack=143642 Win=99072 Len=0 TSval=2105319100 TSecr=3286689472
   37 0.559839594   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [FIN, ACK] Seq=151 Ack=143642 Win=99072 Len=0 TSval=2105319105 TSecr=3286689472
   38 0.559995129   10.146.0.5 → 10.146.0.6   TCP 66 8297 → 37518 [FIN, ACK] Seq=143642 Ack=152 Win=64640 Len=0 TSval=3286689477 TSecr=2105319105
   39 0.560086130   10.146.0.6 → 10.146.0.5   TCP 66 37518 → 8297 [ACK] Seq=152 Ack=143643 Win=99072 Len=0 TSval=2105319105 TSecr=3286689477

So, looking at https://pkg.go.dev/net/http#Client (I'm not a regular golang programmer), the Timeout option is not a connect timeout but a limit that will even cut off an in-progress body transfer. I'd say 3s is a good limit for that.

It's not good for a collector to tolerate a dead link for much more than the network RTT, so it's still good to hang up quickly if the TCP transport gives no response. The TLSHandshakeTimeout option of the http.Transport type (https://pkg.go.dev/net/http#Transport) seems to be the right tuning knob for that. The ResponseHeaderTimeout and ExpectContinueTimeout options of http.Transport look helpful too. Well, I can't promise any of them are exact correspondents of the C and C++ libs I've used before and am thinking of. And I know that if you expose more than a couple of these in the user config file, user error becomes the bigger issue :/
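
To illustrate how those knobs might compose, here is a rough standalone sketch (not a patch against this add-on's code; the URL and the durations below are placeholders I made up):

package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

func main() {
	// Placeholder scrape target; substitute the real exporter URI.
	uri := "http://localhost:8297/metrics"

	tr := &http.Transport{
		// Connection-establishment phase: hang up quickly on a dead link.
		DialContext:         (&net.Dialer{Timeout: 500 * time.Millisecond}).DialContext,
		TLSHandshakeTimeout: 500 * time.Millisecond,
		// Header-wait phase: give a slow exporter (like jmx_exporter here) a few seconds.
		ResponseHeaderTimeout: 3 * time.Second,
	}
	client := &http.Client{
		Transport: tr,
		// Overall ceiling, covering the whole body transfer as well.
		Timeout: 10 * time.Second,
	}

	resp, err := client.Get(uri)
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("%s, %d bytes\n", resp.Status, len(body))
}

Whether more than one or two of these should actually be exposed in inputs.conf is a separate question, per the user-error point above.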

akira-kurogane commented 2 years ago

FYI regarding the particular slow prometheus endpoint on the other side. https://github.com/prometheus/jmx_exporter/issues/718

lmonahanwow commented 2 years ago

I've pushed a change to allow a configurable timeout in milliseconds. It's currently untested, which I'll get to over the weekend. If you wish to test it first, add httpTimeout = <milliseconds> to your input configuration.

akira-kurogane commented 2 years ago

Hi Luke

I'm afraid the config setting in git commit 3e7630d12 didn't work; the TCP conversation was terminated at ~560ms again.

I used the inputs.conf below. I also tried with all-lowercase "httptimeout" in case it was a config parser library behaviour to use that instead of camel-case, but no luck there either.

[prometheus://kfk-akira-1]
URI = http://kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297/metrics
index = prometheus
sourcetype = prometheus:metric
interval = 10
disabled = 0
httpTimeout = 990

I checked the splunkd.log output but there's nothing printed out that indicates anything new.

cdecc commented 2 years ago

Hi, I'm facing the same problem:

08-12-2022 10:59:43.843 +0200 ERROR ExecProcessor [66377 ExecProcessorSchedulerThread] - message from "/opt/splunk/etc/apps/modinput_prometheus/linux_x86_64/bin/prometheus"
  2022/08/12 08:59:43 Get http://elabc.tst:7070/federate?match%5B%5D=%7B__name__%3D~%22..%2A%22%7D: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Is there already a solution? I tried the httpTimeout = 990 but this didn't help.

Rgds, Chris

akira-kurogane commented 2 years ago

I used 990ms because that was comfortably above the 500~550ms latency (of reading mbean values via JMX?) that my Kafka JVM had. If you're accessing more properties, or the JVM has fewer hardware resources, it could take even longer.

cdecc commented 2 years ago

I've put it on 10000ms (10 sec) but still get the timeout. While when I execute the same GET via curl from my Heavy forwarder, I see the data rolling over my screen immediately.

akira-kurogane commented 2 years ago

> While when I execute the same GET via curl from my Heavy forwarder, I see the data rolling over my screen immediately.

If by immediately you mean definitely less than half a second, that's a different symptom from my case. In my case it was > 500ms before the first TCP response, which carried the HTTP response headers. (Hence the "Client.Timeout exceeded while awaiting headers" error being reported by this golang exe run as a subprocess by Splunk.)

When I used curl it felt faster, but a TCP packet capture showed the timing was the same. curl succeeded because it doesn't have the same timeout, or at least not one set so tightly.
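
In my case curl could be made to behave comparably by capping it explicitly (standard curl options; the URL is just my test exporter):

akira@kfk-akira-1:~$ curl -s -o /dev/null -w 'time_starttransfer: %{time_starttransfer}s\n' --max-time 0.5 http://kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297/metrics

With --max-time 0.5 it fails much like the add-on's 500ms limit; without --max-time, the -w output shows the ~0.55s it takes before the first response bytes arrive.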

cdecc commented 2 years ago

I just found out that this app, as downloaded from Splunkbase, doesn't support the httpTimeout parameter.

Now I was trying to download the prometheus binary from here, but it seems only the uncompiled source is available. Is a compiled version available for download?

lmonahanwow commented 2 years ago

Sorry, I've been unwell and haven't had time to confirm this parameter is working properly or to package it up on Splunkbase.

I will attempt to get to this over the next week or two.

cdecc commented 2 years ago

My experience: when I collect only 1 metric it works, but when I collect multiple (with a wildcard that matches) it fails with the Client.Timeout message. For example:

match = {__name__="ig_route_requestactive"}      => works perfectly
match = {__name__=~"ig_route_requestactive.*"}   => works perfectly
match = {__name__=~"ig_routerequest.*"}          => fails
match = {__name__=~"..*"}                        => fails

All of these queries work in the Prometheus interface, though.