akira-kurogane opened this issue 2 years ago
The obvious problem would be network reachability; however, you have been able to rule that out pretty well with curl.
I'm wondering if there's a possibility SELinux or similar is blocking the connection whilst running under Splunk?
Otherwise I'll re-test with Splunk 9 today, as the latest testing was on Splunk 8.x, just in case anything unexpected was introduced there.
Thanks Luke.
I considered the SELinux idea too, but nothing has been configured for that.
Beyond the curl test, made at the same time the connection timeout was being reported, I can add that the same Splunk process can fetch JMX monitoring data directly from the same Kafka server process using the "Monitoring of Java Virtual Machines with JMX" app (the one with app id SPLUNK4JMX). It was disabled at the time I tested yesterday, but I've confirmed it still works by re-enabling it via the data inputs config screen, i.e. without restarting the Splunk server.
I installed a v8.2.6 Splunk on a fresh server today. I'm afraid it's the same issue. Happy to share logs and try other diagnostic steps if it will help.
I'm out of any simple ideas. I'll see if I can find time to replicate somewhat, or possibly add some further logging in the connector to see if we can figure it out. Can you let me know what your scraping target (exporter) is and what Linux OS you are running Splunk on?
Sorry, I shouldn't have just posted that v8 was the same and left it there. Something was eliminated, which is a bit of progress, but not much.
But once I was less worried it was a version or security 'feature' of v9 Splunk, I thought it might be worthwhile to sniff the traffic just in case a connection was really happening (despite the "client timeout" message). I did, and it's interesting.
Request-response TCP conversation from Splunk with this add-on dumped in ascii:
akira@kfk-akira-1:~$ f=a.modinput_splunk_req.pcap
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do
> echo "==========BEGIN REQUEST=========="
> tshark -q -r "$f" -z follow,tcp,ascii,$stream;
> echo "==========END REQUEST=========="
> done;
==========BEGIN REQUEST==========
===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 10.146.0.6:48606
Node 1: 10.146.0.5:8297
156
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
164
HTTP/1.1 200 OK
Content-encoding: gzip
Date: Wed, 29 Jun 2022 02:11:14 GMT
Transfer-encoding: chunked
Content-type: text/plain; version=0.0.4; charset=utf-8
===================================================================
==========END REQUEST==========
I.e. the response is HTTP 200 OK, but the content was not returned. Or rather, it moves into chunked mode and then never sends any chunks?
For comparison this is the healthy request-response TCP conversation caused by a curl from the same server:
akira@kfk-akira-1:~$ f=b.curl_req.pcap
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do echo "==========BEGIN REQUEST=========="; tshark -q -r "$f" -z follow,tcp,ascii,$stream; echo "==========END REQUEST=========="; done;
==========BEGIN REQUEST==========
===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 10.146.0.7:37812
Node 1: 10.146.0.5:8297
139
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297
User-Agent: curl/7.58.0
Accept: */*
136
HTTP/1.1 200 OK
Date: Wed, 29 Jun 2022 02:12:19 GMT
Content-type: text/plain; version=0.0.4; charset=utf-8
Content-length: 143328
7040
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12468.06
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.656312303304E9
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 216.0
...
...
...
...
jvm_memory_pool_allocated_bytes
224
_created{pool="Compressed Class Space",} 1.656312304834E9
jvm_memory_pool_allocated_bytes_created{pool="Metaspace",} 1.656312304834E9
jvm_memory_pool_allocated_bytes_created{pool="CodeHeap 'non-nmethods'",} 1.656312304834E9
===================================================================
==========END REQUEST==========
I've adjusted the firewall in the test VPC containing the test servers. The test server above can be accessed globally for the moment at http://34.85.5.234:8297/metrics.
For clarification the prometheus exporter service is a kafka broker with the JMX-Prometheus exporter javaagent in it.
Here's another comparison observation to add.
We can request the prometheus metrics using a normal browser, of course. Using Chrome on my laptop to fetch http://34.85.5.234:8297/metrics I get the full result. It sends these HTTP request headers. The response HTTP headers are identical.
GET /metrics HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: en-AU,en;q=0.9,ja-JP;q=0.8,ja;q=0.7,en-GB;q=0.6,en-US;q=0.5
Cache-Control: max-age=0
Connection: keep-alive
DNT: 1
Host: 34.85.5.234:8297
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
sec-gpc: 1
I've added explicit Accept headers, as that may cause the behaviour of the exporter to change. That's in the master branch now; can you build it yourself?
Here's a copy of the executable with this change if not (put in linux_x86_64/bin):
https://drive.google.com/drive/folders/1Q9E9x3L6ReiW0ulutQuCHV8XiFxwsQkK
Hi Luke. That's really nice of you.
I could use the binary just as it was. Unfortunately it didn't lead to the jmx_exporter http service replying in a non-chunked way.
akira@kfk-akira-1:~$ f=a.modinput_acct_hdr_fix.pcap
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do echo "==========BEGIN REQUEST=========="; tshark -q -r "$f" -z follow,tcp,ascii,$stream; echo "==========END REQUEST=========="; done;
==========BEGIN REQUEST==========
===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 10.146.0.6:37950
Node 1: 10.146.0.5:8297
299
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297
User-Agent: Go-http-client/1.1
Accept: application/openmetrics-text;version=1.0.0,application/openmetrics-text;version=0.0.1;q=0.75,text/plain;version=0.0.4;q=0.5,*/*;q=0.1
Accept-Encoding: gzip
182
HTTP/1.1 200 OK
Content-encoding: gzip
Date: Wed, 29 Jun 2022 04:31:24 GMT
Transfer-encoding: chunked
Content-type: application/openmetrics-text; version=1.0.0; charset=utf-8
===================================================================
==========END REQUEST==========
Looking at the request headers from the curl case, which got a non-chunked response:
Accept: */*
User-Agent: curl/7.58.0
I need to completely restudy what I once knew about the chunked transfer protocol... Or maybe it's just the Connection: keep-alive header that's needed, if accepting the chunked way.
This is what the chrome web browser does:
akira@kfk-akira-1:~$ f=c.chrome_page_read.pcap
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do echo "==========BEGIN REQUEST=========="; tshark -q -r "$f" -z follow,tcp,ascii,$stream; echo "==========END REQUEST=========="; done;
==========BEGIN REQUEST==========
===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 157.107.9.43:35928
Node 1: 10.146.0.5:8297
519
GET /metrics HTTP/1.1
Host: 34.85.5.234:8297
Connection: keep-alive
Cache-Control: max-age=0
DNT: 1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: en-AU,en;q=0.9,ja-JP;q=0.8,ja;q=0.7,en-GB;q=0.6,en-US;q=0.5
sec-gpc: 1
164
HTTP/1.1 200 OK
Content-encoding: gzip
Date: Wed, 29 Jun 2022 05:02:22 GMT
Transfer-encoding: chunked
Content-type: text/plain; version=0.0.4; charset=utf-8
2804
1000
...........][s...~..@y..n.....V.A..D.u ...
<GZIPPED BINARY DATA, PRESUMABLY>
<ITS NORMAL PROMETHEUS RESPONSE WHEN UNZIPPED>
....
...
...
Keep-Alive should already be completely active.
I've tried a new change: explicitly set the Transfer-Encoding header on the request. You can get that version from the same GDrive link.
Thanks. I didn't see the change in effect unfortunately.
akira@kfk-akira-1:~$ f=b.second_fix_test.pcap
akira@kfk-akira-1:~$ for stream in `tshark -r "$f" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do echo "==========BEGIN REQUEST=========="; tshark -q -r "$f" -z follow,tcp,ascii,$stream; echo "==========END REQUEST=========="; done;
==========BEGIN REQUEST==========
===================================================================
Follow: tcp,ascii
Filter: tcp.stream eq 0
Node 0: 10.146.0.6:45338
Node 1: 10.146.0.5:8297
299
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297
User-Agent: Go-http-client/1.1
Accept: application/openmetrics-text;version=1.0.0,application/openmetrics-text;version=0.0.1;q=0.75,text/plain;version=0.0.4;q=0.5,*/*;q=0.1
Accept-Encoding: gzip
182
HTTP/1.1 200 OK
Content-encoding: gzip
Date: Wed, 29 Jun 2022 05:30:01 GMT
Transfer-encoding: chunked
Content-type: application/openmetrics-text; version=1.0.0; charset=utf-8
===================================================================
==========END REQUEST==========
I double-checked I wasn't using the older binary. I'm not - currently using the one with md5sum 7b8d3fdaf14b06c21c51720187130b52.
Keep-Alive should already be completely active.
As in TCP keep-alive? I haven't tracked or examined that yet, to be honest. I was only noticing the Connection: keep-alive HTTP header in the Chrome web browser request. But this is just me noticing differences in the text in front of me. When I look at docs about chunked transfer encoding, they don't say this HTTP 1.x header is required.
I tried adding the "Connect: keep-alive" header, purely as a hack. It didn't change anything.
For another comparison, this program works fine with the well-known node_exporter http service; capture shown below. (This capture has the "Connect: keep-alive" header, but it's successful without it too.)
GET /metrics HTTP/1.1
Host: kfk-akira-1.asia-northeast1-b.c.hsbc-stretched-cluster.internal:9100
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
Connect: keep-alive
4096
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: text/plain; version=0.0.4; charset=utf-8
Date: Wed, 29 Jun 2022 12:10:51 GMT
Transfer-Encoding: chunked
800
"YU.[.Te..W....5O.:.r..F].... <THE REST OF GZIP'ED BODY>
The chunk size (800 in this case) and the body are simply sent here, whereas it hangs or exits (not sure which) when the server is jmx_exporter.
So there's something about the interplay between the http server of jmx_exporter and this program. In a matrix of two servers (node_exporter (golang) and jmx_exporter (Java)) and three clients (ordinary web browser, curl, and this program), this is the only failing combination. I'm out of ideas for today; I need a walk or something to come up with more.
Ahah, looks like it's a timeout. The response was being delivered at ~550ms, exceeding the currently hard-coded limit in
prometheus/prometheus.go:
client := &http.Client{Transport: tr, Timeout: time.Duration(500000 * time.Microsecond)}
After tweaking it I got the full response, and I am filling my Splunk index with the metrics already.
I'll share a wrap-up later, in daylight hours.
I'm surprised at the time it takes for that exporter to respond -- I thought 500ms would be fine for any. The best option will be to make it configurable, so I'll set that up when I can.
An alternative I'd like to suggest: keep 500ms (or less) as the initial TCP connect timeout, and set something higher, say 3s, as the maximum response-receiving window.
The default output of the tshark CLI shows timing, which is handy. Below is a typical response seen with the v1.0.0 modinput_prometheus exe and all the header modifications we tried yesterday. 10.146.0.6 is the Splunk server running modinput_prometheus; 10.146.0.5 is the Kafka server with the jmx_exporter javaagent daemon thread running in it, serving on port 8297.
akira@kfk-akira-1:~$ tshark -r a.monahan.pcap
1 0.000000000 10.146.0.6 → 10.146.0.5 TCP 74 48606 → 8297 [SYN] Seq=0 Win=65320 Len=0 MSS=1420 SACK_PERM=1 TSval=2054752694 TSecr=0 WS=128
2 0.000033258 10.146.0.5 → 10.146.0.6 TCP 74 8297 → 48606 [SYN, ACK] Seq=0 Ack=1 Win=64768 Len=0 MSS=1420 SACK_PERM=1 TSval=3236123067 TSecr=2054752694 WS=128
3 0.000593839 10.146.0.6 → 10.146.0.5 TCP 66 48606 → 8297 [ACK] Seq=1 Ack=1 Win=65408 Len=0 TSval=2054752696 TSecr=3236123067
4 0.000860427 10.146.0.6 → 10.146.0.5 HTTP 222 GET /metrics HTTP/1.1
5 0.000874242 10.146.0.5 → 10.146.0.6 TCP 66 8297 → 48606 [ACK] Seq=1 Ack=157 Win=64640 Len=0 TSval=3236123068 TSecr=2054752696
6 0.494948139 10.146.0.6 → 10.146.0.5 TCP 66 48606 → 8297 [FIN, ACK] Seq=157 Ack=1 Win=65408 Len=0 TSval=2054753190 TSecr=3236123068
7 0.535274953 10.146.0.5 → 10.146.0.6 TCP 66 8297 → 48606 [ACK] Seq=1 Ack=158 Win=64640 Len=0 TSval=3236123603 TSecr=2054753190
8 0.569527987 10.146.0.5 → 10.146.0.6 TCP 230 HTTP/1.1 200 OK [TCP segment of a reassembled PDU]
9 0.569652653 10.146.0.6 → 10.146.0.5 TCP 54 48606 → 8297 [RST] Seq=158 Win=0 Len=0
Note that the TCP back and forth is immediate. The TCP handshake(?) stuff is acked and the HTTP GET packet is sent within only 0.860 ms. The HTTP response body is sent back at ~560 ms. The OS receives the first TCP window, which had the headers we saw, but the process (modinput_prometheus' prometheus exe) had apparently already shut the door.
Just for interest's sake the same when I expanded the timeout was this:
akira@kfk-akira-1:~$ tshark -r e.1500ms_timeout.pcap
1 0.000000000 10.146.0.6 → 10.146.0.5 TCP 74 37518 → 8297 [SYN] Seq=0 Win=65320 Len=0 MSS=1420 SACK_PERM=1 TSval=2105318544 TSecr=0 WS=128
2 0.000028538 10.146.0.5 → 10.146.0.6 TCP 74 8297 → 37518 [SYN, ACK] Seq=0 Ack=1 Win=64768 Len=0 MSS=1420 SACK_PERM=1 TSval=3286688917 TSecr=2105318544 WS=128
3 0.000793135 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=1 Ack=1 Win=65408 Len=0 TSval=2105318546 TSecr=3286688917
4 0.001064929 10.146.0.6 → 10.146.0.5 HTTP 216 GET /metrics HTTP/1.1
5 0.001073132 10.146.0.5 → 10.146.0.6 TCP 66 8297 → 37518 [ACK] Seq=1 Ack=151 Win=64640 Len=0 TSval=3286688918 TSecr=2105318546
6 0.554189872 10.146.0.5 → 10.146.0.6 TCP 202 HTTP/1.1 200 OK [TCP segment of a reassembled PDU]
7 0.554339979 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=137 Win=65280 Len=0 TSval=2105319099 TSecr=3286689472
8 0.554357411 10.146.0.5 → 10.146.0.6 TCP 7106 8297 → 37518 [PSH, ACK] Seq=137 Ack=151 Win=64640 Len=7040 TSval=3286689472 TSecr=2105318546 [TCP segment of a reassembled PDU]
9 0.554368649 10.146.0.5 → 10.146.0.6 TCP 5698 8297 → 37518 [PSH, ACK] Seq=7177 Ack=151 Win=64640 Len=5632 TSval=3286689472 TSecr=2105318546 [TCP segment of a reassembled PDU]
10 0.554394504 10.146.0.5 → 10.146.0.6 TCP 2882 8297 → 37518 [PSH, ACK] Seq=12809 Ack=151 Win=64640 Len=2816 TSval=3286689472 TSecr=2105319099 [TCP segment of a reassembled PDU]
11 0.554445624 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=7177 Win=61056 Len=0 TSval=2105319100 TSecr=3286689472
12 0.554445678 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=12809 Win=56960 Len=0 TSval=2105319100 TSecr=3286689472
13 0.554455747 10.146.0.5 → 10.146.0.6 TCP 11330 8297 → 37518 [PSH, ACK] Seq=15625 Ack=151 Win=64640 Len=11264 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
14 0.554461476 10.146.0.5 → 10.146.0.6 TCP 2882 8297 → 37518 [PSH, ACK] Seq=26889 Ack=151 Win=64640 Len=2816 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
15 0.554466410 10.146.0.5 → 10.146.0.6 TCP 11330 8297 → 37518 [PSH, ACK] Seq=29705 Ack=151 Win=64640 Len=11264 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
16 0.554472932 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=15625 Win=54912 Len=0 TSval=2105319100 TSecr=3286689472
17 0.554475502 10.146.0.5 → 10.146.0.6 TCP 4290 8297 → 37518 [PSH, ACK] Seq=40969 Ack=151 Win=64640 Len=4224 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
18 0.554478319 10.146.0.5 → 10.146.0.6 TCP 1474 8297 → 37518 [ACK] Seq=45193 Ack=151 Win=64640 Len=1408 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
19 0.554594156 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=26889 Win=48768 Len=0 TSval=2105319100 TSecr=3286689472
20 0.554594208 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=29705 Win=46720 Len=0 TSval=2105319100 TSecr=3286689472
21 0.554594248 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=40969 Win=40576 Len=0 TSval=2105319100 TSecr=3286689472
22 0.554594287 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=45193 Win=36480 Len=0 TSval=2105319100 TSecr=3286689472
23 0.554622793 10.146.0.5 → 10.146.0.6 TCP 21186 8297 → 37518 [PSH, ACK] Seq=46601 Ack=151 Win=64640 Len=21120 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
24 0.554633533 10.146.0.5 → 10.146.0.6 TCP 1474 8297 → 37518 [ACK] Seq=67721 Ack=151 Win=64640 Len=1408 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
25 0.554638452 10.146.0.5 → 10.146.0.6 TCP 5698 8297 → 37518 [PSH, ACK] Seq=69129 Ack=151 Win=64640 Len=5632 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
26 0.554646269 10.146.0.5 → 10.146.0.6 TCP 2882 8297 → 37518 [PSH, ACK] Seq=74761 Ack=151 Win=64640 Len=2816 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
27 0.554649300 10.146.0.5 → 10.146.0.6 TCP 4034 8297 → 37518 [PSH, ACK] Seq=77577 Ack=151 Win=64640 Len=3968 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
28 0.554847428 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=67721 Win=52864 Len=0 TSval=2105319100 TSecr=3286689472
29 0.554854105 10.146.0.5 → 10.146.0.6 TCP 32450 8297 → 37518 [PSH, ACK] Seq=81545 Ack=151 Win=64640 Len=32384 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
30 0.554865972 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=74761 Win=48768 Len=0 TSval=2105319100 TSecr=3286689472
31 0.554866025 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=77577 Win=46720 Len=0 TSval=2105319100 TSecr=3286689472
32 0.554866074 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=81545 Win=44672 Len=0 TSval=2105319100 TSecr=3286689472
33 0.554872211 10.146.0.5 → 10.146.0.6 TCP 12354 [TCP Window Full] 8297 → 37518 [PSH, ACK] Seq=113929 Ack=151 Win=64640 Len=12288 TSval=3286689472 TSecr=2105319100 [TCP segment of a reassembled PDU]
34 0.555064483 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=126217 Win=64128 Len=0 TSval=2105319100 TSecr=3286689472
35 0.555072286 10.146.0.5 → 10.146.0.6 HTTP 17491 HTTP/1.1 200 OK (text/plain)
36 0.555182453 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=151 Ack=143642 Win=99072 Len=0 TSval=2105319100 TSecr=3286689472
37 0.559839594 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [FIN, ACK] Seq=151 Ack=143642 Win=99072 Len=0 TSval=2105319105 TSecr=3286689472
38 0.559995129 10.146.0.5 → 10.146.0.6 TCP 66 8297 → 37518 [FIN, ACK] Seq=143642 Ack=152 Win=64640 Len=0 TSval=3286689477 TSecr=2105319105
39 0.560086130 10.146.0.6 → 10.146.0.5 TCP 66 37518 → 8297 [ACK] Seq=152 Ack=143643 Win=99072 Len=0 TSval=2105319105 TSecr=3286689477
So, looking at https://pkg.go.dev/net/http#Client (I'm not a regular golang programmer), the 'Timeout' option is not a connect timeout but an overall limit that will kill even an in-progress body transfer. I'd say 3s is a good limit for that.
It's not good for a collector to tolerate a dead link for anything more than whatever the network RTT is, so it's still good to hang up quickly if the TCP transport has no response. The TLSHandshakeTimeout option of the http Transport class (https://pkg.go.dev/net/http#Transport) seems to be the right tuning knob for that. The ResponseHeaderTimeout and ExpectContinueTimeout options of http.Transport look helpful too. Well, I can't promise any of them correspond exactly to the C and C++ libs I've used before and am thinking of. And I know that if you expose more than 2 of these in the user config file, user error becomes the bigger issue :/
FYI regarding the particular slow prometheus endpoint on the other side. https://github.com/prometheus/jmx_exporter/issues/718
I've pushed a change to allow a configurable timeout in milliseconds. Currently untested, which I'll get to over the weekend. If you wish to test it first, add httpTimeout = <milliseconds> to your input configuration.
Hi Luke
I'm afraid the config setting in git commit 3e7630d12 didn't work; the TCP conversation was terminated at ~560ms again.
I used the inputs.conf below. I also tried with all-lowercase "httptimeout" in case it was a config parser library behaviour to use that instead of camel-case, but no luck there either.
[prometheus://kfk-akira-1]
URI = http://kfk-akira-1.asia-northeast1-b.c.test-stretched-cluster.internal:8297/metrics
index = prometheus
sourcetype = prometheus:metric
interval = 10
disabled = 0
httpTimeout = 990
I checked the splunkd.log output but there's nothing printed out that indicates anything new.
Hi, I'm facing the same problem:
08-12-2022 10:59:43.843 +0200 ERROR ExecProcessor [66377 ExecProcessorSchedulerThread] - message from "/opt/splunk/etc/apps/modinput_prometheus/linux_x86_64/bin/prometheus" 2022/08/12 08:59:43 Get http://elabc.tst:7070/federate?match%5B%5D=%7B__name__%3D~%22..%2A%22%7D: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Is there already a solution? I tried the httpTimeout = 990 but this didn't help.
Rgds, Chris
I used 990ms because that was comfortably above the 500-550ms latency (of reading mbean values via JMX?) that my Kafka JVM had. If you're accessing more properties, or the JVM has fewer hardware resources, it could take even longer.
I've put it on 10000ms (10 sec) but still get the timeout. While when I execute the same GET via curl from my Heavy forwarder, I see the data rolling over my screen immediately.
While when I execute the same GET via curl from my Heavy forwarder, I see the data rolling over my screen immediately.
If by immediately you mean definitely less than half a second that's a different symptom to my case. In my case it was > 500ms before the first TCP response, which had the HTTP response headers. (Hence the "Client.Timeout exceeded while awaiting headers" error being reported by this golang exe run as a subprocess by Splunk.)
When I used curl it felt faster, but a tcp packet capture showed the same timing. curl succeeded because it doesn't have the same timeout, or at least not set so tightly.
I just found out that this app, when downloaded from Splunkbase, doesn't support the httpTimeout parameter.
I then tried to download the prometheus binary here, but it seems to be uncompiled source. Is a compiled version available for download?
Sorry, I've been unwell and haven't had time to confirm this parameter is working properly or to package it up on Splunkbase.
I will attempt to get to this over the next week or two.
My experience: when I collect only 1 metric it works; when I collect multiple (with a wildcard that matches) it fails with the Client.Timeout message. Examples:
match = {__name__="ig_route_requestactive"} => works perfectly
match = {__name__=~"ig_route_requestactive.*"} => works perfectly
match = {__name__=~"ig_route_request.*"} => fails
match = {__name__=~"..*"} => fails
While all these queries work in the prometheus interface
Greetings. I installed v1.0.0 in Splunk Enterprise v9.0.0 today, but I'm afraid I can't get past this issue.
My config in /etc/apps/modinput_prometheus/local/inputs.conf is
The index "prometheus" has already been created, as "metrics" type.
Unfortunately nothing is being collected. Instead the following message appears in /var/log/splunkd.log, once per interval period.
If I use curl to fetch
http://hostabc.my.domain:8297/metrics
from the shell on the same server, there is no problem, though. Would anyone have any clues?