Closed: shengbinxu closed this issue 11 months ago.
Thanks for your issue. We will try to figure out what is happening here...
Hi, I did some tests on this. Here's my guess about what happens.
EaseProbe_client_kafka_total is an ever-increasing counter: every time EaseProbe runs a probe, it increases by one. So, for your images:
The first image, EaseProbe_client_kafka_total{status!="up"}, means your Kafka failed at about 11:05; since the number does not increase after that, it only failed around that time. EaseProbe_client_kafka_total keeps increasing, but Prometheus only sees EaseProbe_client_kafka_total{status!="up"} reach 3, so that line is flat.
EaseProbe_client_kafka_total{status="up"}
line is keep increasing (except 11:05), means kafka is alive all the time except when its donw at about 11:05.
For example, in your case it looks more like this:
EaseProbe_client_kafka_total=1 status=up
EaseProbe_client_kafka_total=2 status=down
EaseProbe_client_kafka_total=3 status=down
EaseProbe_client_kafka_total=4 status=up
EaseProbe_client_kafka_total=5 status=up
EaseProbe_client_kafka_total=6 status=up
EaseProbe_client_kafka_total=7 status=up
EaseProbe_client_kafka_total=8 status=up
EaseProbe_client_kafka_total=9 status=up
EaseProbe_client_kafka_total=10 status=up
EaseProbe_client_kafka_total=11 status=up
EaseProbe_client_kafka_total=12 status=up
EaseProbe_client_kafka_total=13 status=up
EaseProbe_client_kafka_total=14 status=up
EaseProbe_client_kafka_total=15 status=up
This is what happens here. EaseProbe_client_kafka_total increases every time a probe runs, and its status label is up or down based on the result of that probe.
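To make the mechanics concrete, here is a minimal sketch of a counter with a status label (using prometheus/client_golang; the names are illustrative, not the actual EaseProbe code):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// One counter, split into separate time series by the "status" label,
	// roughly like EaseProbe_client_kafka_total.
	probeTotal := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "EaseProbe_client_kafka_total"},
		[]string{"status"},
	)
	reg := prometheus.NewRegistry()
	reg.MustRegister(probeTotal)

	// Pretend we probed five times: up, down, down, up, up.
	for _, ok := range []bool{true, false, false, true, true} {
		if ok {
			probeTotal.WithLabelValues("up").Inc()
		} else {
			probeTotal.WithLabelValues("down").Inc()
		}
	}

	// What Prometheus would scrape: {status="up"} keeps rising over time,
	// while {status="down"} stops at its last value, so its graph is flat.
	families, _ := reg.Gather()
	for _, mf := range families {
		for _, m := range mf.GetMetric() {
			fmt.Println(mf.GetName(), m.GetLabel(), m.GetCounter().GetValue())
		}
	}
}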
Here's my naive test result; I made the Kafka server keep failing.
Thanks for your reply.
From Figure 1, it can be seen that Kafka had already hung before 10:45 and was restored at 11:05. I made this judgment because the value in the up state did not increase. However, in Figure 2, there are no values before 11:05.
8314 Tue Nov 28 11:03:04 2023 /usr/local/java/jdk/bin/java -Xmx1100m -Xms1100m -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcur
Using the Linux ps command, I can see the Kafka process was restarted at 11:03, so Kafka was restarted at about 11:05. As I recall, Kafka had been hung for about 30 minutes before 11:05.
At that time our Kafka had a problem, and it was uncertain whether the process was hung. The real problem was that we could not log in to the corresponding Linux server. We logged in through the console in the Alibaba Cloud backend and got a prompt that the server was out of memory, so we restarted the server and manually started the Kafka process (around 11:05).
PR #455 may be the root cause of this problem.
After a careful check of the EaseProbe code, I suspect that the EaseProbe Kafka client timeout does not work. This results in the Kafka client continuously waiting for a response. To verify this, I conducted a test by opening a TCP connection on port 9092, accepting the connection but not sending any data. Consequently, EaseProbe remained in a waiting state, indicating that the timeout was not functioning as expected.
This issue likely arises because the timeout is currently set only for the dialing process and not for the reading process from the connection. I have submitted a pull request to address this issue. However, there could be other factors contributing to this problem, and my analysis represents my best hypothesis.
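To illustrate the idea, here is a rough sketch assuming a segmentio/kafka-go style client (not the actual EaseProbe code or the exact patch): the dial timeout only bounds connection setup, so the metadata read needs its own deadline on the connection.

package main

import (
	"fmt"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	// The dial timeout only bounds connection setup. A server that accepts
	// the TCP connection but never replies still passes this step.
	dialer := &kafka.Dialer{Timeout: 10 * time.Second}
	conn, err := dialer.Dial("tcp", "127.0.0.1:9092")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	// Without a read deadline, ReadPartitions() can block forever waiting
	// for the metadata response; with one, it fails with an i/o timeout.
	if err := conn.SetReadDeadline(time.Now().Add(10 * time.Second)); err != nil {
		fmt.Println("set deadline:", err)
		return
	}

	if _, err := conn.ReadPartitions(); err != nil {
		fmt.Println("read partitions:", err)
		return
	}
	fmt.Println("metadata read OK")
}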
Thank you for bringing up this issue.
@suchen-sci Thank you so much, it's a big thumbs up!
I conducted a test by opening a TCP connection on port 9092, accepting the connection but not sending any data
How do you simulate this process?
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Listen on the Kafka port so EaseProbe's Kafka probe connects here.
	l, err := net.Listen("tcp", ":9092")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer l.Close()

	for {
		conn, err := l.Accept()
		if err != nil {
			fmt.Println(err)
		} else {
			// Accept the connection but never send any data, so the
			// client is left waiting for a response.
			go func(conn net.Conn) {
				defer conn.Close()
				time.Sleep(300 * time.Minute)
			}(conn)
		}
	}
}
go run server.go
Then change the EaseProbe log level to debug. This naive code just accepts the connection and then sleeps, while the Kafka client side dials successfully and waits for the response of conn.ReadPartitions().
Now you can see from the debug output that it times out. If you do not update the code, you will not see this debug message, because the Kafka client is still waiting...
DEBU[2023-12-05T16:34:49+08:00] client: {"name":"Kafka Native Client node0","endpoint":"127.0.0.1:9092","time":"2023-12-05T08:34:39.568444Z","timestamp":1701765279568,"rtt":10001067000,"status":"down","prestatus":"down","message":"Error (client/kafka): read tcp 127.0.0.1:53216-\u003e127.0.0.1:9092: i/o timeout","latestdowntime":"2023-12-05T06:58:59.947317Z","recoverytime":100015423000,"stat":{"since":"2023-12-05T06:33:25.616857Z","total":325,"status":{"1":230,"2":95},"uptime":1150000000000,"downtime":475000000000,"StatusHistory":[{"Status":false,"Message":"read tcp 127.0.0.1:53216-\u003e127.0.0.1:9092: i/o timeout"}],"MaxLen":1,"CurrentStatus":false,"StatusCount":1,"alert":{"strategy":"regular","factor":1,"max":1,"notified":55,"failed":55,"next":2,"interval":1}}}
"message":"Error (client/kafka): read tcp 127.0.0.1:53216-\u003e127.0.0.1:9092: i/o timeout" in new code.
I have built the latest code package from the main branch and deployed it online. I look forward to the follow-up results.
Describe the bug
Kafka client probe; the config is:
When node0 is down, it produces the Prometheus metric:
Then monitor it via PromQL:
Referring to this article at link, I believe it's because the label {status="down"} lacks an initial value.
I'm unsure how to write PromQL to monitor this metric.
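If the missing initial value is indeed the cause, then as far as I understand the usual exporter-side approach is to pre-create both label values so that {status="down"} is exported as 0 from the start. Here is a rough sketch with prometheus/client_golang (I have not checked whether EaseProbe does this, so this is only an assumption):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	probeTotal := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "EaseProbe_client_kafka_total"},
		[]string{"status"},
	)
	prometheus.MustRegister(probeTotal)

	// Touching both label values at startup exposes both series right away,
	// with {status="down"} at 0, so increase()/rate() have a baseline even
	// before the first failure ever happens.
	probeTotal.WithLabelValues("up")
	probeTotal.WithLabelValues("down")

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}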
Additionally, the value of total{status='up'} keeps increasing continuously (similar to a counter), while the value of total{status='down'} remains constant (the metric remained at 3 for approximately 15 minutes when the Kafka service was down).
Assuming the probe's detection occurs every 1 minute, the value of total{status='down'} should increase by 1 each time it's probed. If that's the case, using the increase() function should make it easier to detect such anomalies.
I've thought of a compromise, which involves monitoring the increment of the value for total{status='up'}. If the increment is 0, it indicates that the service is down. However, I still feel this method might not be very intuitive.
Expected behavior