Seems that within 8 hours it ends up at a constant ~2 CPU cores and 6 GB of memory, then stops responding with:
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
Container logs
[104.175s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
2019-06-24 04:53:42,463 ERROR pGroup-1-2 o.k.c.ErrorController unable to create native thread: possibly out of memory or process/resource limits reached
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
From what I can see further in the DEBUG/TRACE logs, it's constantly doing Describes on all the topics, offsets, and consumer groups over and over until it runs out of threads/RAM.
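For illustration only (this is not KafkaHQ's actual code), the kind of admin "round" the TRACE logs suggest is being repeated looks roughly like the following; the broker address, topic, and group names are placeholders:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class DescribeRound {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // One "round" of the metadata a topic/consumer-group page needs.
            // The logs suggest rounds like this were issued repeatedly for every
            // topic and group until no more native threads could be created.
            admin.describeTopics(List.of("some-topic")).all().get();
            admin.listConsumerGroups().all().get();
            admin.listConsumerGroupOffsets("some-group")
                 .partitionsToOffsetAndMetadata().get();
        }
    }
}
```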
Is there a private place you'd like the debug logs put?
Thanks
Is this constant behavior? It seems really strange, and related to https://github.com/tchiotludo/kafkahq/issues/75 I think. I'm thinking about reducing the timeout, but that won't fix the main issue. What is really strange is that I don't really control the threads; Micronaut does, and as I understand it, the thread pool is limited!
If you want to send private logs, send them here: tchiot.ludo@gmail.com
This is constant, but only happening in one of our clusters, a testing one that is bigger than the others. Same number of topics (140) and schemas etc., just more messages. The clusters have the same topology too. All our Java microservice applications are still running fine on this cluster.
I'll fire the logs through now, thanks for that! Appreciate your time and effort :)
This is constant behavior in this particular cluster, but it has the same topology and number of topics as the others that are still working fine. The only difference is the number of messages within the cluster (many many more).
The java microservices in this cluster are still working fine with Kafka, just appears to be HQ that's having difficulty.
Actually, it seems there's nothing particularly confidential in the logs, so I'll attach them here. Thanks very much indeed. kafkahq.log
I had a quick look; there are a lot of strange things in this log, and a few gut feelings:
Can you tell me more about the topics in the cluster (especially the number of partitions)? Also, can you try to isolate a single query (and only one) on the topic page in the log file?
Thanks @tchiotludo , appreciate your time.
Partition counts vary from 1 to 3 for all topics, except for 2 topics that have 12 partitions. All have replication factor 3.
I've asked others to stop using the UI, did a full restart with trace logging, and the log is attached. All the attached logs are with no one accessing it, not even myself. The schema registry has several versions registered per schema for some topics.
After 4 minutes of running, it was using just over 1 GB of memory and 1.5 CPU cores. Thanks!
Interestingly, if I try out the Landoop kafka-topics-ui, it loads everything fine/quickly and keeps working... but it's really not as lovely and full-featured as KafkaHQ, nor does it properly deserialise all Avro as HQ does.
Could this be an issue with several schema versions being registered for a topic, causing both applications some difficulty?
Thanks!
From what I see in the last log, Avro is not the problem. It's not really easy to understand from a simple log, but it seems that KafkaHQ is doing too many queries on Kafka. I have a kind of internal cache per HTTP request that doesn't work anymore since I introduced pagination. This is the first option I will look at.
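Roughly the idea described above, as a minimal sketch (this is not KafkaHQ's actual implementation): a memo cache that lives for a single HTTP request, so repeated lookups for the same key reuse the first result instead of hitting Kafka again.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical per-request cache: the first caller computes the value for a key,
// later callers with the same key during the same request reuse it.
public class RequestScopedCache {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    public <T> T get(String key, Supplier<T> loader) {
        return (T) cache.computeIfAbsent(key, k -> loader.get());
    }
}
```

One instance of this would be created per request and discarded afterwards; if pagination splits what used to be one request into many, the deduplication benefit is lost, which may be why the cache stopped helping as described above.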
Did you use KafkaHQ before I added pagination on the topic list? And if yes, did you have the bug before? A good test would be to use version 0.7.2 and see if you still have the issue.
Thanks - actually it seems not to have been an issue pre-0.8.0 -- I'll go back to 0.7.2 and see how that goes. Really appreciate your efforts here!
Okay, yeah, 0.7.2 works, and is actually really snappy and doesn't appear to have the same issue - I'll get back to our users to get them to try it for the next 24 hours and let you know.
Thanks @tchiotludo !
Yeah :+1: So my optimization is not an optimization :cry: and the pagination is worse than before!
I'll try to reproduce on my side, but I have a clearer view of the reason now! Thanks for your time on this issue!
@cg-nz Can you try with the dev version: docker pull tchiotludo/kafkahq:dev?
I've tried to make a fix that avoids duplicate calls to the Kafka API for the same query (consumer group, offset, ...).
I've also reduced the timeout on the API.
Does this work better with your cluster?
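For reference only (not necessarily how the dev image does it), the usual way to bound a single admin request with the Kafka client is to set a timeout on the request options and on the returned future, along these lines:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeTopicsOptions;
import org.apache.kafka.clients.admin.TopicDescription;

public class BoundedDescribe {
    // Illustrative helper: cap the broker-side request at 10s and the client-side
    // wait at 15s, so a slow cluster fails fast instead of piling up blocked threads.
    static Map<String, TopicDescription> describe(AdminClient admin, List<String> topics)
            throws Exception {
        return admin.describeTopics(topics, new DescribeTopicsOptions().timeoutMs(10_000))
                    .all()
                    .get(15, TimeUnit.SECONDS);
    }
}
```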
Hey @tchiotludo
Thanks for your time and replying.
Unfortunately the issue returns when using the tchiotludo/kafkahq:dev image. In fact it returns a 504 timeout, then a 500, and this appears straight away:
"unable to create native thread: possibly out of memory or process/resource limits reached"
2019-07-01 08:56:13,273 ERROR pGroup-1-2 o.k.c.ErrorController unable to create native thread: possibly out of memory or process/resource limits reached
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:803)
Once I revert back to the 0.7.2 and bounce the container, it's straight back to being very snappy, responsive and such.
Thanks v.much.
Regards cg
Do you have limits on the container (CPU / mem / ...)? If yes, can you share the config please?
Sorry, forgot to mention - there are no imposed limits from OpenShift. We have other workloads running at 2 vCPU/10 GB RAM with no problems. On this occasion KafkaHQ only hit 0.3 vCPU and 1 GB of memory when the error occurred on the dev image.
Right now, back on 0.7.2, it's using a little less than that, but is absolutely flying along nicely.
Perhaps a 0.8.0 image but without the pagination, or am I the only person observing this currently?
As it is, 0.7.2 is brilliant and is a huge help. Thank you.
@cg-nz have you tried to remove your zookeeper container and then docker-compose up?
@parisian Cheers, but that doesn't really apply to our setup using Kubernetes/OpenShift AMQ Streams with strimzi operator. Thanks for suggesting though!
Thanks
@cg-nz can you resend me a log please? I'm lacking ideas here for now... :cry:
@cg-nz
Just got an idea digging the web: 2 options.
The first seems to say that you've reached a user limit on the number of processes (threads) on your node. Can you try to raise it? (A quick sketch for checking those limits follows below.)
Another option (though as I read it, the JVM message is misleading and it's not a memory problem, so this may not be the solution):
can you try to tune the JVM options to see if it works?
Just add the env variable JAVA_OPTS='-Xmx2g -Xms2g'
That should raise the memory available to the JVM.
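A minimal, Linux-only sketch (an assumption, since the exact environment isn't shown here) for checking which limits the KafkaHQ JVM actually received inside the container, including the max-processes cap that governs native thread creation:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Prints the resource limits of the current process as seen from inside the
// container. "Max processes" is the line that caps native thread creation.
public class ShowLimits {
    public static void main(String[] args) throws Exception {
        Files.readAllLines(Path.of("/proc/self/limits"))
             .forEach(System.out::println);
    }
}
```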
Thanks
I've tried using -Xms512m -Xmx4096m with the same issue. We're not out of threads per process on the nodes, as there are dozens of other containers running with more threads than this. Could it be a Micronaut-imposed limit?
As far as I know, Micronaut doesn't enforce this; I will dig into it to be sure.
There is a Prometheus endpoint on KafkaHQ at /prometheus, can you send me the output?
Especially process_files_open_files (e.g. 200.0)
and process_files_max_files (e.g. 1048576.0).
The full output would be nice, since there is also some information about threads on the executors,
jvm_threads_live_threads for example, that can help.
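If it helps, a small sketch for pulling just those metrics from the endpoint; the base URL is an assumption and should point at wherever KafkaHQ is exposed:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Fetches the Prometheus scrape output and keeps only the file-handle and
// thread metrics mentioned above. Adjust the URL to match your deployment.
public class GrabMetrics {
    public static void main(String[] args) throws Exception {
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("http://localhost:8080/prometheus")).build(),
                HttpResponse.BodyHandlers.ofString());
        resp.body().lines()
            .filter(l -> l.startsWith("process_files_") || l.startsWith("jvm_threads_"))
            .forEach(System.out::println);
    }
}
```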
Thanks
closing in favor of #137
Hey There
Thanks again for a great app, it really is superb for what you've done so far.
We seem to be starting to get a number of 504 timeouts caused by the below error:
I've tried changing some of the consumer properties (clients-defaults.consumer.properties.session.timeout.ms | heartbeat.interval.ms), which showed some benefit.
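For context, the two properties mentioned are the standard Kafka consumer session/heartbeat settings; a minimal plain-Java sketch of the same knobs (the broker address, group id, and values are placeholders, not the ones actually used):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class TimeoutTuning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "kafkahq-example");       // placeholder
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");       // session.timeout.ms
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");     // heartbeat.interval.ms, keep well below the session timeout
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(
                props, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {
            // The consumer would normally subscribe and poll here; this only shows the config wiring.
        }
    }
}
```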
This is running against docker image confluentinc/cp-schema-registry:5.2.2, which I've just updated to from the 5.0.1 we were using previously. The Kafka clusters (multiple) each have 3 brokers, running AMQ Streams by Red Hat, Kafka version 2.0.0.
The base image is openjdk:11, with the other files (.jar, kafkahq script, etc.) added in manually.
We have about a dozen microservices that send/receive messages to/from this Kafka, both inside and outside OpenShift, without timeouts etc.
Upping the debugging doesn't seem to show much more in the logs, but I'm more than happy to provide info as needed.
I do note that it appears to start faltering when more than one person is using it concurrently from different PCs.
Application.yml
Thanks