pinterest / DoctorK

DoctorK is a service for Kafka cluster auto healing and workload balancing
Apache License 2.0
633 stars 92 forks

DrKafkaStart failed, NullPointerException getProcessingStartOffsets #87

Open yhzhao opened 5 years ago

yhzhao commented 5 years ago

java -server -cp lib/:doctorkafka-0.2.4.3.jar com.pinterest.doctorkafka.DoctorKafkaMain server dropwizard.yml

ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
INFO [2019-01-17 21:15:44,236] io.dropwizard.server.DefaultServerFactory: Registering jersey handler with root path prefix: /
INFO [2019-01-17 21:15:44,239] io.dropwizard.server.DefaultServerFactory: Registering admin handler with root path prefix: /
INFO [2019-01-17 21:15:44,239] io.dropwizard.assets.AssetsBundle: Registering AssetBundle with name: assets for path /
INFO [2019-01-17 21:15:44,416] com.pinterest.doctorkafka.util.ZookeeperClient: Initialize curator with zkurl:df_zookeeper_dfops_743_1_development_datafabric_kafka_aws_us_ea.aws.athenahealth.com:2181/
INFO [2019-01-17 21:15:44,487] org.apache.curator.utils.Compatibility: Running in ZooKeeper 3.4.x compatibility mode
INFO [2019-01-17 21:15:44,513] org.apache.curator.framework.imps.CuratorFrameworkImpl: Starting
INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:host.name=ip-10-129-102-236.us-east-1.aws.athenahealth.com
INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.8.0_191
INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre
INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client
environment:java.class.path=lib/kafkastats-0.2.4.3.jar:lib/commons-cli-1.3.1.jar:lib/avro-1.8.2.jar:lib/jackson-core-asl-1.9.13.jar:lib/jackson-mapper-asl-1.9.13.jar:lib/paranamer-2.7.jar:lib/snappy-java-1.1.1.3.jar:lib/commons-compress-1.8.1.jar:lib/xz-1.5.jar:lib/slf4j-api-1.7.25.jar:lib/kafka_2.12-1.1.0.jar:lib/kafka-clients-1.1.0.jar:lib/lz4-java-1.4.jar:lib/jackson-databind-2.9.8.jar:lib/jackson-annotations-2.9.8.jar:lib/jackson-core-2.9.8.jar:lib/jopt-simple-5.0.4.jar:lib/metrics-core-2.2.0.jar:lib/scala-library-2.12.7.jar:lib/scala-reflect-2.12.4.jar:lib/scala-logging_2.12-3.7.2.jar:lib/zkclient-0.10.jar:lib/zookeeper-3.4.10.jar:lib/log4j-1.2.16.jar:lib/jline-0.9.94.jar:lib/junit-3.8.1.jar:lib/netty-3.10.5.Final.jar:lib/log4j-core-2.11.1.jar:lib/log4j-api-2.11.1.jar:lib/ostrich_2.12-9.27.0.jar:lib/util-core_2.12-6.43.0.jar:lib/util-function_2.12-6.43.0.jar:lib/scala-parser-combinators_2.12-1.0.4.jar:lib/util-eval_2.12-6.43.0.jar:lib/scala-compiler-2.12.1.jar:lib/scala-xml_2.12-1.0.6.jar:lib/util-logging_2.12-6.43.0.jar:lib/util-app_2.12-6.43.0.jar:lib/util-registry_2.12-6.43.0.jar:lib/util-stats_2.12-6.43.0.jar:lib/util-lint_2.12-6.43.0.jar:lib/caffeine-2.3.4.jar:lib/jsr305-1.3.9.jar:lib/util-jvm_2.12-6.43.0.jar:lib/jackson-module-scala_2.12-2.9.7.jar:lib/jackson-module-paranamer-2.9.7.jar:lib/guava-23.0.jar:lib/error_prone_annotations-2.0.18.jar:lib/j2objc-annotations-1.1.jar:lib/animal-sniffer-annotations-1.14.jar:lib/commons-lang3-3.6.jar:lib/commons-configuration2-2.4.jar:lib/commons-text-1.6.jar:lib/commons-logging-1.2.jar:lib/commons-math3-3.6.1.jar:lib/commons-beanutils-1.9.3.jar:lib/commons-collections-3.2.2.jar:lib/commons-validator-1.6.jar:lib/commons-digester-1.8.1.jar:lib/curator-framework-4.0.1.jar:lib/curator-client-4.0.1.jar:lib/metrics-core-3.2.3.jar:lib/gson-2.8.2.jar:lib/javax.annotation-api-1.3.2.jar:lib/jaxws-api-2.3.1.jar:lib/jaxb-api-2.3.1.jar:lib/javax.activation-api-1.2.0.jar:lib/javax.xml.soap-api-1.4.0.jar:lib/dropwizard-assets-1.3.8.jar:lib/dropwizard-core-1.3.8.jar:lib/dropwizard-util-1.3.8.jar:lib/joda-time-2.9.9.jar:lib/dropwizard-jackson-1.3.8.jar:lib/jackson-datatype-guava-2.9.6.jar:lib/jackson-datatype-jsr310-2.9.6.jar:lib/jackson-datatype-jdk8-2.9.6.jar:lib/jackson-module-parameter-names-2.9.6.jar:lib/jackson-module-afterburner-2.9.6.jar:lib/jackson-datatype-joda-2.9.6.jar:lib/dropwizard-validation-1.3.8.jar:lib/hibernate-validator-5.4.2.Final.jar:lib/validation-api-1.1.0.Final.jar:lib/jboss-logging-3.3.0.Final.jar:lib/classmate-1.3.1.jar:lib/javax.el-3.0.0.jar:lib/javassist-3.22.0-GA.jar:lib/dropwizard-configuration-1.3.8.jar:lib/jackson-dataformat-yaml-2.9.6.jar:lib/snakeyaml-1.18.jar:lib/dropwizard-logging-1.3.8.jar:lib/metrics-logback-4.0.2.jar:lib/jul-to-slf4j-1.7.25.jar:lib/logback-core-1.2.3.jar:lib/logback-classic-1.2.3.jar:lib/jcl-over-slf4j-1.7.25.jar:lib/jetty-util-9.4.14.v20181114.jar:lib/dropwizard-metrics-1.3.8.jar:lib/dropwizard-lifecycle-1.3.8.jar:lib/javax.servlet-api-3.1.0.jar:lib/jetty-server-9.4.14.v20181114.jar:lib/jetty-http-9.4.14.v20181114.jar:lib/jetty-io-9.4.14.v20181114.jar:lib/dropwizard-jersey-1.3.8.jar:lib/jersey-server-2.25.1.jar:lib/jersey-common-2.25.1.jar:lib/javax.ws.rs-api-2.0.1.jar:lib/jersey-guava-2.25.1.jar:lib/hk2-api-2.5.0-b32.jar:lib/hk2-utils-2.5.0-b32.jar:lib/aopalliance-repackaged-2.5.0-b32.jar:lib/javax.inject-2.5.0-b32.jar:lib/hk2-locator-2.5.0-b32.jar:lib/osgi-resource-locator-1.0.1.jar:lib/jersey-client-2.25.1.jar:lib/jersey-media-jaxb-2.25.1.jar:lib/jersey-metainf-services-2.25.1.jar:lib/jersey-bean-v
alidation-2.25.1.jar:lib/metrics-jersey2-4.0.2.jar:lib/metrics-annotation-4.0.2.jar:lib/jackson-jaxrs-json-provider-2.9.6.jar:lib/jackson-jaxrs-base-2.9.6.jar:lib/jackson-module-jaxb-annotations-2.9.6.jar:lib/jersey-container-servlet-2.25.1.jar:lib/jersey-container-servlet-core-2.25.1.jar:lib/jetty-webapp-9.4.14.v20181114.jar:lib/jetty-xml-9.4.14.v20181114.jar:lib/jetty-servlet-9.4.14.v20181114.jar:lib/jetty-security-9.4.14.v20181114.jar:lib/jetty-continuation-9.4.14.v20181114.jar:lib/dropwizard-servlets-1.3.8.jar:lib/dropwizard-jetty-1.3.8.jar:lib/metrics-jetty9-4.0.2.jar:lib/jetty-servlets-9.4.14.v20181114.jar:lib/metrics-jvm-4.0.2.jar:lib/metrics-jmx-4.0.2.jar:lib/metrics-servlets-4.0.2.jar:lib/metrics-healthchecks-4.0.2.jar:lib/metrics-json-4.0.2.jar:lib/profiler-1.0.2.jar:lib/dropwizard-request-logging-1.3.8.jar:lib/logback-access-1.2.3.jar:lib/argparse4j-0.8.1.jar:lib/jetty-setuid-java-1.0.3.jar:lib/dropwizard-auth-1.3.8.jar:doctorkafka-0.2.4.3.jar INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:java.compiler= INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64 INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:os.version=3.10.0-862.3.2.el7.x86_64 INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:user.name=yuzhao INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:user.home=/home/yuzhao INFO [2019-01-17 21:15:44,524] org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/home/yuzhao INFO [2019-01-17 21:15:44,525] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=df_zookeeper_dfops_743_1_development_datafabric_kafka_aws_us_ea.aws.athenahealth.com:2181/ sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@77681ce4 INFO [2019-01-17 21:15:44,619] org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.129.107.71/10.129.107.71:2181. 
Will not attempt to authenticate using SASL (unknown error) INFO [2019-01-17 21:15:44,627] org.apache.curator.framework.imps.CuratorFrameworkImpl: Default schema INFO [2019-01-17 21:15:44,636] org.apache.zookeeper.ClientCnxn: Socket connection established to 10.129.107.71/10.129.107.71:2181, initiating session INFO [2019-01-17 21:15:44,648] io.dropwizard.server.ServerFactory: Starting DoctorKafkaMain INFO [2019-01-17 21:15:44,650] org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.129.107.71/10.129.107.71:2181, sessionid = 0x1684d33f9000020, negotiated timeout = 40000 INFO [2019-01-17 21:15:44,669] org.apache.curator.framework.state.ConnectionStateManager: State change: CONNECTED INFO [2019-01-17 21:15:44,845] org.eclipse.jetty.setuid.SetUIDListener: Opened application@6f044c58{HTTP/1.1,[http/1.1]}{0.0.0.0:8080} INFO [2019-01-17 21:15:44,847] org.eclipse.jetty.server.Server: jetty-9.4.14.v20181114; built: 2018-11-14T21:20:31.478Z; git: c4550056e785fb5665914545889f21dc136ad9e6; jvm 1.8.0_191-b12 INFO [2019-01-17 21:15:45,021] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=df_zookeeper_dfops_743_1_development_datafabric_kafka_aws_us_ea.aws.athenahealth.com:2181/zookeeper_datafabric_development_dfops_743_1_kafka_aws_us_east_1_df_zookeeper sessionTimeout=30000 watcher=org.I0Itec.zkclient.ZkClient@277a8c9 INFO [2019-01-17 21:15:45,022] org.I0Itec.zkclient.ZkEventThread: Starting ZkClient event thread. INFO [2019-01-17 21:15:45,024] org.I0Itec.zkclient.ZkClient: Waiting for keeper state SyncConnected INFO [2019-01-17 21:15:45,025] org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.129.107.71/10.129.107.71:2181. Will not attempt to authenticate using SASL (unknown error) INFO [2019-01-17 21:15:45,027] org.apache.zookeeper.ClientCnxn: Socket connection established to 10.129.107.71/10.129.107.71:2181, initiating session INFO [2019-01-17 21:15:45,033] org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.129.107.71/10.129.107.71:2181, sessionid = 0x1684d33f9000021, negotiated timeout = 30000 INFO [2019-01-17 21:15:45,040] org.I0Itec.zkclient.ZkClient: zookeeper state changed (SyncConnected) INFO [2019-01-17 21:15:45,055] kafka.utils.Log4jControllerRegistration$: Registered kafka:type=kafka.Log4jController MBean INFO [2019-01-17 21:15:45,419] org.apache.kafka.clients.consumer.ConsumerConfig: ConsumerConfig values: auto.commit.interval.ms = 5000 auto.offset.reset = latest bootstrap.servers = [10.129.105.28:9092, 10.129.104.201:9092, 10.129.103.217:9092] check.crcs = true client.id = connections.max.idle.ms = 540000 enable.auto.commit = false exclude.internal.topics = true fetch.max.bytes = 52428800 fetch.max.wait.ms = 500 fetch.min.bytes = 1 group.id = doctorkafka heartbeat.interval.ms = 3000 interceptor.classes = [] internal.leave.group.on.close = true isolation.level = read_uncommitted key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer max.partition.fetch.bytes = 4194304 max.poll.interval.ms = 1 max.poll.records = 500 metadata.max.age.ms = 300000 metric.reporters = [] metrics.num.samples = 2 metrics.recording.level = INFO metrics.sample.window.ms = 30000 partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor] receive.buffer.bytes = 65536 reconnect.backoff.max.ms = 1000 reconnect.backoff.ms = 50 request.timeout.ms = 305000 retry.backoff.ms = 100 sasl.jaas.config = null sasl.kerberos.kinit.cmd = /usr/bin/kinit 
sasl.kerberos.min.time.before.relogin = 60000 sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 sasl.kerberos.ticket.renew.window.factor = 0.8 sasl.mechanism = GSSAPI security.protocol = PLAINTEXT send.buffer.bytes = 131072 session.timeout.ms = 10000 ssl.cipher.suites = null ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] ssl.endpoint.identification.algorithm = null ssl.key.password = null ssl.keymanager.algorithm = SunX509 ssl.keystore.location = null ssl.keystore.password = null ssl.keystore.type = JKS ssl.protocol = TLS ssl.provider = null ssl.secure.random.implementation = null ssl.trustmanager.algorithm = PKIX ssl.truststore.location = null ssl.truststore.password = null ssl.truststore.type = JKS value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer

INFO [2019-01-17 21:15:45,486] org.apache.kafka.common.utils.AppInfoParser: Kafka version : 1.1.0
INFO [2019-01-17 21:15:45,486] org.apache.kafka.common.utils.AppInfoParser: Kafka commitId : fdcf75ea326b8e07
INFO [2019-01-17 21:15:45,625] org.apache.kafka.clients.Metadata: Cluster ID: M4N2fNprQYKCrKhKzAH4zQ
21:15:45.676 [pool-6-thread-1] ERROR com.pinterest.doctorkafka.DoctorKafkaMain - DoctorKafka start failed
java.lang.NullPointerException: null
    at com.pinterest.doctorkafka.replicastats.ReplicaStatsManager.getProcessingStartOffsets(ReplicaStatsManager.java:193) ~[doctorkafka-0.2.4.3.jar:?]
    at com.pinterest.doctorkafka.replicastats.ReplicaStatsManager.readPastReplicaStats(ReplicaStatsManager.java:216) ~[doctorkafka-0.2.4.3.jar:?]
    at com.pinterest.doctorkafka.DoctorKafka.start(DoctorKafka.java:51) ~[doctorkafka-0.2.4.3.jar:?]
    at com.pinterest.doctorkafka.DoctorKafkaMain.lambda$run$0(DoctorKafkaMain.java:68) ~[doctorkafka-0.2.4.3.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_191]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
INFO [2019-01-17 21:15:45,782] io.dropwizard.jersey.DropwizardResourceConfig: The following paths were found for the configured resources:

GET     /api/cluster (com.pinterest.doctorkafka.api.ClusterApi)
DELETE  /api/cluster/{clusterName}/admin/maintenance (com.pinterest.doctorkafka.api.MaintenanceApi)
GET     /api/cluster/{clusterName}/admin/maintenance (com.pinterest.doctorkafka.api.MaintenanceApi)
PUT     /api/cluster/{clusterName}/admin/maintenance (com.pinterest.doctorkafka.api.MaintenanceApi)
GET     /api/cluster/{clusterName}/broker (com.pinterest.doctorkafka.api.BrokerApi)

INFO [2019-01-17 21:15:45,784] org.eclipse.jetty.server.handler.ContextHandler: Started i.d.j.MutableServletContextHandler@1961d75a{/,null,AVAILABLE}
INFO [2019-01-17 21:15:45,797] org.eclipse.jetty.server.AbstractConnector: Started application@6f044c58{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
INFO [2019-01-17 21:15:45,797] org.eclipse.jetty.server.Server: Started @3572ms

^C
21:17:16.003 [Thread-2] ERROR com.pinterest.doctorkafka.DoctorKafkaMain - Failure in stopping operator
java.lang.NullPointerException: null
    at com.pinterest.doctorkafka.DoctorKafka.stop(DoctorKafka.java:78) ~[doctorkafka-0.2.4.3.jar:?]
    at com.pinterest.doctorkafka.DoctorKafkaMain$OperatorCleanupThread.run(DoctorKafkaMain.java:153) [doctorkafka-0.2.4.3.jar:?]
INFO [2019-01-17 21:17:16,006] org.eclipse.jetty.server.AbstractConnector: Stopped application@6f044c58{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
INFO [2019-01-17 21:17:16,012] org.eclipse.jetty.server.handler.ContextHandler: Stopped i.d.j.MutableServletContextHandler@1961d75a{/,null,UNAVAILABLE}

yhzhao commented 5 years ago

Also, the brokerstats topic has nothing in it. Does that mean the kafkastats processes running on the brokers are not sending anything to the topic?

yuyang08 commented 5 years ago

Can you tell us which message format version your cluster uses? Based on the exception, doctorkafka failed to find the starting offsets of the kafkastats topic. You can use the Kafka command-line tool kafka-console-consumer.sh to check whether there are messages in the kafkastats topic.

https://github.com/pinterest/doctorkafka/blob/master/drkafka/src/main/java/com/pinterest/doctorkafka/replicastats/ReplicaStatsManager.java#L190
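
As an alternative to the console consumer, here is a minimal programmatic sketch of the same check; the topic name "brokerstats" and the bootstrap servers below are placeholders for your configured values, not DoctorKafka code. A partition holds data only if its end offset is greater than its beginning offset.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class StatsTopicCheck {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      String topic = "brokerstats";                                         // placeholder
      List<TopicPartition> partitions = new ArrayList<>();
      for (PartitionInfo p : consumer.partitionsFor(topic)) {
        partitions.add(new TopicPartition(p.topic(), p.partition()));
      }
      Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
      Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
      for (TopicPartition tp : partitions) {
        // the partition is empty when its end offset equals its beginning offset
        System.out.println(tp + " beginning=" + begin.get(tp) + " end=" + end.get(tp));
      }
    }
  }
}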

yhzhao commented 5 years ago

There are no messages in that topic when using that command. What is the expected version of Kafka? 2.11-2.1.0 seems to work fine (at least I don't see this problem).

yhzhao commented 5 years ago

2.11-1.1.0 seems to have the problem.

yhzhao commented 5 years ago

Actually, the 0.2.4 version of kafkastats does publish messages to the brokerstats topic; 0.2.4.3 does not. Feels like a version mismatch.

BrianGallew commented 5 years ago

I'm seeing the same issue. I've got a half dozen Kafka clusters with all their brokers running kafkastats-0.2.4.3. I've verified that they are writing data into the brokerstats topic. I've even truncated data to the last hour to ensure that any old stats data (written by an older kafkastats client) has been purged. But I, too, am getting the traceback in my log:

2019-01-23 20:53:18.892 [pool-6-thread-1] ERROR com.pinterest.doctorkafka.DoctorKafkaMain - DoctorKafka start failed
java.lang.NullPointerException: null
    at com.pinterest.doctorkafka.replicastats.ReplicaStatsManager.getProcessingStartOffsets(ReplicaStatsManager.java:193) ~[doctorkafka-0.2.4.3-jar-with-dependencies.jar:?]
    at com.pinterest.doctorkafka.replicastats.ReplicaStatsManager.readPastReplicaStats(ReplicaStatsManager.java:216) ~[doctorkafka-0.2.4.3-jar-with-dependencies.jar:?]
    at com.pinterest.doctorkafka.DoctorKafka.start(DoctorKafka.java:51) ~[doctorkafka-0.2.4.3-jar-with-dependencies.jar:?]
    at com.pinterest.doctorkafka.DoctorKafkaMain.lambda$run$0(DoctorKafkaMain.java:68) ~[doctorkafka-0.2.4.3-jar-with-dependencies.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_191]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]

BrianGallew commented 5 years ago

If I wrap ReplicaStatsManager.java:193 in a try/catch block, DoctorKafka starts up.
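
For readers following along, a minimal sketch of that kind of guard inside getProcessingStartOffsets (not the actual patch; in 0.2.4.3, line 193 is the partitionMap.put call that dereferences entry.getValue()):

    for (Entry<TopicPartition, OffsetAndTimestamp> entry : offsetsForTimes.entrySet()) {
      try {
        partitionMap.put(entry.getKey(), entry.getValue().offset());
      } catch (NullPointerException e) {
        // entry.getValue() is null when the broker returns no offset for this
        // partition, so skip it and let startup proceed
        LOG.warn("No start offset for {}, skipping", entry.getKey());
      }
    }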

ambud commented 5 years ago

@yhzhao could you please check the Kafka message format of the broker stats topic? We expect at least 0.10.0 for Dr.Kafka 0.2.4.3 and above. Starting with 0.2.4.3 we switched to using the Kafka time index to find past offsets for broker stats, which improves Dr.Kafka's startup performance. In short, the expected Kafka message format version is 0.10.0 or higher.

From the Javadoc of KafkaConsumer.offsetsForTimes():

If the message format version in a partition is before 0.10.0, i.e. the messages do not have timestamps, null will be returned for that partition.
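
To make that contract concrete, here is a small self-contained sketch (the bootstrap servers, topic name, and one-day lookback are placeholders, not DoctorKafka's actual values). Any partition whose lookup the broker cannot answer comes back as a null value in the returned map, which is exactly what getProcessingStartOffsets ends up dereferencing.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class OffsetsForTimesDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      String topic = "brokerstats";                                         // placeholder
      long lookbackTs = System.currentTimeMillis() - 24 * 3600 * 1000L;     // placeholder window
      Map<TopicPartition, Long> query = new HashMap<>();
      for (PartitionInfo p : consumer.partitionsFor(topic)) {
        query.put(new TopicPartition(p.topic(), p.partition()), lookbackTs);
      }
      Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
      for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : offsets.entrySet()) {
        // null means the broker could not answer the time-index lookup for this
        // partition (pre-0.10.0 message format, or no record at/after the timestamp)
        System.out.println(e.getKey() + " -> "
            + (e.getValue() == null ? "null" : e.getValue().offset()));
      }
    }
  }
}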

BrianGallew commented 5 years ago

In my case, that has been set to log.message.format.version=0.11.0, so that should be good.
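
For anyone else verifying this, a hedged sketch using AdminClient to read the effective message format for the stats topic; the topic name and bootstrap servers are placeholders. The broker-level setting is log.message.format.version, and the topic-level (inherited or overridden) entry is message.format.version.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class MessageFormatCheck {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    try (AdminClient admin = AdminClient.create(props)) {
      // topic-level view; falls back to the broker's log.message.format.version
      ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "brokerstats");
      Config config = admin.describeConfigs(Collections.singleton(topic))
          .all().get().get(topic);
      System.out.println("message.format.version = "
          + config.get("message.format.version").value());
    }
  }
}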

BrianGallew commented 5 years ago

Also, I just set up a brand-new doctorkafka installation, all 0.2.4.3, and am having the same issue. I suspect that the kafkastats daemons are occasionally sending incomplete data that makes doctorkafka sad.

yuyang08 commented 5 years ago

@BrianGallew Thanks for the info! We will look into the issue and fix it. cc @ambud

BrianGallew commented 5 years ago

OK, the following diff will reliably 'fix' the problem, but of course it's just a bandage.

    for (Entry<TopicPartition, OffsetAndTimestamp> entry : offsetsForTimes.entrySet()) {
      if (entry.getValue() == null) {
        LOG.info("Faking a zero for {}", entry.getKey());
        partitionMap.put(entry.getKey(), 0L);
      } else {
        partitionMap.put(entry.getKey(), entry.getValue().offset());
      }
    }

I'm not sure, but I think the issue is that partitions which have never been written to end up having no offset data, and doctorkafka really wants it to be there.

BrianGallew commented 5 years ago

And, yeah, that's not a diff, that's just lines 192-199 of ReplicaStatsManager.java. Here's an actual diff.

diff --git a/drkafka/src/main/java/com/pinterest/doctorkafka/replicastats/ReplicaStatsManager.java b/drkafka/src/main/java/com/pinterest/doctorkafka/replicastats/ReplicaStatsManager.java
index c680125..eee5b47 100644
--- a/drkafka/src/main/java/com/pinterest/doctorkafka/replicastats/ReplicaStatsManager.java
+++ b/drkafka/src/main/java/com/pinterest/doctorkafka/replicastats/ReplicaStatsManager.java
@@ -192,3 +192,8 @@ public class ReplicaStatsManager {
     for (Entry<TopicPartition, OffsetAndTimestamp> entry : offsetsForTimes.entrySet()) {
-      partitionMap.put(entry.getKey(), entry.getValue().offset());
+      if (entry.getValue() == null) {
+       LOG.info("Faking a zero for {}", entry.getKey());
+       partitionMap.put(entry.getKey(), 0L);
+      } else {
+       partitionMap.put(entry.getKey(), entry.getValue().offset());
+      }
     }

ambud commented 5 years ago

@BrianGallew could you please open a PR with the above diff?

The issue to investigate would be why no stats are being written to certain partitions of the stats topic.

BrianGallew commented 5 years ago

I can (and will!), but are we sure it doesn't just mask a real problem?

saherahwal commented 4 years ago

I am facing the same problem. Is this issue closed? Should I just mask the problem as in the diff above?