kubernetes-retired / heapster

[EOL] Compute Resource Usage Analysis and Monitoring of Container Clusters
Apache License 2.0

Kafka sink does not recover after Kafka restart #2017

Closed: mikenorgate closed this issue 6 years ago

mikenorgate commented 6 years ago

Description

Heapster is unable to send events to Kafka after the Kafka cluster has been torn down and restarted.

Steps to reproduce:

1. Run Heapster with the Kafka sink configured.
2. Tear down and restart the Kafka cluster.
3. Heapster keeps logging the errors below and never resumes pushing data.

Errors from Heapster logs

W0418 19:37:25.141498       1 manager.go:119] Failed to push data to sink: Apache Kafka Sink
E0418 19:37:26.095386       1 driver.go:55] Failed to produce metric message: failed to produce message to heapster-metrics: cannot fetch metadata. No topics created?
E0418 19:37:27.898317       1 driver.go:55] Failed to produce metric message: failed to produce message to heapster-metrics: cannot fetch metadata. No topics created?
E0418 19:37:29.723280       1 driver.go:55] Failed to produce metric message: failed to produce message to heapster-metrics: cannot fetch metadata. No topics created?
E0418 19:37:31.589612       1 driver.go:55] Failed to produce metric message: failed to produce message to heapster-metrics: cannot fetch metadata. No topics created?

Output of heapster --version:

1.4.3

bartebor commented 6 years ago

The same happens in version 1.5.2.

Kokan commented 6 years ago

Is it possible that it is the same issue as #1672?

bartebor commented 6 years ago

#1672 looks like a DNS cache problem: after an IP change the sink no longer works. It is possible that the OP is in this situation because of using cluster DNS.

In my case, however, there is no IP change ("static" DNS), just a broker restart, possibly with some maintenance measures, e.g. topic truncation.
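To make the distinction between the two reports concrete, here is a minimal, hypothetical Go illustration of the stale-DNS case from #1672. The service name is an assumption, and this is not Heapster's actual sink code; it only shows why resolving the broker address once at startup breaks when the broker pod comes back under a new IP behind the same cluster DNS name.

```go
package main

import (
	"fmt"
	"net"
)

// Illustration of the #1672 hypothesis, using an assumed in-cluster service
// name: if the broker address is resolved once when the sink is created and
// then reused forever, a broker that comes back under a new pod IP (same DNS
// name) is never reached again. Re-resolving before each reconnect avoids this.
func main() {
	const broker = "kafka.default.svc.cluster.local" // hypothetical service name

	addrs, err := net.LookupHost(broker)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}

	// Caching addrs[0] and dialing it on every retry is the stale-DNS failure
	// mode; calling net.LookupHost again before each reconnect is the fix.
	fmt.Printf("resolved %s -> %v (valid only until the pod is rescheduled)\n", broker, addrs)
}
```

In bartebor's case the resolved address never changes, so the stale-DNS explanation does not apply and the non-recovering connection itself is the suspect.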

mikenorgate commented 6 years ago

I am using cluster DNS, so it could well be a DNS cache issue that I am seeing.

Kokan commented 6 years ago

I think the root cause is simple in both cases: when an error occurs, the connection is not properly re-created. You can see the expected behavior with other sinks, where the producer is re-created on failure.

I had a patch, but it is now lost in my former company's private GitLab, and the code has probably moved on since then. If you want to address this I can give you some guidance, but I do not see myself having time to fix it.
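For illustration, here is a minimal sketch of the recreate-on-failure pattern described above. It deliberately uses a hypothetical `producer` interface and `producerFactory` rather than the exact client API Heapster's Kafka sink used at the time; the point is only the structure: on a produce error, drop the current producer so the next push dials the brokers (and re-resolves DNS) again instead of reusing a dead connection.

```go
package kafka

import (
	"sync"

	"github.com/golang/glog"
)

// producer is a stand-in for the real Kafka client's producer type; the
// actual interface in Heapster's sink differs.
type producer interface {
	Produce(topic string, payload []byte) error
	Close() error
}

// producerFactory is assumed to dial the brokers and build a fresh producer,
// re-resolving the broker addresses in the process.
type producerFactory func() (producer, error)

type recoveringSink struct {
	sync.Mutex
	newProducer producerFactory
	prod        producer
}

// push sends one message and, on failure, discards the current producer so
// the next attempt creates a new one instead of retrying a broken connection.
func (s *recoveringSink) push(topic string, payload []byte) error {
	s.Lock()
	defer s.Unlock()

	if s.prod == nil {
		p, err := s.newProducer()
		if err != nil {
			return err
		}
		s.prod = p
	}

	if err := s.prod.Produce(topic, payload); err != nil {
		glog.Errorf("Failed to produce message to %s: %v; recreating producer", topic, err)
		s.prod.Close()
		s.prod = nil // force a fresh connection on the next push
		return err
	}
	return nil
}
```

With this structure, a Kafka restart causes a burst of errors followed by recovery on the next successful dial, instead of the permanent "cannot fetch metadata" loop shown in the original report.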

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 6 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 6 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/heapster/issues/2017#issuecomment-429639676):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.