external-dns quietly stops working

davejab commented 1 year ago

What happened:

external-dns quietly stops executing. It does not log an error and does not recover until the pod is manually deleted.

What you expected to happen:

Either for external-dns to continue executing as normal, or for it to error and register the pod as unhealthy, prompting a replacement.

How to reproduce it (as minimally and precisely as possible):

Unable to reproduce consistently; the issue is intermittent.

Anything else we need to know?:

Originally we thought we might be hitting an API limit with AWS, so we added --aws-zones-cache-duration=24h, since the zone list does not change in our environment. This has made no difference, however.

Environment:

External-DNS version (use external-dns --version): v20230327-v0.13.4
DNS provider: AWS Route53
Helm Chart Version: 1.12.2

Helm Chart Values:


env:
  - name: AWS_DEFAULT_REGION
    value: eu-west-1
  - name: AWS_STS_REGIONAL_ENDPOINTS
    value: regional
  - name: http_proxy
    value: exampe.proxy
  - name: https_proxy
    value: example.proxy
  - name: no_proxy
    value: 169.254.169.254,s3.eu-west-1.amazonaws.com,172.20.0.1,sts.eu-west-1.amazonaws.com

txtPrefix: "registry-"
policy: sync

extraArgs: [ "--aws-zones-cache-duration=24h" ]

logLevel: debug

resources:
  limits:
    cpu: 100m
    memory: 100Mi
  requests:
    cpu: 100m
    memory: 50Mi

podSecurityContext:
  fsGroup: 65534

securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534

image:
  repository: XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/k8s.gcr.io/external-dns/external-dns

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXXX:role/external-dns

domainFilters: [ "example.zone" ]

- Logs:
Last logs before it stops working. Note that the last log entry is at 00:26, while the pod was still reported "healthy" at 09:54.

time="2023-04-27T00:25:19Z" level=debug msg="Using cached zones list"
time="2023-04-27T00:25:19Z" level=debug msg="Adding external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-cname-external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-cname-external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE external-dns-test-gbmhqlmrmvtxmgc.example.zone CNAME [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE registry-cname-external-dns-test-gbmhqlmrmvtxmgc.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE registry-external-dns-test-gbmhqlmrmvtxmgc.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE external-dns-test-rzosxxpsexpluhg.example.zone CNAME [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE registry-cname-external-dns-test-rzosxxpsexpluhg.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE registry-external-dns-test-rzosxxpsexpluhg.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="6 record(s) in zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX] were successfully updated"
time="2023-04-27T00:26:20Z" level=debug msg="Using cached zones list"

JackFlukinger commented 1 year ago

I'm having this same issue. For context, I am running with these args:

"--source=ingress", "--provider=aws", "--aws-zone-type=public", "--aws-prefer-cname", "--registry=txt", "--txt-owner-id=external-dns-${var.name}", "--txt-prefix=external-dns"

szuecs commented 1 year ago

So you expect any kind of health log line?

Right now I don't see that this is a bug, but maybe you can explain it to us. Did an ingress change and external-dns didn't update the records?

davejab commented 1 year ago

@szuecs the issue is that the application stops processing with no indication of why, and it requires manual intervention (deleting the pod) before it starts processing again. At the very least, I would expect the pod to become aware of this and intervene before it becomes a problem.

bocan commented 1 year ago

Also, when this happens, the liveness probe and the readiness probe never get tripped. /healthz on port 80 still merrily reports that everything is fine.

szuecs commented 1 year ago

I think we run it in much the same way (without Helm) and don't really see any issue like that across 200 clusters, which is why I wonder. I need more information to understand what happens.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 6 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/external-dns/issues/3574#issuecomment-2014327673):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue with `/reopen`
>- Mark this issue as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close not-planned
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

TLmaK0 commented 6 months ago

/reopen

I see this behavior also, these are the logs from the single pod:

external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="config: {APIServerURL: KubeConfig: RequestTimeout:30s DefaultTargets:[] GlooNamespaces:[gloo-system] SkipperRouteGroupVersion:zalando.org/v1 Sources:[service ingress] Namespace: AnnotationFilter: LabelFilter: IngressClassNames:[] FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false IgnoreIngressTLSSpec:false IgnoreIngressRulesSpec:false GatewayNamespace: GatewayLabelFilter: Compatibility: PublishInternal:false PublishHostIP:false AlwaysPublishNotReadyAddresses:false ConnectorSourceServer:localhost:8080 Provider:aws GoogleProject: GoogleBatchChangeSize:1000 GoogleBatchChangeInterval:1s GoogleZoneVisibility: DomainFilter:[] ExcludeDomains:[] RegexDomainFilter: RegexDomainExclusion: ZoneNameFilter:[] ZoneIDFilter:[] TargetNetFilter:[] ExcludeTargetNets:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSAssumeRoleExternalID: AWSBatchChangeSize:1000 AWSBatchChangeInterval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AWSZoneCacheDuration:0s AWSSDServiceCleanup:false AWSDynamoDBRegion: AWSDynamoDBTable:external-dns AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: AzureSubscriptionID: AzureUserAssignedIdentityClientID: BluecatDNSConfiguration: BluecatConfigFile:/etc/kubernetes/bluecat.json BluecatDNSView: BluecatGatewayHost: BluecatRootZone: BluecatDNSServerName: BluecatDNSDeployType:no-deploy BluecatSkipTLSVerify:false CloudflareProxied:false CloudflareDNSRecordsPerPage:100 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false AkamaiServiceConsumerDomain: AkamaiClientToken: AkamaiClientSecret: AkamaiAccessToken: AkamaiEdgercPath: AkamaiEdgercSection: InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 InfobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 InfobloxFQDNRegEx: InfobloxNameRegEx: InfobloxCreatePTR:false InfobloxCacheDuration:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConfigFile:/etc/kubernetes/oci.yaml OCICompartmentOCID: OCIAuthInstancePrincipal:false InMemoryZones:[] OVHEndpoint:ovh-eu OVHApiRateLimit:20 PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSSkipTLSVerify:false TLSCA: TLSClientCert: TLSClientCertKey: Policy:sync Registry:txt TXTOwnerID:external-dns TXTPrefix: TXTSuffix: TXTEncryptEnabled:false TXTEncryptAESKey: Interval:1m0s MinEventSyncInterval:5s Once:false DryRun:false UpdateEvents:false LogFormat:text MetricsAddress::7979 LogLevel:info TXTCacheInterval:0s TXTWildcardReplacement: ExoscaleEndpoint: ExoscaleAPIKey: ExoscaleAPISecret: ExoscaleAPIEnvironment:api ExoscaleAPIZone:ch-gva-2 CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: ResolveServiceLoadBalancerHostname:false RFC2136Host: RFC2136Port:0 RFC2136Zone: RFC2136Insecure:false RFC2136GSSTSIG:false RFC2136KerberosRealm: RFC2136KerberosUsername: RFC2136KerberosPassword: RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false RFC2136MinTTL:0s RFC2136BatchChangeSize:50 NS1Endpoint: NS1IgnoreSSL:false NS1MinTTLSeconds:0 TransIPAccountName: TransIPPrivateKeyFile: DigitalOceanAPIPageSize:50 ManagedDNSRecordTypes:[A AAAA CNAME] ExcludeDNSRecordTypes:[] GoDaddyAPIKey: GoDaddySecretKey: GoDaddyTTL:0 GoDaddyOTE:false OCPRouterName: IBMCloudProxied:false IBMCloudConfigFile:/etc/kubernetes/ibmcloud.json TencentCloudConfigFile:/etc/kubernetes/tencent-cloud.json TencentCloudZoneType: PiholeServer: PiholePassword: PiholeTLSInsecureSkipVerify:false PluralCluster: PluralProvider: WebhookProviderURL:http://localhost:8888 WebhookProviderReadTimeout:5s WebhookProviderWriteTimeout:10s WebhookServer:false}"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Instantiating new Kubernetes client"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Using inCluster-config based on serviceaccount-token"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Created Kubernetes client https://172.20.0.1:443"

Killing the pod, or deleting and recreating the deployment, doesn't solve the problem.

k8s-ci-robot commented 6 months ago

@TLmaK0: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-sigs/external-dns/issues/3574#issuecomment-2014461410).

kubernetes-sigs / external-dns

external-dns quietly stops working #3574