Closed: davejab closed this issue 6 months ago
I'm having this same issue. For context, I am running with these args:
"--source=ingress", "--provider=aws", "--aws-zone-type=public", "--aws-prefer-cname", "--registry=txt", "--txt-owner-id=external-dns-${var.name}", "--txt-prefix=external-dns"
So do you expect some kind of health log line?
Right now I don't see that this is a bug, but maybe you can explain it to us. Did an ingress change without external-dns updating the records?
@szuecs the issue is that the application stops processing with no indication of why, and requires manual intervention (deleting the pod) before it can start processing again. I would expect at the very least that the pod would become aware of this and intervene before it became a problem.
Also, when this happens, the liveness probe and the readiness probe never get tripped. /healthz on port 80 still merrily reports that everything is fine.
I think we run it close to the same way (no helm) and don't really see any issue like that in 200 clusters, which is why I wonder. I need more information to understand what happens.
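The expectation described here, that a stalled sync loop should eventually fail the health probes, can be sketched as a staleness check. This is only an illustration of the idea and not how external-dns implements /healthz: the `SyncWatchdog` class, its method names, and the "three missed intervals" threshold are all assumptions made up for this example.

```python
import time
from typing import Callable, Tuple


class SyncWatchdog:
    """Tracks the last successful sync and reports staleness.

    Hypothetical helper for illustration; not part of external-dns.
    """

    def __init__(self, interval_s: float, max_missed: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        # Tolerate a few missed cycles before declaring the loop stalled.
        self.max_age_s = interval_s * max_missed
        self._clock = clock
        self._last_sync = clock()

    def mark_sync(self) -> None:
        """Call at the end of every successful reconcile cycle."""
        self._last_sync = self._clock()

    def age_s(self) -> float:
        """Seconds elapsed since the last successful sync."""
        return self._clock() - self._last_sync

    def is_healthy(self) -> bool:
        return self.age_s() <= self.max_age_s

    def healthz(self) -> Tuple[int, str]:
        """Return an (HTTP status, body) pair for a health endpoint."""
        if self.is_healthy():
            return 200, "ok"
        return 503, f"no successful sync for {self.age_s():.0f}s"
```

With a check like this behind the health endpoint, a loop that silently stops calling `mark_sync()` would start returning 503 after the threshold, and the kubelet could restart the pod instead of someone deleting it by hand.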
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After inactivity, lifecycle/stale is applied
- After further inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- /remove-lifecycle stale
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After inactivity, lifecycle/stale is applied
- After further inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- /remove-lifecycle rotten
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After inactivity, lifecycle/stale is applied
- After further inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- /reopen
- /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
I see this behavior too. These are the logs from the single pod:
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="config: {APIServerURL: KubeConfig: RequestTimeout:30s DefaultTargets:[] GlooNamespaces:[gloo-system] SkipperRouteGroupVersion:zalando.org/v1 Sources:[service ingress] Namespace: AnnotationFilter: LabelFilter: IngressClassNames:[] FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false IgnoreIngressTLSSpec:false IgnoreIngressRulesSpec:false GatewayNamespace: GatewayLabelFilter: Compatibility: PublishInternal:false PublishHostIP:false AlwaysPublishNotReadyAddresses:false ConnectorSourceServer:localhost:8080 Provider:aws GoogleProject: GoogleBatchChangeSize:1000 GoogleBatchChangeInterval:1s GoogleZoneVisibility: DomainFilter:[] ExcludeDomains:[] RegexDomainFilter: RegexDomainExclusion: ZoneNameFilter:[] ZoneIDFilter:[] TargetNetFilter:[] ExcludeTargetNets:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSAssumeRoleExternalID: AWSBatchChangeSize:1000 AWSBatchChangeInterval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AWSZoneCacheDuration:0s AWSSDServiceCleanup:false AWSDynamoDBRegion: AWSDynamoDBTable:external-dns AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: AzureSubscriptionID: AzureUserAssignedIdentityClientID: BluecatDNSConfiguration: BluecatConfigFile:/etc/kubernetes/bluecat.json BluecatDNSView: BluecatGatewayHost: BluecatRootZone: BluecatDNSServerName: BluecatDNSDeployType:no-deploy BluecatSkipTLSVerify:false CloudflareProxied:false CloudflareDNSRecordsPerPage:100 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false AkamaiServiceConsumerDomain: AkamaiClientToken: AkamaiClientSecret: AkamaiAccessToken: AkamaiEdgercPath: AkamaiEdgercSection: InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 InfobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 InfobloxFQDNRegEx: InfobloxNameRegEx: InfobloxCreatePTR:false InfobloxCacheDuration:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConfigFile:/etc/kubernetes/oci.yaml OCICompartmentOCID: OCIAuthInstancePrincipal:false InMemoryZones:[] OVHEndpoint:ovh-eu OVHApiRateLimit:20 PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSSkipTLSVerify:false TLSCA: TLSClientCert: TLSClientCertKey: Policy:sync Registry:txt TXTOwnerID:external-dns TXTPrefix: TXTSuffix: TXTEncryptEnabled:false TXTEncryptAESKey: Interval:1m0s MinEventSyncInterval:5s Once:false DryRun:false UpdateEvents:false LogFormat:text MetricsAddress::7979 LogLevel:info TXTCacheInterval:0s TXTWildcardReplacement: ExoscaleEndpoint: ExoscaleAPIKey: ExoscaleAPISecret: ExoscaleAPIEnvironment:api ExoscaleAPIZone:ch-gva-2 CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: ResolveServiceLoadBalancerHostname:false RFC2136Host: RFC2136Port:0 RFC2136Zone: RFC2136Insecure:false RFC2136GSSTSIG:false RFC2136KerberosRealm: RFC2136KerberosUsername: RFC2136KerberosPassword: RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false RFC2136MinTTL:0s RFC2136BatchChangeSize:50 NS1Endpoint: NS1IgnoreSSL:false NS1MinTTLSeconds:0 TransIPAccountName: TransIPPrivateKeyFile: DigitalOceanAPIPageSize:50 ManagedDNSRecordTypes:[A AAAA CNAME] ExcludeDNSRecordTypes:[] GoDaddyAPIKey: GoDaddySecretKey: GoDaddyTTL:0 GoDaddyOTE:false OCPRouterName: IBMCloudProxied:false IBMCloudConfigFile:/etc/kubernetes/ibmcloud.json TencentCloudConfigFile:/etc/kubernetes/tencent-cloud.json TencentCloudZoneType: PiholeServer: PiholePassword: PiholeTLSInsecureSkipVerify:false PluralCluster: PluralProvider: WebhookProviderURL:http://localhost:8888 WebhookProviderReadTimeout:5s WebhookProviderWriteTimeout:10s WebhookServer:false}"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Instantiating new Kubernetes client"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Using inCluster-config based on serviceaccount-token"
external-dns-664c9c75b5-4rffw time="2024-03-22T06:30:18Z" level=info msg="Created Kubernetes client https://172.20.0.1:443"
Killing the pod, or deleting and recreating the deployment, doesn't solve the problem.
@TLmaK0: You can't reopen an issue/PR unless you authored it or you are a collaborator.
What happened:
external-dns quietly stops executing, does not error, and does not recover until the pod is manually deleted.
What you expected to happen:
Either for external-dns to continue executing as normal, or for it to error and register the pod as unhealthy, prompting a replacement.
How to reproduce it (as minimally and precisely as possible):
Unable to reproduce consistently; the issue is intermittent.
Anything else we need to know?:
Originally we thought we might have been hitting an API limit with AWS, so we added
--aws-zones-cache-duration=24h
as the zones do not change in our environment. This has made no difference, however.
Environment:
external-dns version (external-dns --version): v20230327-v0.13.4
txtPrefix: "registry-"
policy: sync
extraArgs: [ "--aws-zones-cache-duration=24h" ]
logLevel: debug
resources:
  limits:
    cpu: 100m
    memory: 100Mi
  requests:
    cpu: 100m
    memory: 50Mi
podSecurityContext:
  fsGroup: 65534
securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
image:
  repository: XXXXXXXXXX.dkr.ecr.eu-west-1.amazonaws.com/k8s.gcr.io/external-dns/external-dns
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXXX:role/external-dns
domainFilters: [ "example.zone" ]
time="2023-04-27T00:25:19Z" level=debug msg="Using cached zones list"
time="2023-04-27T00:25:19Z" level=debug msg="Adding external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-cname-external-dns-test-rzosxxpsexpluhg.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=debug msg="Adding registry-cname-external-dns-test-gbmhqlmrmvtxmgc.example.zone. to zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE external-dns-test-gbmhqlmrmvtxmgc.example.zone CNAME [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE registry-cname-external-dns-test-gbmhqlmrmvtxmgc.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: DELETE registry-external-dns-test-gbmhqlmrmvtxmgc.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE external-dns-test-rzosxxpsexpluhg.example.zone CNAME [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE registry-cname-external-dns-test-rzosxxpsexpluhg.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="Desired change: CREATE registry-external-dns-test-rzosxxpsexpluhg.example.zone TXT [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX]"
time="2023-04-27T00:25:19Z" level=info msg="6 record(s) in zone example.zone. [Id: /hostedzone/ZXXXXXXXXXXXXXXXXXXXX] were successfully updated"
time="2023-04-27T00:26:20Z" level=debug msg="Using cached zones list"
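Since the log simply goes quiet when the stall happens, one external workaround is to watch the age of the newest log entry. The following is a rough sketch, assuming the `time="..."` timestamp format and the roughly one-minute cycle visible in the log above. The function names and the threshold of three missed cycles are hypothetical choices for this example, not anything external-dns provides.

```python
import re
from datetime import datetime, timezone
from typing import List, Optional

# Matches the time="..." field of each log entry shown above.
LOG_TIME = re.compile(r'time="([^"]+)"')


def parse_times(text: str) -> List[datetime]:
    """Extract every entry timestamp from raw log text as UTC datetimes."""
    return [
        datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        for stamp in LOG_TIME.findall(text)
    ]


def seconds_since_last_entry(text: str, now: datetime) -> Optional[float]:
    """Age of the newest log entry in seconds, or None if there are no entries."""
    times = parse_times(text)
    if not times:
        return None
    return (now - max(times)).total_seconds()


def looks_stalled(text: str, now: datetime,
                  interval_s: float = 60, max_missed: int = 3) -> bool:
    """Treat the process as stalled if it has been silent for several cycles.

    The interval and the number of tolerated missed cycles are assumptions.
    """
    age = seconds_since_last_entry(text, now)
    return age is None or age > interval_s * max_missed
```

Feeding this the recent output of `kubectl logs` from a cron job or sidecar would at least flag the silent stall. It only works when the log level is verbose enough to emit a line every cycle, as the debug-level log above does.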