Remove entries which caused provider API request failure in a batch change request.

awx-fuyuanchu commented 1 year ago

What would you like to be added: I'm using the Google provider and have an issue updating DNS records via external-dns. After some investigation, we found that googleapi returned an error saying it failed to update a record:

time="2023-01-11T06:29:22Z" level=error msg="googleapi: Error 409: The resource 'entity.change.additions[2]' named 'xxxxxxxxxxx. (TXT)' already exists, alreadyExists"

However, changes are grouped into batches before being sent to the provider. If one entry in a batch is invalid, the whole batch fails to update the provider.

Here is the log we found on GCP that shows the request containing batch changes failed.

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 6
    },
    "authenticationInfo": {
      "principalEmail": "sa@xxxxxx",
      "serviceAccountDelegationInfo": [
        {
          "principalSubject": "serviceAccount:sa@xxx"
        }
      ],
      "principalSubject": "serviceAccount:sa@xxx"
    },
    "requestMetadata": {
      "callerIp": "gce-internal-ip",
      "callerSuppliedUserAgent": "google-api-go-client/0.5,gzip(gfe),gzip(gfe)",
      "requestAttributes": {
        "time": "2023-01-11T06:20:10.361036Z",
        "auth": {}
      },
      "destinationAttributes": {}
    },
    "serviceName": "dns.googleapis.com",
    "methodName": "dns.changes.create",
    "authorizationInfo": [
      {
        "permission": "dns.resourceRecordSets.create",
        "granted": true,
        "resourceAttributes": {}
      }
    ],
    "resourceName": "managedZones/example-com",
    "request": {
      "@type": "type.googleapis.com/cloud.dns.api.ChangesCreateRequest",
      "managedZone": "example-com",
      "change": {
        "deletions": [..........],
        "additions": [..........]
      },
      "project": "dnszone"
    },
    "response": {
      "@type": "type.googleapis.com/cloud.dns.api.ChangesCreateResponse"
    }
  },
  "insertId": "ak3c2te8h1pw",
  "resource": {
    "type": "dns_managed_zone",
    "labels": {
      "zone_name": "example-com",
      "project_id": "dnszone",
      "location": "global"
    }
  },
  "timestamp": "2023-01-11T06:20:10.330139Z",
  "severity": "ERROR",
  "logName": "",
  "receiveTimestamp": "2023-01-11T06:20:11.371460049Z"
}

So I'm requesting a feature: external-dns could identify the entries that break the request and remove them from the batch in the next loop, perhaps with a separate loop to handle the invalid entries.

Why is this needed: With this feature, an invalid record won't block the other records.
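
To make the request concrete, here is a minimal Go sketch of how a failing entry might be identified from the error text quoted above. It is not existing external-dns code: the names `failingEntry`, `parseFailingEntry`, and `dropIndex` are hypothetical, and it assumes the `entity.change.additions[N]` wording in the message is stable (and that a `deletions[N]` form exists), which the API does not guarantee.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// failingEntry identifies the entry of a batch that the provider rejected.
// Hypothetical type, for illustration only.
type failingEntry struct {
	List  string // "additions" or "deletions"
	Index int    // zero-based position inside that list
}

// entryRe matches the resource path seen in the error message from the issue.
// Assumption: the message wording is stable enough to parse.
var entryRe = regexp.MustCompile(`entity\.change\.(additions|deletions)\[(\d+)\]`)

// parseFailingEntry extracts the list name and index from an error message.
// It reports false when the message does not name a batch entry.
func parseFailingEntry(msg string) (failingEntry, bool) {
	m := entryRe.FindStringSubmatch(msg)
	if m == nil {
		return failingEntry{}, false
	}
	idx, err := strconv.Atoi(m[2])
	if err != nil {
		return failingEntry{}, false
	}
	return failingEntry{List: m[1], Index: idx}, true
}

// dropIndex returns a copy of items without the element at position i.
// Out-of-range indexes leave the input unchanged.
func dropIndex(items []string, i int) []string {
	if i < 0 || i >= len(items) {
		return items
	}
	out := make([]string, 0, len(items)-1)
	out = append(out, items[:i]...)
	return append(out, items[i+1:]...)
}

func main() {
	msg := "googleapi: Error 409: The resource 'entity.change.additions[2]' " +
		"named 'xxxxxxxxxxx. (TXT)' already exists, alreadyExists"
	// Placeholder record names standing in for a real batch.
	additions := []string{"a.example.com.", "b.example.com.", "c.example.com.", "d.example.com."}

	if e, ok := parseFailingEntry(msg); ok && e.List == "additions" {
		additions = dropIndex(additions, e.Index)
	}
	fmt.Println(additions) // the batch without the rejected entry
}
```

Note that removing the entry for the next loop means remembering it between loops, so a real implementation would need somewhere to keep that information.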

amitai-devops commented 1 year ago

I have the same issue

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

awx-fuyuanchu commented 1 year ago

bump. any thoughts?

awx-fuyuanchu commented 1 year ago

/remove-lifecycle stale

szuecs commented 1 year ago

This would mean that we would have state, which is bad in general. I think we do something in the aws provider to work around the problem. I am not sure anymore, but I think we split the change in half and try both; one will succeed and the other will fail. The next iterations should fix the next quarter, so we converge to a good state. People have tried multiple times to fall back to single-entry changes, but this quickly consumes all API quotas and you end up being rate limited. I think we should do a binary-search-style apply in general. I can review a PR if someone creates a change.
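
The binary-search-style apply described above could look roughly like the following Go sketch. It is an illustration under assumptions, not the aws provider's code: `submitFunc`, `applyBisect`, and `failOn` are hypothetical names, batch entries are plain strings, and the sketch recurses inside one call instead of halving across successive reconcile iterations as the comment describes.

```go
package main

import (
	"errors"
	"fmt"
)

// submitFunc stands in for a provider call that applies a whole batch
// atomically and fails if any single entry is rejected. Hypothetical type.
type submitFunc func(batch []string) error

// applyBisect submits the batch as a whole. If that fails, it splits the
// batch in half and retries each half, recursing until the rejected entries
// are isolated. It returns the entries that could not be applied and the
// number of submit calls made.
func applyBisect(batch []string, submit submitFunc) (failed []string, calls int) {
	if len(batch) == 0 {
		return nil, 0
	}
	calls = 1
	if err := submit(batch); err == nil {
		return nil, calls
	}
	if len(batch) == 1 {
		// A single entry that still fails is the culprit.
		return []string{batch[0]}, calls
	}
	mid := len(batch) / 2
	leftFailed, leftCalls := applyBisect(batch[:mid], submit)
	rightFailed, rightCalls := applyBisect(batch[mid:], submit)
	failed = append(failed, leftFailed...)
	failed = append(failed, rightFailed...)
	return failed, calls + leftCalls + rightCalls
}

// failOn returns a fake submit function that rejects any batch containing
// the entry bad, mimicking one invalid record poisoning a batch.
func failOn(bad string) submitFunc {
	return func(batch []string) error {
		for _, e := range batch {
			if e == bad {
				return errors.New("alreadyExists: " + bad)
			}
		}
		return nil
	}
}

func main() {
	batch := []string{"r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"}
	failed, calls := applyBisect(batch, failOn("r5"))
	fmt.Println(failed, calls) // prints "[r5] 7"
}
```

With one bad entry, the number of calls grows roughly with the logarithm of the batch size instead of linearly, which is the quota argument against falling back to single-entry changes. No state is carried between runs; the failing entry is simply rediscovered each time.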

k8s-triage-robot commented 1 year ago

/lifecycle stale

lmoze-windscribe commented 11 months ago

/remove-lifecycle stale

k8s-triage-robot commented 7 months ago

/lifecycle stale

k8s-triage-robot commented 6 months ago

/lifecycle rotten

lmoze-windscribe commented 6 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 months ago

/lifecycle stale

k8s-triage-robot commented 2 months ago

/lifecycle rotten

lmoze-windscribe commented 2 months ago

/remove-lifecycle rotten

kubernetes-sigs / external-dns

Remove entries which caused provider API request failure in a batch change request. #3307