pulumi / pulumi-gcp

A Google Cloud Platform (GCP) Pulumi resource package, providing multi-language access to GCP
Apache License 2.0

timeout importing recordset #375

Closed martaver closed 4 years ago

martaver commented 4 years ago

I have some dns records I'm trying to import with pulumi, and they fail somewhat bluntly with this error:

Diagnostics:
  gcp:dns:RecordSet (root/my.domain./NS):
    error: Preview failed: refreshing urn:pulumi:root::root::gcp:dns/recordSet:RecordSet::root/my.domain./NS: Error when reading or editing DNS Record Set "my.domain.": Get "https://www.googleapis.com/dns/v1beta2/projects/root-280012/managedZones/root/rrsets?alt=json&name=my.domain.&prettyPrint=false&type=NS": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

  pulumi:pulumi:Stack (root-root):
    error: preview failed

I'm just getting started with pulumi, so I have no real sense of whether this is a GCP-specific problem or more general with pulumi, so apologies if this is in the wrong place.

Is this just a case of increasing a timeout limit? Is this a problem with the CLI? Why would this particular request time out? (It times out on every attempt.)

Appreciate any advice!

martaver commented 4 years ago

Can't create them either:

Diagnostics:
  gcp:dns:RecordSet (root/my.domain./NS):
    error: Error retrieving record sets for "root": Get "https://www.googleapis.com/dns/v1beta2/projects/root-280012/managedZones/root/rrsets?alt=json&prettyPrint=false": Post "https://oauth2.googleapis.com/token": net/http: TLS handshake timeout (Client.Timeout exceeded while awaiting headers)

  gcp:dns:RecordSet (root/my.domain./MX):
    error: Error retrieving record sets for "root": Get "https://www.googleapis.com/dns/v1beta2/projects/root-280012/managedZones/root/rrsets?alt=json&prettyPrint=false": Post "https://oauth2.googleapis.com/token": net/http: TLS handshake timeout (Client.Timeout exceeded while awaiting headers)

  pulumi:pulumi:Stack (root-root):
    error: update failed

Timeouts... timeouts everywhere... I seem to be getting a lot of these...

martaver commented 4 years ago

Solved using customTimeouts: https://www.pulumi.com/docs/intro/concepts/programming-model/#customtimeouts

martaver commented 4 years ago

To follow up, I'm still getting timeouts here, regardless of setting customTimeouts, with varying failure points, all immediately after 1m has elapsed.

customTimeouts only sets timeouts for create/update/delete operations... but where does an import get its timeout from?

Maybe consider re-opening this issue and digging deeper? @leezen

martaver commented 4 years ago

Actually the timeout errors also persist on create operations, and sometimes even fail before reaching 1m. e.g.

Diagnostics:
  gcp:dns:RecordSet (root/www.my.domain./CNAME):
    error: Error retrieving record sets for "root": Get "https://www.googleapis.com/dns/v1beta2/projects/root-280012/managedZones/root/rrsets?alt=json&prettyPrint=false": Post "https://oauth2.googleapis.com/token": net/http: TLS handshake timeout (Client.Timeout exceeded while awaiting headers)

  pulumi:pulumi:Stack (root-root):
    error: update failed

Resources:
    9 unchanged

Duration: 38s

This is a major blocker for us in pulumi, because there doesn't seem to be a reasonable workaround... we can't reliably import infrastructure, nor can we re-create it. If there isn't a way to move beyond this, we'll be forced to fall back to terraform.

leezen commented 4 years ago

Sorry -- I closed this issue after you said you had solved it. Could you please provide the code you're running?

martaver commented 4 years ago

Sure:

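import * as gcp from '@pulumi/gcp';

// ORG_TLD is defined elsewhere in the original program (not shown).
// The zone exists and is imported correctly...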
new gcp.dns.ManagedZone('root', {
    dnsName: `${ORG_TLD}.`,
    name: 'root',
    description: `...` // truncated
});

// Trying to create (because import failed with timeouts) a CNAME record:
// Other records have succeeded with the same managedZone name.
new gcp.dns.RecordSet(`root/my.domain./CNAME`, {
    managedZone: 'root',
    name: 'www.my.domain.',
    type: 'CNAME',
    ttl: 300,
    rrdatas: [
        'my.domain.'
    ]
}, {
    customTimeouts: {
        create: '5m',
        update: '5m',
        delete: '5m'
    }
})

Causes:

Updating (root):
     Type                  Name                      Status                  Info
     pulumi:pulumi:Stack   root-root                 **failed**              1 error
 +   └─ gcp:dns:RecordSet  root/my.domain./CNAME  **creating failed**     1 error

Diagnostics:
  pulumi:pulumi:Stack (root-root):
    error: update failed

  gcp:dns:RecordSet (root/my.domain./CNAME):
    error: Error retrieving record sets for "root": Get "https://www.googleapis.com/dns/v1beta2/projects/root-280012/managedZones/root/rrsets?alt=json&prettyPrint=false": Post "https://oauth2.googleapis.com/token": net/http: TLS handshake timeout (Client.Timeout exceeded while awaiting headers)

Resources:
    9 unchanged

Duration: 35s

leezen commented 4 years ago

The errors don't read like timeout issues from a resource creation perspective or anything pulumi itself is doing, but purely errors from a networking perspective. Have you verified you're able to talk to the Google APIs from this environment without these timeouts? It'd be great if you could also please take a look at verbose logs to see if there are any clues as to what's going on -- https://www.pulumi.com/docs/troubleshooting/#verbose-logging

martaver commented 4 years ago

I actually have no problem loading those URLs from curl or from browser... it's literally just the net/http client that seems to have problems. Similar errors in terraform too.

leezen commented 4 years ago

I wonder if this is related to https://github.com/terraform-providers/terraform-provider-google/issues/5008 which shows similar symptoms. In which case, have you tried setting the requestTimeout on the provider itself?
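For readers landing here, a minimal sketch of what setting requestTimeout on an explicit provider could look like in TypeScript. The provider name 'gcp-long-timeout' and the '5m' value are assumptions chosen for illustration; the project ID and record set are copied from the snippets and error messages earlier in this thread. Treat it as a starting point under those assumptions, not a confirmed fix.

```typescript
import * as gcp from '@pulumi/gcp';

// Explicit provider with a longer per-request HTTP timeout.
// The name 'gcp-long-timeout' and the '5m' value are illustrative assumptions.
const provider = new gcp.Provider('gcp-long-timeout', {
    project: 'root-280012', // project ID as it appears in the error messages above
    requestTimeout: '5m',
});

// A resource only uses this provider if it is passed via the `provider` option.
new gcp.dns.RecordSet('root/my.domain./CNAME', {
    managedZone: 'root',
    name: 'www.my.domain.',
    type: 'CNAME',
    ttl: 300,
    rrdatas: ['my.domain.'],
}, { provider });
```

Note that requestTimeout applies to individual HTTP requests made by the provider, whereas customTimeouts bounds a whole create, update, or delete operation, so the two settings are not interchangeable.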

martaver commented 4 years ago

Hi @leezen, sorry I didn't get back to you on this until now. Your suggestion about requestTimeout seems to have worked. I'm confused as to why, though, because the errors are always described as TLS Handshake Timeout errors. I get this error on SOME (but not all) long-running GCP requests. These operations take a long time, but then again, so do other operations that don't fail with a TLS Handshake Timeout error.

The errors aren't intermittent, and aren't related to connectivity (I spent quite a lot of time confirming this). They seem to be tied to specific types of resources, i.e. specific resource types produce timeouts with specific types of operations. E.g. creating record sets and deleting a cluster.

The problem with customTimeouts is that they only set the timeouts for the specific atomic operations -- create, update and delete. But I suspect there are many other kinds of requests that can happen in and around these.

It could be a specific problem that the window for establishing a secure TLS tunnel is too short because some resources take longer to respond than others (maybe GCP has to do more work for these resources to authenticate them)...

It could also be a more general request timeout: some long-running requests may have default timeouts that are set too low, with the failure incorrectly reported as a TLS Handshake Timeout.

Either way, I think it would be useful if the default timeout configured for the terraform gcp provider by pulumi was higher.

It's strange that nobody else seems to be really reporting these kinds of errors though.

leezen commented 4 years ago

"They seem to be tied to specific types of resources, i.e. specific resource types produce timeouts with specific types of operations. E.g. creating record sets and deleting a cluster."

Right -- per the issue linked, it seems that certain APIs are synchronous and can take a long time. I'm also surprised we don't hear more about these kinds of errors, but that also means I'm not really inclined to alter the timeouts unless we hear otherwise.

I'm going to close the issue since it sounds like the issue is resolved.

martaver commented 4 years ago

Yeah makes sense. Maybe a big fat warning in the documentation might help :D


martaver commented 4 years ago

Out of curiosity, what would be the negative impact of increasing the default timeout?

leezen commented 4 years ago

@martaver Having errors surface back up more quickly vs. having the user wait to find out there's a problem

martaver commented 4 years ago

You could argue it's not actually a problem though, it's just how those APIs are... there’s not anything ‘wrong’ a developer needs to fix...


leezen commented 4 years ago

I don't mean there's anything wrong with those APIs. In the 'normal' case, increasing the timeout has no effect. In the 'error' case, my point was more that if there were something wrong for whatever reason, having a longer timeout could mean waiting longer to find out that there's been an issue.

martaver commented 4 years ago

Ah... yeah good point.

What's the default timeout, by the way? Any idea?

martaver commented 4 years ago

More TLS Handshake issues.

This time deleting a gcp:container:NodePool...

  gcp:container:NodePool (cluster-node-pool-dgraph):
    error: deleting urn:pulumi:development::environment::gcp:container/nodePool:NodePool::cluster-node-pool-dgraph: Get "https://container.googleapis.com/v1beta1/projects/sightful-development/locations/europe-north1-a/clusters/cluster-0f129b9/nodePools/cluster-node-pool-dgraph-5742ed1?alt=json&prettyPrint=false": Post "https://oauth2.googleapis.com/token": net/http: TLS handshake timeout (Client.Timeout exceeded while awaiting headers)

This time I've set gcp:request_timeout and all customTimeouts on the resource itself to '240s'.

I still get the error and, what's more, the settings don't even affect the time-to-failure. It seems like all the timeouts I've configured are being ignored by whatever part of this is timing out.

martaver commented 4 years ago

@leezen these issues are actually quite disruptive... my current workflow is something like 1-2 hours of good productive work, and then the rest of the day trying to find workarounds to these TLS Handshake issues... usually having to manually delete resources and sync state to it etc.

Not to point the finger at pulumi... Terraform is also plagued by the same difficulties. Having trouble elegantly handling these long-running requests seems to be a common theme. I'm really crying out for an IAC framework that is just completely rock solid and reliable.

Wish pulumi would take up the sword...?

martaver commented 4 years ago

Now with creating a project...

  gcp:organizations:Project (sightful-root):
    error: failed pre-requisites: failed to check permissions on billing account "billingAccounts/<redacted>": Post "https://cloudbilling.googleapis.com/v1/billingAccounts/<redacted>:testIamPermissions?alt=json&prettyPrint=false": Post "https://oauth2.googleapis.com/token": net/http: TLS handshake timeout (Client.Timeout exceeded while awaiting headers)

This could be a clue, actually: in this organisation I don't have the permissions to view the billing account.

Could it be that these TLS Handshake errors are actually permissions errors from GCP, and that they are being surfaced with an incorrect or incomplete error message?

martaver commented 4 years ago

I gave myself billing administrator rights, which removed the restrictions I had before, but the TLS Handshake timeout errors persist...

martaver commented 4 years ago

Trying to set billingAccount on a manually created and imported project:

error: 1 error occurred:
        * updating urn:pulumi:development::root::gcp:organizations/project:Project::sightful-root: Error setting billing account "<redacted>" for project "projects/sightful-root": googleapi: Error 400: Precondition check failed., failedPrecondition

I tried to link this billing account in the UI, and got a more descriptive error - my quota for projects under the billing account was reached. I looked and found some old zombie projects someone had created, deleted them, and the TLS error went away.

Seems to point to errors being obscured away as 'TLS Handshake' errors...

Also, now I'm getting the same TLS Handshake error trying to delete RecordSets... gcp:request_timeout and customTimeouts all 240s make no difference.

E.g.

gcp:dns:RecordSet (root/sightful.dev./TXT):
    error: deleting urn:pulumi:root::root::gcp:dns/recordSet:RecordSet::root/sightful.dev./TXT: Error when reading or editing google_dns_record_set: Post "https://www.googleapis.com/dns/v1beta2/projects/root-280012/managedZones/root/changes?alt=json&prettyPrint=false": Post "https://oauth2.googleapis.com/token": net/http: TLS handshake timeout (Client.Timeout exceeded while awaiting headers)

martaver commented 4 years ago

On reflection, I'm really starting to suspect that many other errors are being surfaced as TLS Handshake errors, without any further details.
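One way to narrow this down is to read the error chain itself. Go's HTTP client formats a failed request as `Op "URL": cause`, so each nested `Get "..."` or `Post "..."` segment names a request, and the last one is the request that actually failed. In most of the errors in this thread, that innermost request is the POST to https://oauth2.googleapis.com/token rather than the DNS or container API call. A minimal sketch of pulling that chain apart follows; parseErrorChain is a hypothetical helper written for this illustration and is not part of Pulumi or the GCP provider.

```typescript
// One HTTP request named in a Go-style nested error message.
interface RequestStep {
    method: string; // e.g. "Get" or "Post", as printed by Go
    url: string;
}

interface ErrorChain {
    steps: RequestStep[]; // outermost request first, failing request last
    rootCause: string;    // text remaining after the last request segment
}

// Hypothetical helper: splits a message of the form
//   <prefix>: Get "url1": Post "url2": <cause>
// into its request segments and the innermost cause.
function parseErrorChain(message: string): ErrorChain {
    const stepPattern = /\b(Get|Post|Put|Patch|Delete|Head) "([^"]+)": /g;
    const steps: RequestStep[] = [];
    let lastEnd = 0;
    let match: RegExpExecArray | null;
    while ((match = stepPattern.exec(message)) !== null) {
        steps.push({ method: match[1], url: match[2] });
        lastEnd = match.index + match[0].length;
    }
    // With no request segments, the whole message is treated as the cause.
    return { steps, rootCause: message.slice(lastEnd).trim() };
}

// Error text copied from the RecordSet deletion failure above.
const sample =
    'Error when reading or editing google_dns_record_set: ' +
    'Post "https://www.googleapis.com/dns/v1beta2/projects/root-280012/managedZones/root/changes?alt=json&prettyPrint=false": ' +
    'Post "https://oauth2.googleapis.com/token": ' +
    'net/http: TLS handshake timeout (Client.Timeout exceeded while awaiting headers)';

const chain = parseErrorChain(sample);
const failing = chain.steps[chain.steps.length - 1];
console.log(`failing request: ${failing.method} ${failing.url}`);
console.log(`root cause: ${chain.rootCause}`);
```

If the failing request is consistently the token endpoint, that would suggest looking at credential refresh and the network path to oauth2.googleapis.com before assuming the resource API itself is slow. That reading is an inference from the message format, not something confirmed in this thread.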