Closed andersy005 closed 10 months ago
Could NCEI be blocking us because we are constantly crawling their servers to download the same dataset over and over with hundreds of simultaneous requests? 😆 Just a thought...
Upon finding that any connection to any site other than google.com timed out, we ruled this out.
$ curl google.com -I
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Wed, 19 Oct 2022 22:26:31 GMT
Expires: Fri, 18 Nov 2022 22:26:31 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
$ curl github.com -I
curl: (28) Failed to connect to github.com port 80 after 129506 ms: Connection timed out
As a test, I created a regular virtual machine to ensure that this issue did not affect only Dataflow VMs:
$ gcloud compute instances create test --no-address
$ gcloud compute ssh test
and curl to google.com worked:
andersy005@test:~$ curl -v google.com
* Trying 108.177.112.102:80...
* Connected to google.com (108.177.112.102) port 80 (#0)
> GET / HTTP/1.1
> Host: google.com
> User-Agent: curl/7.74.0
> Accept: */*
andersy005@test:~$ curl -v github.com
* Trying 140.82.114.3:80...
* connect to 140.82.114.3 port 80 failed: Connection timed out
* Failed to connect to github.com port 80: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to github.com port 80: Connection timed out
It appears that the firewall rules did not allow outbound connections. Additionally, the following error occurs when GCP attempts to assign an external IP address to the instance:
Constraint constraints/compute.vmExternalIpAccess violated for project projectID. Add instance <project> to the constraint to use external IP with it
This is why I had to use the --no-address flag explicitly in
$ gcloud compute instances create test --no-address
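The constraint named in that error can be inspected from the CLI. A minimal sketch, assuming the older resource-manager org-policy surface; PROJECT_ID is a placeholder for the real project ID:

```shell
# Sketch: show whether constraints/compute.vmExternalIpAccess is enforced
# for this project (PROJECT_ID is a placeholder).
gcloud resource-manager org-policies describe \
    compute.vmExternalIpAccess --project=PROJECT_ID
```

If the policy lists specific allowed instances, a VM would need to be added to that allowlist before it could receive an external IP.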
Whether this policy is new or whether the Dataflow VMs have always been set up without an external IP address is unclear to me.
Columbia's GCP organization policy forbids VMs from having external IPs (and always has). That's why pangeo-forge-runner's DataflowBakery defaults to use_public_ips=False for Dataflow jobs. Were the jobs that caused this error for some reason not using this default?
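For reference, that worker option corresponds to Beam's --no_use_public_ips flag when launching a pipeline from the command line. A hedged sketch; pipeline.py, PROJECT_ID, and BUCKET are placeholders, not the actual pangeo-forge invocation:

```shell
# Sketch: submit a Beam pipeline to Dataflow with workers that have no
# external IPs (the CLI spelling of use_public_ips=False).
python pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --temp_location=gs://BUCKET/tmp \
    --no_use_public_ips
```

With this flag set, workers depend entirely on a Cloud NAT (or Private Google Access) for any outbound traffic.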
@yuvipanda created a cloud NAT to allow the VMs to reach the public internet as a temporary fix.
From the very beginning of our Dataflow usage, we've had a NAT in us-central1 for VMs to connect through. Did that NAT go down, or why was it necessary to create an additional one?
At least as of early October, all Dataflow jobs needed to be created within the default us-central1 region, because that's the only place we were running a NAT. As of then, jobs created outside of us-central1 would fail with connectivity issues, because the VMs wouldn't have had a NAT to connect through.
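Standing up a NAT in an additional region involves creating a Cloud Router and attaching a NAT configuration to it. An illustrative sketch, not the exact commands that were run; the router/NAT names, network, and REGION are placeholders:

```shell
# Sketch: give VMs without external IPs an outbound path in REGION.
# 1) Create a Cloud Router on the network the VMs use.
gcloud compute routers create nat-router \
    --network=default --region=REGION
# 2) Attach a NAT config that covers all subnets in the region.
gcloud compute routers nats create nat-config \
    --router=nat-router --region=REGION \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
```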
Recently, we started experiencing Dataflow failures. The failures seem to be caused by connectivity issues between the Dataflow runners and the external services they are trying to access: the VMs couldn't reach the public internet, for reasons that we don't understand. For instance, running a curl command within a VM results in a connection timeout.
The reason for these connectivity issues is still unknown, to the best of my knowledge.