pangeo-forge / pangeo-forge-orchestrator

Database API and GitHub App backend for Pangeo Forge Cloud.
https://api.pangeo-forge.org/docs
Apache License 2.0

Dataflow runner connectivity issues on GCP #169

Closed andersy005 closed 10 months ago

andersy005 commented 2 years ago

Recently, we started experiencing Dataflow failures. They seem to be caused by connectivity issues between the Dataflow workers and the external services they are trying to reach: the VMs cannot reach the public internet, for reasons we don't yet understand. For instance, running a curl command within a VM results in a connection timeout:

$ curl -v -4 https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc
*   Trying 205.167.25.178:443...
* connect to 205.167.25.178 port 443 failed: Connection timed out

@yuvipanda created a Cloud NAT as a temporary fix, to allow the VMs to reach the public internet. To the best of my knowledge, the root cause of these connectivity issues is still unknown.
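For reference, a Cloud NAT like the temporary one mentioned above is created as a NAT config attached to a Cloud Router. This is only a sketch: the router/NAT names, network, and region below are illustrative assumptions, not the actual resources @yuvipanda created.

```shell
# Sketch only: resource names, network, and region are assumptions.
# 1. A Cloud Router is required to host the NAT config.
gcloud compute routers create pangeo-forge-router \
  --network=default \
  --region=us-central1

# 2. Attach a NAT covering all subnets, so VMs without external
#    IPs can still make outbound connections.
gcloud compute routers nats create pangeo-forge-nat \
  --router=pangeo-forge-router \
  --region=us-central1 \
  --nat-all-subnet-ip-ranges \
  --auto-allocate-nat-external-ips
```

Note that a NAT is regional: it only serves VMs in subnets of the region where its router lives.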

rabernat commented 2 years ago

Could NCEI be blocking us because we are constantly crawling their servers to download the same dataset over and over with hundreds of simultaneous requests? 😆 Just a thought...

andersy005 commented 2 years ago

> Could NCEI be blocking us because we are constantly crawling their servers to download the same dataset over and over with hundreds of simultaneous requests? 😆 Just a thought...

We ruled this out upon finding that connections to any site other than google.com also timed out:

$ curl google.com -I
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Wed, 19 Oct 2022 22:26:31 GMT
Expires: Fri, 18 Nov 2022 22:26:31 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
$ curl github.com -I
curl: (28) Failed to connect to github.com port 80 after 129506 ms: Connection timed out

As a test, I created a regular (non-Dataflow) VM to confirm that this issue was not limited to Dataflow VMs:

$ gcloud compute instances create test --no-address
$ gcloud compute ssh test

and curl to google.com worked:

andersy005@test:~$ curl -v google.com
*   Trying 108.177.112.102:80...
* Connected to google.com (108.177.112.102) port 80 (#0)
> GET / HTTP/1.1
> Host: google.com
> User-Agent: curl/7.74.0
> Accept: */*

andersy005@test:~$ curl -v github.com
*   Trying 140.82.114.3:80...
* connect to 140.82.114.3 port 80 failed: Connection timed out
* Failed to connect to github.com port 80: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to github.com port 80: Connection timed out

andersy005 commented 2 years ago

It appears that the firewall rules did not allow outbound connections. Additionally, the following error occurs when GCP attempts to assign an external IP address to the instance:

Constraint constraints/compute.vmExternalIpAccess violated for project projectID. Add instance <project> to the constraint to use external IP with it

This is why I had to pass the --no-address flag explicitly:

$ gcloud compute instances create test --no-address

Whether this policy is new, or whether the Dataflow VMs have always been set up without an external IP address, is unclear to me.
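For anyone debugging this later: the effective org policy behind that `compute.vmExternalIpAccess` error can be inspected directly. The project ID below is a placeholder.

```shell
# Show the effective compute.vmExternalIpAccess constraint for the
# project (placeholder project ID; requires appropriate IAM perms).
gcloud resource-manager org-policies describe \
  compute.vmExternalIpAccess \
  --project=my-gcp-project \
  --effective
```

If the policy lists an `allowedValues` set, instances can be added to it; if it's a blanket deny, only a NAT (or an org-level policy change) gets VMs to the public internet.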

cisaacstern commented 1 year ago

> Whether this policy is new, or whether the Dataflow VMs have always been set up without an external IP address, is unclear to me.

Columbia's GCP organization policy forbids VMs from having external IPs (and always has). That's why pangeo-forge-runner's DataflowBakery defaults to use_public_ips=False for Dataflow jobs. Were the jobs that caused this error for some reason not using this default?
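One way to answer that question empirically would be to check whether any running Dataflow worker VMs actually received an external IP. The label filter below is an assumption about how Dataflow labels its workers, so treat this as a sketch:

```shell
# List Dataflow worker VMs and any external IP they were assigned.
# An empty natIP column means use_public_ips=False took effect.
# The label filter is an assumption about Dataflow's worker labels.
gcloud compute instances list \
  --filter="labels.dataflow_job_id:*" \
  --format="table(name, networkInterfaces[0].accessConfigs[0].natIP)"
```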

> @yuvipanda created a cloud NAT to allow the VMs to reach the public internet as a temporary fix.

From the very beginning of our Dataflow usage, we've had a NAT in us-central1 for VMs to connect through. Did that NAT go down, or why was it necessary to create an additional one?

At least as of early October, all Dataflow jobs needed to be created in the default us-central1 region, because that was the only region where we were running a NAT. Jobs created outside of us-central1 would fail with connectivity issues, because those VMs wouldn't have had a NAT to connect through.
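As a quick sanity check of which regions actually have a NAT, the Cloud Routers and their attached NAT configs can be listed per region. The router name in the second command is a placeholder:

```shell
# List Cloud Routers in the region where Dataflow jobs run; a NAT
# must be attached to one of them for no-external-IP VMs to get out.
gcloud compute routers list --regions=us-central1

# Then inspect the NATs attached to a given router (placeholder name).
gcloud compute routers nats list \
  --router=my-router \
  --region=us-central1
```

An empty result for a region would explain exactly the failure mode described above: VMs there have no external IP and no NAT, so every outbound connection times out.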