pganalyze / collector

pganalyze statistics collector for gathering PostgreSQL metrics and log data
https://pganalyze.com

Documentation/support for Google Cloud SQL Auth Proxy or Connector #348

Open jawnsy opened 1 year ago

jawnsy commented 1 year ago

Summary

Support the Cloud SQL Auth Proxy or Connector.

Background

The current setup instructions for Cloud SQL use password authentication:

CREATE USER pganalyze WITH PASSWORD 'waJzOTWiQQjCZLr4' CONNECTION LIMIT 5;

Cloud SQL instances that use the (somewhat misleadingly-named) Require SSL setting do not support password authentication; mutual TLS with client certificates must be used instead (from the referenced page):

Enforcing SSL/TLS encryption as described below will make mutual TLS required for all clients

The CIS Benchmark for Google Cloud Platform, version 1.3.0, also recommends enabling this flag, so anyone running compliance scans will see this as a finding.

For anyone using Require SSL, it is often convenient to use the Cloud SQL Auth Proxy, which handles generating and rotating ephemeral certificates using an IAM identity. This can run as a sidecar and is compatible with pganalyze-collector as it exists today. Additionally, enabling the Connector Enforcement feature requires connections to the database to use the proxy or connector library.

While it is still possible to use password authentication with the auth proxy, it is less convenient than using IAM authentication, and IAM permissions are necessary anyways to issue the ephemeral certificate and establish a connection in the first place.

For customers running Cloud SQL with IAM Authentication, and running the pganalyze-collector using Cloud Run or GKE with Workload Identity, it is convenient to use IAM authentication, since it means that passwords will not need periodic rotation, and it also provides a stronger form of workload identity.
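
For reference, IAM database authentication also requires a matching database user on the Cloud SQL instance. The command below is a sketch (INSTANCE_NAME is a placeholder; note that the user name drops the .gserviceaccount.com suffix):

    gcloud sql users create GSA_NAME@GSA_PROJECT.iam \
     --instance=INSTANCE_NAME \
     --type=cloud_iam_service_account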

Approach 1: Cloud SQL Auth Proxy

When running in Cloud Run, Google has an option to enable connections to databases using their private addresses. I do not have experience using this method, but I'd expect that it effectively runs a Cloud SQL proxy inside the application container, so I'd expect it to support the same authentication (using the IAM service account name, e.g. pganalyze-collector@project.iam).

When running in GKE, users can follow the Workload Identity steps to create a service account and grant it the necessary access. Rough instructions:

  1. Create a Google service account

    gcloud iam service-accounts create GSA_NAME \
     --project=GSA_PROJECT
  2. Create an IAM binding on the project to grant Cloud SQL permissions for the service account - particularly Cloud SQL Instance User (required for IAM authentication to the database) and Cloud SQL Client (required to use the proxy/connector to issue certificates):

    gcloud projects add-iam-policy-binding PROJECT_ID \
     --member "serviceAccount:GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com" \
     --role "ROLE_NAME"

    where ROLE_NAME should be roles/cloudsql.instanceUser and then roles/cloudsql.client (you will need to run the above command twice)

    Documentation: https://cloud.google.com/sdk/gcloud/reference/projects/add-iam-policy-binding

  3. Create an IAM binding on the Pub/Sub Subscription object to grant Pub/Sub permissions for the service account - particularly Pub/Sub Subscriber and Pub/Sub Viewer.

    gcloud pubsub subscriptions add-iam-policy-binding SUBSCRIPTION \
     --member "serviceAccount:GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com" \
     --role "ROLE_NAME"

    where ROLE_NAME should be roles/pubsub.subscriber and then roles/pubsub.viewer (sometimes Subscriber is not necessary, depending on what the code is doing, but I have not checked the source nor tried it yet; Viewer seems safe enough to grant anyways)

Documentation: https://cloud.google.com/sdk/gcloud/reference/pubsub/subscriptions/add-iam-policy-binding

  4. Create a Kubernetes service account with an appropriate annotation to use this identity (note: this assumes you already did the steps of configuring the workload identity pool and enabling workload identity on your GKE cluster):

    apiVersion: v1
    kind: ServiceAccount
    metadata:
     name: pganalyze
     annotations:
       iam.gke.io/gcp-service-account: "GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com"
  5. Create a Deployment using this service account, which includes the proxy as a sidecar; these are roughly the settings we're using:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
     name: pganalyze
    spec:
     selector:
       matchLabels:
         app: pganalyze
     revisionHistoryLimit: 0
     replicas: 1
     template:
      metadata:
        labels:
          app: pganalyze
          version: v1
      spec:
        serviceAccountName: pganalyze
        containers:
          - image: quay.io/pganalyze/collector:stable
            name: pganalyze
            env:
              - name: GCP_PUBSUB_SUBSCRIPTION
                value: projects/PROJECT_ID/subscriptions/SUBSCRIPTION_ID
              - name: GCP_CLOUDSQL_INSTANCE_ID
                value: project-id:us-east1:instance
              - name: PGA_API_KEY
                value: xxx
              - name: DB_USERNAME
                value: GSA_NAME@GSA_PROJECT.iam
              - name: DB_HOST
                value: 127.0.0.1
              - name: DB_PORT
                value: "5432"
              - name: DB_NAME
                value: postgres
            resources:
              requests:
                cpu: 250m
                memory: 250Mi
              limits:
                cpu: 500m
                memory: 500Mi
          - name: cloud-sql-proxy
            image: gcr.io/cloudsql-docker/gce-proxy:1.33.1-alpine
            command:
              - "/cloud_sql_proxy"
              - "-enable_iam_login"
              - "-ip_address_types=PRIVATE"
              - "-structured_logs"
              - "-term_timeout=10s"
              - "-use_http_health_check"
              - "-health_check_port=8090"
              - "-instances=project-id:us-east1:instance=tcp:5432"
            resources:
              requests:
                cpu: 20m
                memory: 32Mi
              limits:
                cpu: 100m
                memory: 128Mi
            livenessProbe:
              httpGet:
                path: /liveness
                port: 8090
              initialDelaySeconds: 10
              periodSeconds: 10
              timeoutSeconds: 5
              failureThreshold: 1
            readinessProbe:
              httpGet:
                path: /readiness
                port: 8090
              initialDelaySeconds: 5
              periodSeconds: 10
              timeoutSeconds: 5
              successThreshold: 1
              failureThreshold: 1
            startupProbe:
              httpGet:
                path: /startup
                port: 8090
              initialDelaySeconds: 5
              periodSeconds: 5
              timeoutSeconds: 5
              failureThreshold: 4
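
To try this out, the manifests above can be applied and checked with standard kubectl commands (the file names here are placeholders):

    kubectl apply -f serviceaccount.yaml -f deployment.yaml
    # watch the proxy sidecar start up and begin accepting connections
    kubectl logs deployment/pganalyze -c cloud-sql-proxy -f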

Approach 2: Cloud SQL Connector

The Cloud SQL Auth Proxy is written in Go, and its core functionality lives in a library that Google also publishes separately as the cloud-sql-go-connector library.

The connector should be relatively straightforward to integrate, and it allows applications to connect directly to the database without the need for a sidecar, as described in the documentation: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector#postgres

In this case, pganalyze-collector should be able to use the instance ID to discover the IP address to connect to, automatically handle issuing and rotating certificates, etc. This could be done using the existing GCP_CLOUDSQL_INSTANCE_ID variable that the collector already requires.
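
To illustrate (this is a sketch under my own assumptions, not the collector's actual code, and the connection details are placeholders), wiring up the connector with IAM authentication might look roughly like this:

    package main

    import (
        "context"
        "database/sql"
        "log"

        "cloud.google.com/go/cloudsqlconn"
        "cloud.google.com/go/cloudsqlconn/postgres/pgxv4"
    )

    func main() {
        // Register a database/sql driver that dials through the connector,
        // using automatic IAM authentication instead of a password.
        cleanup, err := pgxv4.RegisterDriver("cloudsql-postgres",
            cloudsqlconn.WithIAMAuthN())
        if err != nil {
            log.Fatal(err)
        }
        defer cleanup()

        // The host is the instance connection name, mirroring what the
        // existing GCP_CLOUDSQL_INSTANCE_ID setting already contains.
        db, err := sql.Open("cloudsql-postgres",
            "host=project-id:us-east1:instance user=pganalyze-collector@project-id.iam dbname=postgres")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if err := db.PingContext(context.Background()); err != nil {
            log.Fatal(err)
        }
    }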

The proxy sidecar approach may still be preferable because it simplifies collection of metrics, but the preferred approach varies by company and team. Note also that we're using Cloud SQL Auth Proxy v1, but there's a v2 in preview, in which case the flags are slightly different.
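
For comparison, a roughly equivalent v2 invocation would look like the following (flag names taken from the v2 preview docs, so they may still change before GA):

    cloud-sql-proxy \
     --auto-iam-authn \
     --private-ip \
     --structured-logs \
     --health-check \
     --http-port 8090 \
     --port 5432 \
     project-id:us-east1:instance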

lfittl commented 1 year ago

@jawnsy Thanks for the detailed write-up, really appreciate this! (and I'll make a note to see if we can revise the docs to cover this more directly)

One question for my understanding:

The proxy sidecar approach may still be preferable because it simplifies collection of metrics, but the preferred approach varies by company and team. Note also that we're using Cloud SQL Auth Proxy v1, but there's a v2 in preview, in which case the flags are slightly different.

You mention "it simplifies collection of metrics" - could you clarify which metrics you were referring to?

jawnsy commented 1 year ago

Google has two versions of the auth proxy, and their new version (currently in preview) has support for some Prometheus metrics and tracing, intended for observability of the proxy itself (otherwise, failures can only be seen in logs, which are a lot harder to work with, especially because their earlier version emitted unstructured or semi-structured logs).

At Prefect, we are currently using cloud-sql-proxy v1 because it is GA, while v2 is still in preview. The risk probably isn't significant because the proxy/library is a relatively thin wrapper that handles refreshing certificates periodically, but nonetheless, we stick to GA services for anything in the critical path of our production service. I don't have visibility into their planned v2 GA date, but I'd expect it to be within the next 6-12 months.

Approach 2 has the benefit of being sidecarless/proxyless, and if you don't care about those proxy metrics or don't use Prometheus, it would work great. The Go library exposes the same metrics, so with some work in pganalyze-collector, you could expose them too, possibly alongside your own metrics, but of course this would be Google-specific and more implementation effort on your end.