pganalyze / collector

pganalyze statistics collector for gathering PostgreSQL metrics and log data
https://pganalyze.com

Documentation/support for Google Cloud SQL Auth Proxy or Connector #348

Open jawnsy opened 1 year ago

jawnsy commented 1 year ago

Summary

Support the Cloud SQL Auth Proxy or Connector.

Background

The current setup instructions for Cloud SQL use password authentication:

CREATE USER pganalyze WITH PASSWORD 'waJzOTWiQQjCZLr4' CONNECTION LIMIT 5;

Cloud SQL instances that use the (somewhat misleadingly-named) Require SSL setting do not support password authentication; mutual TLS with client certificates must be used instead (from the referenced page):

Enforcing SSL/TLS encryption as described below will make mutual TLS required for all clients

The CIS Benchmark for Google Cloud Platform, version 1.3.0, also recommends enabling this flag, so anyone running compliance scans will see this as a finding.

For anyone using Require SSL, it is often convenient to use the Cloud SQL Auth Proxy, which handles generating and rotating ephemeral certificates using an IAM identity. This can run as a sidecar and is compatible with pganalyze-collector as it exists today. Additionally, enabling the Connector Enforcement feature requires connections to the database to use the proxy or connector library.

While it is still possible to use password authentication with the auth proxy, it is less convenient than using IAM authentication, and IAM permissions are necessary anyways to issue the ephemeral certificate and establish a connection in the first place.

For customers running Cloud SQL with IAM Authentication, and running the pganalyze-collector using Cloud Run or GKE with Workload Identity, it is convenient to use IAM authentication, since it means that passwords will not need periodic rotation, and it also provides a stronger form of workload identity.
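
For reference, IAM database authentication also requires a matching database user on the Cloud SQL instance. The command below is a sketch (INSTANCE_NAME is a placeholder; note that the user name drops the .gserviceaccount.com suffix):

    gcloud sql users create GSA_NAME@GSA_PROJECT.iam \
     --instance=INSTANCE_NAME \
     --type=cloud_iam_service_account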

Approach 1: Cloud SQL Auth Proxy

When running in Cloud Run, Google has an option to enable connections to databases using their private addresses. I do not have experience using this method, but I'd expect that it effectively runs a Cloud SQL proxy inside the application container, so I'd expect it to support the same authentication (using the IAM service account name, e.g. pganalyze-collector@project.iam).

When running in GKE, users can follow the Workload Identity steps to create a service account and grant it the necessary access. Rough instructions:

  1. Create a Google service account

    gcloud iam service-accounts create GSA_NAME \
     --project=GSA_PROJECT
  2. Create an IAM binding on the project to grant Cloud SQL permissions for the service account - particularly Cloud SQL Instance User (required for IAM authentication to the database) and Cloud SQL Client (required to use the proxy/connector to issue certificates):

    gcloud projects add-iam-policy-binding PROJECT_ID \
     --member "serviceAccount:GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com" \
     --role "ROLE_NAME"

    where ROLE_NAME should be roles/cloudsql.instanceUser and then roles/cloudsql.client (you will need to run the above command twice)

    Documentation: https://cloud.google.com/sdk/gcloud/reference/projects/add-iam-policy-binding

  3. Create an IAM binding on the Pub/Sub Subscription object to grant Pub/Sub permissions for the service account - particularly Pub/Sub Subscriber and Pub/Sub Viewer.

    gcloud pubsub subscriptions add-iam-policy-binding SUBSCRIPTION \
     --member "serviceAccount:GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com" \
     --role "ROLE_NAME"

    where ROLE_NAME should be roles/pubsub.subscriber and then roles/pubsub.viewer (sometimes Subscriber is not necessary, depending on what the code is doing, but I have not checked the source nor tried it yet; Viewer seems safe enough to grant anyways)

Documentation: https://cloud.google.com/sdk/gcloud/reference/pubsub/subscriptions/add-iam-policy-binding

  4. Create a Kubernetes service account with an appropriate annotation to use this identity (note: this assumes you already did the steps of configuring the workload identity pool and enabling workload identity on your GKE cluster):

    apiVersion: v1
    kind: ServiceAccount
    metadata:
     name: pganalyze
     annotations:
       iam.gke.io/gcp-service-account: "GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com"
  5. Create a Deployment using this service account, which includes the proxy as a sidecar; these are roughly the settings we're using:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
     name: pganalyze
    spec:
     selector:
       matchLabels:
         app: pganalyze
     revisionHistoryLimit: 0
     replicas: 1
     template:
      metadata:
        labels:
          app: pganalyze
          version: v1
      spec:
        serviceAccountName: pganalyze
        containers:
          - image: quay.io/pganalyze/collector:stable
            name: pganalyze
            env:
              - name: GCP_PUBSUB_SUBSCRIPTION
                value: projects/PROJECT_ID/subscriptions/SUBSCRIPTION_ID
              - name: GCP_CLOUDSQL_INSTANCE_ID
                value: project-id:us-east1:instance
              - name: PGA_API_KEY
                value: xxx
              - name: DB_USERNAME
                value: GSA_NAME@GSA_PROJECT.iam
              - name: DB_HOST
                value: 127.0.0.1
              - name: DB_PORT
                value: "5432"
              - name: DB_NAME
                value: postgres
            resources:
              requests:
                cpu: 250m
                memory: 250Mi
              limits:
                cpu: 500m
                memory: 500Mi
          - name: cloud-sql-proxy
            image: gcr.io/cloudsql-docker/gce-proxy:1.33.1-alpine
            command:
              - "/cloud_sql_proxy"
              - "-enable_iam_login"
              - "-ip_address_types=PRIVATE"
              - "-structured_logs"
              - "-term_timeout=10s"
              - "-use_http_health_check"
              - "-health_check_port=8090"
              - "-instances=project-id:us-east1:instance=tcp:5432"
            resources:
              requests:
                cpu: 20m
                memory: 32Mi
              limits:
                cpu: 100m
                memory: 128Mi
            livenessProbe:
              httpGet:
                path: /liveness
                port: 8090
              initialDelaySeconds: 10
              periodSeconds: 10
              timeoutSeconds: 5
              failureThreshold: 1
            readinessProbe:
              httpGet:
                path: /readiness
                port: 8090
              initialDelaySeconds: 5
              periodSeconds: 10
              timeoutSeconds: 5
              successThreshold: 1
              failureThreshold: 1
            startupProbe:
              httpGet:
                path: /startup
                port: 8090
              initialDelaySeconds: 5
              periodSeconds: 5
              timeoutSeconds: 5
              failureThreshold: 4
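
To try this out, the manifests above can be applied and checked with standard kubectl commands (the file names here are placeholders):

    kubectl apply -f serviceaccount.yaml -f deployment.yaml
    # watch the proxy sidecar start up and begin accepting connections
    kubectl logs deployment/pganalyze -c cloud-sql-proxy -f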

Approach 2: Cloud SQL Connector

The Cloud SQL Auth Proxy is written in Go, and its core functionality lives in a library that Google also publishes separately as the cloud-sql-go-connector library.

The connector should be relatively straightforward to integrate, and it allows applications to connect directly to the database without the need for a sidecar, as described in the documentation: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector#postgres

In this case, pganalyze-collector should be able to use the instance ID to discover the IP address to connect to, automatically handle issuing and rotating certificates, etc. This could be done using the existing GCP_CLOUDSQL_INSTANCE_ID variable that the collector already requires.
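
To illustrate (this is a sketch under my own assumptions, not the collector's actual code, and the connection details are placeholders), wiring up the connector with IAM authentication might look roughly like this:

    package main

    import (
        "context"
        "database/sql"
        "log"

        "cloud.google.com/go/cloudsqlconn"
        "cloud.google.com/go/cloudsqlconn/postgres/pgxv4"
    )

    func main() {
        // Register a database/sql driver that dials through the connector,
        // using automatic IAM authentication instead of a password.
        cleanup, err := pgxv4.RegisterDriver("cloudsql-postgres",
            cloudsqlconn.WithIAMAuthN())
        if err != nil {
            log.Fatal(err)
        }
        defer cleanup()

        // The host is the instance connection name, mirroring what the
        // existing GCP_CLOUDSQL_INSTANCE_ID setting already contains.
        db, err := sql.Open("cloudsql-postgres",
            "host=project-id:us-east1:instance user=pganalyze-collector@project-id.iam dbname=postgres")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if err := db.PingContext(context.Background()); err != nil {
            log.Fatal(err)
        }
    }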

The proxy sidecar approach may still be preferable because it simplifies collection of metrics, but the preferred approach varies by company and team. Note also that we're using Cloud SQL Auth Proxy v1, but there's a v2 in preview, in which case the flags are slightly different.
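
For comparison, a roughly equivalent v2 invocation would look like the following (flag names taken from the v2 preview docs, so they may still change before GA):

    cloud-sql-proxy \
     --auto-iam-authn \
     --private-ip \
     --structured-logs \
     --health-check \
     --http-port 8090 \
     --port 5432 \
     project-id:us-east1:instance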

lfittl commented 1 year ago

@jawnsy Thanks for the detailed write-up, really appreciate this! (and I'll make a note to see if we can revise the docs to cover this more directly)

One question for my understanding:

The proxy sidecar approach may still be preferable because it simplifies collection of metrics, but the preferred approach varies by company and team. Note also that we're using Cloud SQL Auth Proxy v1, but there's a v2 in preview, in which case the flags are slightly different.

You mention "it simplifies collection of metrics" - could you clarify which metrics you were referring to?

jawnsy commented 1 year ago

Google has two versions of the auth proxy, and their new version (currently in preview) has support for some Prometheus metrics and tracing, intended for observability of the proxy itself (otherwise, failures can only be seen in logs, which are a lot harder to work with, especially because their earlier version emitted unstructured or semi-structured logs).

At Prefect, we are currently using cloud-sql-proxy v1 because it is GA, while v2 is still in preview. The risk probably isn't significant because the proxy/library is a relatively thin wrapper that handles refreshing certificates periodically, but nonetheless, we stick to GA services for anything in the critical path of our production service. I don't have visibility into their planned v2 GA date, but I'd expect it to be within the next 6-12 months.

Approach 2 has the benefit of being sidecarless/proxyless, and if you don't care about those proxy metrics or don't use Prometheus, it would work great. The Go library exposes the same metrics, so with some work in pganalyze-collector, you could expose them too, possibly alongside your own metrics, but of course this would be Google-specific and more implementation effort on your end.