googleapis / cloud-profiler-nodejs

Node.js library for Google Cloud Profiler. Continuous CPU and heap profiling to improve performance and reduce costs.
https://cloud.google.com/profiler/
Apache License 2.0

Authentication on GKE #463

Closed montmanu closed 5 years ago

montmanu commented 5 years ago

It appears that the keyFilename configuration option is ignored when attempting to authenticate to the Profiler API on GKE.

When viewing the API usage by Credential, only the Compute Engine default service account appears to be interacting with the API.

The agent is logging error messages similar to the following:

@google-cloud/profiler Failed to create profile, waiting 12m 33.3s to try again: Error: The caller does not have permission

More details:

node --version
# v8.15.1
npm ls @google-cloud/profiler
# @google-cloud/profiler@1.1.2
uname -a
# Linux renderer-56769d9869-rl48v 4.14.89+ #1 SMP Wed Jan 9 13:35:00 PST 2019 x86_64 Linux

gcloud container clusters describe $CLUSTER_NAME --zone $CLUSTER_ZONE --format json | \
  jq '{ currentMasterVersion, currentNodeVersion, initialClusterVersion, location, locations, loggingService, monitoringService, nodeConfig: { oauthScopes: .nodeConfig.oauthScopes } }'

{
  "currentMasterVersion": "1.12.6-gke.10",
  "currentNodeVersion": "1.12.5-gke.5",
  "initialClusterVersion": "1.8.4-gke.0",
  "location": "us-east1-b",
  "locations": [
    "us-east1-b",
    "us-east1-c",
    "us-east1-d"
  ],
  "loggingService": "logging.googleapis.com/kubernetes",
  "monitoringService": "monitoring.googleapis.com/kubernetes",
  "nodeConfig": {
    "oauthScopes": [
      "https://www.googleapis.com/auth/bigquery",
      "https://www.googleapis.com/auth/cloud-platform",
      "https://www.googleapis.com/auth/cloud.useraccounts",
      "https://www.googleapis.com/auth/cloud.useraccounts.readonly",
      "https://www.googleapis.com/auth/cloud_debugger",
      "https://www.googleapis.com/auth/compute",
      "https://www.googleapis.com/auth/compute.readonly",
      "https://www.googleapis.com/auth/datastore",
      "https://www.googleapis.com/auth/devstorage.full_control",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/devstorage.read_write",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/monitoring.write",
      "https://www.googleapis.com/auth/pubsub",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/source.full_control",
      "https://www.googleapis.com/auth/source.read_only",
      "https://www.googleapis.com/auth/sqlservice",
      "https://www.googleapis.com/auth/sqlservice.admin",
      "https://www.googleapis.com/auth/taskqueue",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/userinfo.email"
    ]
  }
}

kalyanac commented 5 years ago

@nolanmar511 can you take a look?

nolanmar511 commented 5 years ago

@JustinBeckwith -- Most of @google-cloud/profiler's authentication is handled through @google-cloud/common (and then through google-auth-library). The keyFilename field of @google-cloud/profiler's options comes from GoogleAuthOptions. Do you know what could be happening?

(Possibly interestingly, GoogleAuthOptions has both a keyFile and a keyFilename field, which surprised me a bit.)

JustinBeckwith commented 5 years ago

👋 @montmanu can you provide a code snippet of how you're trying to use the profiler?

montmanu commented 5 years ago

Sure thing!

Here is an overview of how we went about integrating the profiler:

  1. Enabled the API
gcloud services enable cloudprofiler.googleapis.com
  2. Installed the NPM package
npm install --save @google-cloud/profiler
  3. Imported the NPM package and started the agent
// ...
import * as profilerAgent from '@google-cloud/profiler';
// ...

/**
 * keyFilename == "/etc/secrets/sd-profiler-agent-key.json"
 * logLevel == 4
 * projectId == "my-project-id"
 * serviceContext.service == "renderer"
 * serviceContext.version == "bffadda8ab32b1a236bfeb9456fa43c5308a2597"
 */
import profilerAgentOptions from './config/observability/profiler-agent';

// ...
profilerAgent.start(profilerAgentOptions);
// ...
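
For readers following along, here is a minimal sketch of what a config module like ./config/observability/profiler-agent might look like. It is not the actual module from this issue. Only CLOUD_PROFILER_KEY_FILE appears in the thread (in the Pod spec below); the function name buildProfilerAgentOptions and the other environment variable names (PROFILER_LOG_LEVEL, GCP_PROJECT_ID, SERVICE_NAME, SERVICE_VERSION) are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the options object passed to profilerAgent.start().
// Only CLOUD_PROFILER_KEY_FILE comes from the thread; the other variable names
// are assumptions for this sketch.
function buildProfilerAgentOptions(env) {
  const keyFilename = env.CLOUD_PROFILER_KEY_FILE;
  if (!keyFilename) {
    // Fail loudly instead of silently falling back to default credentials.
    throw new Error('CLOUD_PROFILER_KEY_FILE is not set');
  }
  return {
    keyFilename,
    logLevel: Number(env.PROFILER_LOG_LEVEL || 4),
    projectId: env.GCP_PROJECT_ID,
    serviceContext: {
      service: env.SERVICE_NAME,
      version: env.SERVICE_VERSION,
    },
  };
}
```

Throwing when the variable is missing is a design choice for the sketch: it makes an unset key path visible at startup rather than leaving the agent to authenticate some other way.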

Regarding the host environment, the container is built using node:8-alpine@sha256:8e9987a6d91d783c56980f1bd4b23b4c05f9f6076d513d6350fef8fe09ed01fd as the base image. That base image is extended with the following utilities:

# ...
RUN \
  apk add --update --no-cache bind-tools curl alpine-sdk
# ...

Here is the relevant npm install log output:

> pprof@0.2.0 install /dist/node_modules/pprof
> node-pre-gyp install --fallback-to-build

node-pre-gyp WARN Using request for node-pre-gyp https download 
[pprof] Success: "/dist/node_modules/pprof/build/node-v57-linux-x64-musl/pprof.node" is installed via remote

> protobufjs@6.8.8 postinstall /dist/node_modules/protobufjs
> node scripts/postinstall

The Service Account's key is stored in a Kubernetes secret and mounted into the container as a volume. Here is a selection from a sample Pod configuration:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: renderer
    cluster: us-east1
    env: stg
    namespace: default
    project: hybrid
    revision: bffadda
  name: renderer-65c9b59b55-wnwk6
  namespace: default
spec:
  containers:
  - env:
    # ...
    - name: CLOUD_PROFILER_KEY_FILE
      value: /etc/secrets/sd-profiler-agent-key.json
    # ...
    name: renderer
    # ...
    volumeMounts:
    - mountPath: /etc/secrets
      name: secrets
      readOnly: true
    # ...
  volumes:
  # ...
  - name: secrets
    secret:
      defaultMode: 420
      secretName: renderer-secrets
  # ...

I have validated the contents of /etc/secrets/sd-profiler-agent-key.json on the file system within a running container.

The Service Account has the following IAM Roles applied:

Let me know if you need any additional information.

montmanu commented 5 years ago

not sure if it is relevant, but we are using a few other APM agents:

/**
 * This module should be loaded at the application's entry point
 * order matters here ...
 *
 * 1. trace agent
 * 2. profiler agent
 * 3. error reporting agent
 * 4. debug agent
 */
import * as traceAgent from '@google-cloud/trace-agent';
import * as profilerAgent from '@google-cloud/profiler';
import { ErrorReporting } from '@google-cloud/error-reporting';
import * as debugAgent from '@google-cloud/debug-agent';
// ...
traceAgent.start(traceAgentOptions);
// ...
if (true) {
  profilerAgent.start(profilerAgentOptions);
}
// ...
if (true) {
  errorReportingAgent = new ErrorReporting(errorReportingOptions);
}
// ...
if (true) {
  debugAgent.start(debugAgentOptions);
}
// ...

JustinBeckwith commented 5 years ago

This looks like it should work. @nolanmar511 I can't seem to npm install on OSX, so it's very hard for me to test this :/

nolanmar511 commented 5 years ago

For OSX, a few additional dependencies are required, but the profiling agent should still work. https://github.com/nodejs/node-gyp#installation

nolanmar511 commented 5 years ago

Starting to experiment with this.

To test, I had two projects (I'll call them A and B). I created a key for project A to use Stackdriver Profiler's agents. I then created a GCE VM in project B and ran some Node.js with the profiling agent.

So, snippet for starting the profiling agent:

require('@google-cloud/profiler').start({
  keyFilename: "sd-profiler-key-for-project-A.json",
  projectId: "project-id-for-profile-a",
  serviceContext: { service: "service" },
  logLevel: 4,
});

With this, I was able to collect and upload profiles from project B's GCE VM into project A. So, "keyFilename" does work with profiler.

I'm a bit puzzled. @google-cloud/profiler, @google-cloud/trace-agent, and @google-cloud/debug-agent all use @google-cloud/common in the same way for authentication (and the latest versions of @google-cloud/profiler and @google-cloud/trace-agent both depend on @google-cloud/common version 0.31.X). So, this would have to be a GKE/profiler-specific problem, and I don't quite see how that would happen.

Next step is to try this on GKE.

montmanu commented 5 years ago

Thanks for digging in. I started to try using google-auth-library directly with a limited test case: something like the following, run after kubectl exec-ing into a running container.

/** @see https://github.com/googleapis/google-auth-library-nodejs/blob/master/samples/keyfile.js */
const {auth} = require('google-auth-library');

/**
 * Acquire a client, make a request to an API that the Service Account has permissions to access
 */
(function (){
  async function main(keyFile) {
    const client = await auth.getClient({
      keyFile: keyFile,
      scopes: 'https://www.googleapis.com/auth/monitoring',
    });
    const projectId = await auth.getProjectId();
    const url = `https://cloudprofiler.googleapis.com/v2/profiles`;
    const res = await client.request({url});
    console.log('Profiler Info:');
    console.log(res.data);
  }

  main(process.env.CLOUD_PROFILER_KEY_FILE).catch(console.error);
})();

I'm sort of guessing wrt the actual Profiler API request details.. only had time to track down the baseUrl value for the API .. I tried sending a few GET requests with several variations in the path / params / etc, but was unable to successfully list any profiles.. all 4xx..

nolanmar511 commented 5 years ago

I haven't been able to reproduce this on GKE when specifying keyFilename and trying to upload to the same project or when specifying keyFilename and trying to upload to a different project.

I have reproduced the error message (Error: The caller does not have permission) when the key file isn't right (for example, when I tried to use a key created with the role of Stackdriver Profiler User instead of the role Stackdriver Profiler Agent; or when I tried to use a key made for project A to upload to project B).

It's possible I just haven't figured out how to reproduce this, but I'd like to rule out other potential issues.

Is it possible the key file isn't associated with the project you're trying to upload profiles to, or possible that the service account doesn't have the Stackdriver Profiler Agent role?

montmanu commented 5 years ago

thanks again!

Unfortunately, the key file appears to be correct and the service account appears to have the Stackdriver Profiler Agent role applied :/

kubectl exec -it renderer-5d74495b6f-pchkg -c renderer -- cat /etc/secrets/sd-profiler-agent-key.json
{
  "type": "service_account",
  "project_id": "my-project-id",
  "private_key_id": "SNIP",
  "private_key": "-----BEGIN PRIVATE KEY-----\nSNIP\n-----END PRIVATE KEY-----\n",
  "client_email": "sd-profiler-agent@my-project-id.iam.gserviceaccount.com",
  "client_id": "SNIP",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/sd-profiler-agent%40my-project-id.iam.gserviceaccount.com"
}

[Screenshot: Screen Shot 2019-04-22 at 4 25 41 PM]

Agree with you though.. it's strange that the other APM libraries authenticate successfully using the metadata service defaults and that this library had problems.. especially if they are sharing the same underlying auth dependencies.

This particular project may have been an EAP participant for the Profiler product.. is it possible that something about that EAP participation is affecting the use of the GA API?

montmanu commented 5 years ago

fwiw, this same configuration is working as expected in a different project (different clusters, different Service Account, and therefore, different key.. but same IAM policy, APM integration details).. the project where this is working would likely not have been an EAP participant which may give a bit more weight to that EAP related hypothesis..

nolanmar511 commented 5 years ago

@aalexand -- Could a project having been part of EAP impact authentication?

aalexand commented 5 years ago

@nolanmar511 I can't think of how it could.

nolanmar511 commented 5 years ago

@montmanu -- you indicated only the Compute Engine default service account appears to be interacting with the API. Was it unexpected that the Compute Engine default service account interacted with the API? Could something else be using that token?

Based on my experiments, my assumption is that the profiling agent is using the file specified by keyFilename, but that that key file doesn't grant the necessary permissions.

Would it be possible to delete and re-create the key file?

nolanmar511 commented 5 years ago

Another possible guess: Based on your comments above, it looks like you specify the projectId in the configuration. But, just in case that's not the case, specifying the projectId in the configuration (and, to be overly specific, ensuring that that project id matches the project id in the key file) might help.

Similarly, specifying the exact project ID in your example using google-auth-library, rather than using await auth.getProjectId() could be helpful...

I mention this because if, somehow, the project id specified in the configuration and the project id in the key file don't match, the "The caller does not have permission" error definitely appears.
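
A quick way to rule this out from inside the container is to compare the key file's project_id with the configured projectId. The sketch below is only an illustration: the function names checkKeyMatchesProject and checkKeyFile are hypothetical, and it relies solely on the type and project_id fields visible in the key file posted earlier in this thread. A mismatch is not necessarily wrong, since a service account can be granted roles in another project, but it is worth flagging.

```javascript
const fs = require('fs');

// Compare a parsed service account key with the configured project ID.
// Returns a list of human-readable problems; an empty list means no mismatch was found.
function checkKeyMatchesProject(key, projectId) {
  const problems = [];
  if (key.type !== 'service_account') {
    problems.push(`unexpected key type: ${key.type}`);
  }
  if (key.project_id !== projectId) {
    problems.push(
      `key project_id "${key.project_id}" does not match configured projectId "${projectId}"`
    );
  }
  return problems;
}

// Read the key file from disk and run the same check.
function checkKeyFile(keyFilename, projectId) {
  const key = JSON.parse(fs.readFileSync(keyFilename, 'utf8'));
  return checkKeyMatchesProject(key, projectId);
}
```

For example, running checkKeyFile(process.env.CLOUD_PROFILER_KEY_FILE, 'my-project-id') in the container and logging the result would show whether the two IDs agree. It does not verify IAM roles, which still have to be checked separately.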

montmanu commented 5 years ago

ok thanks. yes, i can definitely delete / re-create that key and re-test. will also confirm that the projectId is correct. will follow up once that is complete.

montmanu commented 5 years ago

So I have not yet had a chance to delete / re-create the SA key, but I have noticed new error messages related to auth in a couple of the other APM agents being used:

@google-cloud/debug-agent Failed to re-register debuggee 163243153602: Error: Unexpected error determining execution environment: request to http://metadata.google.internal./computeMetadata/v1/instance failed, reason: getaddrinfo EAI_AGAIN metadata.google.internal.:80
ERROR:@google-cloud/error-reporting: Unable to find credential information on instance. This library will be unable to communicate with the Stackdriver API to save errors. Message: Unexpected error determining execution environment: request to http://metadata.google.internal/computeMetadata/v1/instance/ failed, reason: getaddrinfo EAI_AGAIN metadata.google.internal:80
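
EAI_AGAIN from getaddrinfo indicates a temporary DNS resolution failure, so these two errors point at name resolution for the metadata host rather than at credentials. A minimal sketch for checking this from inside the container follows. The function name checkHostResolves is hypothetical, and the lookup parameter is injectable only so the function can be exercised without network access.

```javascript
const dns = require('dns');

// Resolve a hostname and report whether a failure looks transient.
// `lookup` defaults to dns.lookup, which uses getaddrinfo like the failing requests did.
function checkHostResolves(hostname, lookup = dns.lookup) {
  return new Promise((resolve) => {
    lookup(hostname, (err, address) => {
      if (err) {
        // EAI_AGAIN means the resolver reported a temporary failure.
        resolve({ok: false, code: err.code, transient: err.code === 'EAI_AGAIN'});
      } else {
        resolve({ok: true, address});
      }
    });
  });
}
```

Calling checkHostResolves('metadata.google.internal').then(console.log) a few times in a row would show whether resolution fails consistently or only intermittently.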

This cluster has node auto-updates enabled, so the cluster details have changed slightly from when this issue was created:

{
  "currentMasterVersion": "1.12.7-gke.7",
  "currentNodeVersion": "1.12.7-gke.7",
  "initialClusterVersion": "1.8.4-gke.0",
  "location": "us-east1-b",
  "locations": [
    "us-east1-b",
    "us-east1-c",
    "us-east1-d"
  ],
  "loggingService": "logging.googleapis.com/kubernetes",
  "monitoringService": "monitoring.googleapis.com/kubernetes",
  "nodeConfig": {
    "oauthScopes": [
      "https://www.googleapis.com/auth/bigquery",
      "https://www.googleapis.com/auth/cloud-platform",
      "https://www.googleapis.com/auth/cloud.useraccounts",
      "https://www.googleapis.com/auth/cloud.useraccounts.readonly",
      "https://www.googleapis.com/auth/cloud_debugger",
      "https://www.googleapis.com/auth/compute",
      "https://www.googleapis.com/auth/compute.readonly",
      "https://www.googleapis.com/auth/datastore",
      "https://www.googleapis.com/auth/devstorage.full_control",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/devstorage.read_write",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/monitoring.write",
      "https://www.googleapis.com/auth/pubsub",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/source.full_control",
      "https://www.googleapis.com/auth/source.read_only",
      "https://www.googleapis.com/auth/sqlservice",
      "https://www.googleapis.com/auth/sqlservice.admin",
      "https://www.googleapis.com/auth/taskqueue",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/userinfo.email"
    ]
  }
}

namely, currentMasterVersion and currentNodeVersion have both changed to 1.12.7-gke.7

nolanmar511 commented 5 years ago

@montmanu -- Have you had a chance to re-create the SA key? Also, should this be moved to @google-cloud/common or google-auth-library, if authentication is impacting multiple agents?

nolanmar511 commented 5 years ago

At this point, I'm closing this issue.

I don't think it's actionable for profiler without further information, and it sounds like the problem may not be profiler-specific.

Feel free to re-open with additional context.