signalfx / splunk-otel-lambda

Splunk distribution of OpenTelemetry Lambda
Apache License 2.0

Inject SPLUNK_ACCESS_TOKEN as secret #129

Open dude0001 opened 1 year ago

dude0001 commented 1 year ago

In our environment, we are asked not to put the ingestion token in the SPLUNK_ACCESS_TOKEN environment variable as plaintext, because anyone who can describe the Lambda through the AWS console or APIs can read it. To work around this, we created our own Lambda layer containing a Lambda execution wrapper that wraps the Splunk-provided wrapper. Our wrapper expects an AWS Secrets Manager ARN as an environment variable; it fetches the secret, parses out the token, and sets the SPLUNK_ACCESS_TOKEN environment variable. It then calls the Splunk wrapper to continue as normal.
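
For concreteness, a simplified sketch of what that wrapper does (the variable name SPLUNK_TOKEN_SECRET_ARN and the "token" JSON key are illustrative only, not our actual implementation):

#!/usr/bin/env python3
# Simplified sketch of an exec-wrapper layer (registered via AWS_LAMBDA_EXEC_WRAPPER).
# Assumes the secret ARN arrives in a hypothetical SPLUNK_TOKEN_SECRET_ARN variable
# and that the secret value is a JSON object with a "token" key.
import json
import os
import sys

import boto3


def main() -> None:
    secret_arn = os.environ["SPLUNK_TOKEN_SECRET_ARN"]
    secret = boto3.client("secretsmanager").get_secret_value(SecretId=secret_arn)
    token = json.loads(secret["SecretString"])["token"]

    # Only this process and the runtime it exec's below see this variable;
    # an extension that has already started does not.
    os.environ["SPLUNK_ACCESS_TOKEN"] = token

    # Hand off to the Splunk-provided wrapper / runtime exactly as Lambda invoked us.
    os.execvp(sys.argv[1], sys.argv[1:])


if __name__ == "__main__":
    main()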

The change in #114 has broken this flow for us. It looks like the OTEL collector starts up before our own wrapper is able to execute and set the environment variable.

Is there a way we can delay the OTEL collector starting up? Is there another way to keep the token secret and out of the AWS Lambda console as plaintext?

Or could a mechanism be added to the Lambda layer that fetches the token from a secret whose ARN is passed in as an environment variable? The script could either use the plaintext value of the secret, or expect JSON and use syntax similar to what AWS ECS uses, where the reference specifies that the secret value is JSON and which key to pull the token from, e.g. arn:aws:secretsmanager:region:aws_account_id:secret:secret-name:json-key:version-stage:version-id.
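
To illustrate the idea (purely a sketch of the proposal, not something the layer supports today), resolving such a reference could look roughly like this:

# Sketch of resolving an ECS-style extended secret reference of the form
#   arn:aws:secretsmanager:region:acct:secret:name[:json-key[:version-stage[:version-id]]]
# This is an illustration of the proposal only; none of it exists in the layer today.
import json

import boto3


def resolve_access_token(reference: str) -> str:
    parts = reference.split(":")
    base_arn = ":".join(parts[:7])  # the plain Secrets Manager ARN
    json_key, version_stage, version_id = (parts[7:] + ["", "", ""])[:3]

    kwargs = {"SecretId": base_arn}
    if version_stage:
        kwargs["VersionStage"] = version_stage
    if version_id:
        kwargs["VersionId"] = version_id

    value = boto3.client("secretsmanager").get_secret_value(**kwargs)["SecretString"]
    # With a JSON key, treat the secret as a JSON object; otherwise use the raw value.
    return json.loads(value)[json_key] if json_key else value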

This mechanism works with arn:aws:lambda:us-east-2:254067382080:layer:splunk-apm:222. Trying it with the latest version of the Lambda layer, arn:aws:lambda:us-east-2:254067382080:layer:splunk-apm:365, this is what we see in the logs.

Lambda starts up

INIT_START Runtime Version: python:3.9.v18  Runtime Version ARN: arn:aws:lambda:us-east-2::runtime:edb5a058bfa782cb9cedc6d534ac8b8c193bc28e9a9879d9f5ebaaf619cd0fc0

We see this error, which we have always gotten and which doesn't seem to cause a problem, though it would be nice if we didn't see it.

2023/03/23 01:13:16 [ERROR] Exporter endpoint must be set when SPLUNK_REALM is not set. To export data, set either a realm and access token or a custom exporter endpoint.

The commit SHA of the Splunk wrapper is logged

[splunk-extension-wrapper] splunk-extension-wrapper, version: 4552de7

The OTEL collector listening on localhost starts up successfully. The SPLUNK_ACCESS_TOKEN is not set yet in our case.

{
    "level": "info",
    "ts": 1679533996.8630877,
    "msg": "Launching OpenTelemetry Lambda extension",
    "version": "v0.69.1"
}

{
    "level": "info",
    "ts": 1679533996.8672311,
    "logger": "telemetryAPI.Listener",
    "msg": "Listening for requests",
    "address": "sandbox:53612"
}

{
    "level": "info",
    "ts": 1679533996.8673244,
    "logger": "telemetryAPI.Client",
    "msg": "Subscribing",
    "baseURL": "http://127.0.0.1:9001/2022-07-01/telemetry"
}

TELEMETRY   Name: collector State: Subscribed   Types: [Platform]
{
    "level": "info",
    "ts": 1679533996.8688502,
    "logger": "telemetryAPI.Client",
    "msg": "Subscription success",
    "response": "\"OK\""
}

{
    "level": "info",
    "ts": 1679533996.874017,
    "caller": "service/telemetry.go:90",
    "msg": "Setting up own telemetry..."
}

{
    "level": "Basic",
    "ts": 1679533996.8743467,
    "caller": "service/telemetry.go:116",
    "msg": "Serving Prometheus metrics",
    "address": ":8888"
}

{
    "level": "info",
    "ts": 1679533996.8772216,
    "caller": "service/service.go:128",
    "msg": "Starting otelcol-lambda...",
    "Version": "v0.69.1",
    "NumCPU": 2
}

{
    "level": "info",
    "ts": 1679533996.8773112,
    "caller": "extensions/extensions.go:41",
    "msg": "Starting extensions..."
}

{
    "level": "info",
    "ts": 1679533996.8773668,
    "caller": "service/pipelines.go:86",
    "msg": "Starting exporters..."
}

{
    "level": "info",
    "ts": 1679533996.877425,
    "caller": "service/pipelines.go:90",
    "msg": "Exporter is starting...",
    "kind": "exporter",
    "data_type": "traces",
    "name": "otlphttp"
}

{
    "level": "info",
    "ts": 1679533996.8788476,
    "caller": "service/pipelines.go:94",
    "msg": "Exporter started.",
    "kind": "exporter",
    "data_type": "traces",
    "name": "otlphttp"
}

{
    "level": "info",
    "ts": 1679533996.8789244,
    "caller": "service/pipelines.go:98",
    "msg": "Starting processors..."
}

{
    "level": "info",
    "ts": 1679533996.8789926,
    "caller": "service/pipelines.go:110",
    "msg": "Starting receivers..."
}

{
    "level": "info",
    "ts": 1679533996.8790362,
    "caller": "service/pipelines.go:114",
    "msg": "Receiver is starting...",
    "kind": "receiver",
    "name": "otlp",
    "pipeline": "traces"
}

{
    "level": "warn",
    "ts": 1679533996.8790877,
    "caller": "internal/warning.go:51",
    "msg": "Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks",
    "kind": "receiver",
    "name": "otlp",
    "pipeline": "traces",
    "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"
}

{
    "level": "info",
    "ts": 1679533996.8791919,
    "caller": "otlpreceiver@v0.70.0/otlp.go:94",
    "msg": "Starting GRPC server",
    "kind": "receiver",
    "name": "otlp",
    "pipeline": "traces",
    "endpoint": "0.0.0.0:4317"
}

{
    "level": "warn",
    "ts": 1679533996.8792677,
    "caller": "internal/warning.go:51",
    "msg": "Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks",
    "kind": "receiver",
    "name": "otlp",
    "pipeline": "traces",
    "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"
}

{
    "level": "info",
    "ts": 1679533996.8793197,
    "caller": "otlpreceiver@v0.70.0/otlp.go:112",
    "msg": "Starting HTTP server",
    "kind": "receiver",
    "name": "otlp",
    "pipeline": "traces",
    "endpoint": "0.0.0.0:4318"
}

{
    "level": "info",
    "ts": 1679533996.879386,
    "caller": "service/pipelines.go:118",
    "msg": "Receiver started.",
    "kind": "receiver",
    "name": "otlp",
    "pipeline": "traces"
}

{
    "level": "info",
    "ts": 1679533996.8794274,
    "caller": "service/service.go:145",
    "msg": "Everything is ready. Begin running and processing data."
}

Our own wrapper starts executing, fetching the token from the input secret and setting the SPLUNK_ACCESS_TOKEN environment variable

[WRAPPER] - INFO - START
[WRAPPER] - INFO - Fetching Splunk token
[WRAPPER] - INFO - Fetching arn:aws:secretsmanager:us-east-2:my-aws-acct-id:secret:splunk-token-secret
[WRAPPER] - INFO - END

The Splunk extension wrapper begins executing, called from our own wrapper. With the change in #114, this script unsets SPLUNK_ACCESS_TOKEN so that traces are sent to the localhost collector, which is supposed to have already been set up with the token.

EXTENSION   Name: collector State: Ready    Events: [INVOKE, SHUTDOWN]
EXTENSION   Name: splunk-extension-wrapper  State: Ready    Events: [INVOKE, SHUTDOWN]

We get a request. Ingesting traces through the localhost collector fails with 401 Unauthorized, and the retries eventually time out the Lambda.

START RequestId: 2bdc5088-8c42-42eb-9013-79f41f191fd4 Version: $LATEST

[WARNING]   2023-03-23T01:13:20.564Z    2bdc5088-8c42-42eb-9013-79f41f191fd4    Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
{
    "level": "error",
    "ts": 1679534000.7317784,
    "caller": "exporterhelper/queued_retry.go:394",
    "msg": "Exporting failed. The error is not retryable. Dropping data.",
    "kind": "exporter",
    "data_type": "traces",
    "name": "otlphttp",
    "error": "Permanent error: error exporting items, request to https://ingest.us1.signalfx.com:443/v2/trace/otlp responded with HTTP Status Code 401",
    "dropped_items": 8,
    "stacktrace": "go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/queued_retry.go:394\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/traces.go:137\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/queued_retry.go:294\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesExporter.func2\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/traces.go:116\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\tgo.opentelemetry.io/collector/consumer@v0.70.0/traces.go:36\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/internal/trace/otlp.go:55\ngo.opentelemetry.io/collector/receiver/otlpreceiver.handleTraces\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/otlphttp.go:47\ngo.opentelemetry.io/collector/receiver/otlpreceiver.(*otlpReceiver).registerTraceConsumer.func1\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/otlp.go:210\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2462\ngo.opentelemetry.io/collector/config/confighttp.(*decompressor).wrap.func1\n\tgo.opentelemetry.io/collector@v0.70.0/config/confighttp/compression.go:162\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Handler).ServeHTTP\n\tgo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.37.0/handler.go:210\ngo.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP\n\tgo.opentelemetry.io/collector@v0.70.0/config/confighttp/clientinfohandler.go:39\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"
}

{
    "level": "error",
    "ts": 1679534000.731938,
    "caller": "exporterhelper/queued_retry.go:296",
    "msg": "Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures.",
    "kind": "exporter",
    "data_type": "traces",
    "name": "otlphttp",
    "dropped_items": 8,
    "stacktrace": "go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/queued_retry.go:296\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesExporter.func2\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/traces.go:116\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\tgo.opentelemetry.io/collector/consumer@v0.70.0/traces.go:36\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/internal/trace/otlp.go:55\ngo.opentelemetry.io/collector/receiver/otlpreceiver.handleTraces\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/otlphttp.go:47\ngo.opentelemetry.io/collector/receiver/otlpreceiver.(*otlpReceiver).registerTraceConsumer.func1\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/otlp.go:210\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2462\ngo.opentelemetry.io/collector/config/confighttp.(*decompressor).wrap.func1\n\tgo.opentelemetry.io/collector@v0.70.0/config/confighttp/compression.go:162\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Handler).ServeHTTP\n\tgo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.37.0/handler.go:210\ngo.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP\n\tgo.opentelemetry.io/collector@v0.70.0/config/confighttp/clientinfohandler.go:39\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"
}

[WARNING]   2023-03-23T01:13:20.734Z    2bdc5088-8c42-42eb-9013-79f41f191fd4    Transient error Internal Server Error encountered while exporting span batch, retrying in 1s.
[WARNING]   2023-03-23T01:13:21.797Z    2bdc5088-8c42-42eb-9013-79f41f191fd4    Transient error Internal Server Error encountered while exporting span batch, retrying in 2s.
[WARNING]   2023-03-23T01:13:23.856Z    2bdc5088-8c42-42eb-9013-79f41f191fd4    Transient error Internal Server Error encountered while exporting span batch, retrying in 4s.
[WARNING]   2023-03-23T01:13:27.919Z    2bdc5088-8c42-42eb-9013-79f41f191fd4    Transient error Internal Server Error encountered while exporting span batch, retrying in 8s.
{
    "level": "error",
    "ts": 1679534015.9845555,
    "caller": "exporterhelper/queued_retry.go:394",
    "msg": "Exporting failed. The error is not retryable. Dropping data.",
    "kind": "exporter",
    "data_type": "traces",
    "name": "otlphttp",
    "error": "Permanent error: error exporting items, request to https://ingest.us1.signalfx.com:443/v2/trace/otlp responded with HTTP Status Code 401",
    "dropped_items": 8,
    "stacktrace": "go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/queued_retry.go:394\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/traces.go:137\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/queued_retry.go:294\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesExporter.func2\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/traces.go:116\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\tgo.opentelemetry.io/collector/consumer@v0.70.0/traces.go:36\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/internal/trace/otlp.go:55\ngo.opentelemetry.io/collector/receiver/otlpreceiver.handleTraces\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/otlphttp.go:47\ngo.opentelemetry.io/collector/receiver/otlpreceiver.(*otlpReceiver).registerTraceConsumer.func1\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/otlp.go:210\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2462\ngo.opentelemetry.io/collector/config/confighttp.(*decompressor).wrap.func1\n\tgo.opentelemetry.io/collector@v0.70.0/config/confighttp/compression.go:162\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Handler).ServeHTTP\n\tgo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.37.0/handler.go:210\ngo.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP\n\tgo.opentelemetry.io/collector@v0.70.0/config/confighttp/clientinfohandler.go:39\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"
}

{
    "level": "error",
    "ts": 1679534015.984702,
    "caller": "exporterhelper/queued_retry.go:296",
    "msg": "Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures.",
    "kind": "exporter",
    "data_type": "traces",
    "name": "otlphttp",
    "dropped_items": 8,
    "stacktrace": "go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).send\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/queued_retry.go:296\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesExporter.func2\n\tgo.opentelemetry.io/collector@v0.70.0/exporter/exporterhelper/traces.go:116\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\tgo.opentelemetry.io/collector/consumer@v0.70.0/traces.go:36\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/internal/trace/otlp.go:55\ngo.opentelemetry.io/collector/receiver/otlpreceiver.handleTraces\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/otlphttp.go:47\ngo.opentelemetry.io/collector/receiver/otlpreceiver.(*otlpReceiver).registerTraceConsumer.func1\n\tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.70.0/otlp.go:210\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2462\ngo.opentelemetry.io/collector/config/confighttp.(*decompressor).wrap.func1\n\tgo.opentelemetry.io/collector@v0.70.0/config/confighttp/compression.go:162\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Handler).ServeHTTP\n\tgo.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.37.0/handler.go:210\ngo.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP\n\tgo.opentelemetry.io/collector@v0.70.0/config/confighttp/clientinfohandler.go:39\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"
}

[WARNING]   2023-03-23T01:13:35.985Z    2bdc5088-8c42-42eb-9013-79f41f191fd4    Transient error Internal Server Error encountered while exporting span batch, retrying in 16s.
[WARNING]   2023-03-23T01:13:50.564Z    2bdc5088-8c42-42eb-9013-79f41f191fd4    Timeout was exceeded in force_flush().
END RequestId: 2bdc5088-8c42-42eb-9013-79f41f191fd4
tsloughter-splunk commented 1 year ago

I'm thinking something similar to what you were doing before, but as a wrapper around starting the collector, would be best. That is, if there is a way to have your wrapper run before the collector, it could set the access token first. What do you think?

dude0001 commented 1 year ago

I agree with that idea. This crossed my mind and is why I asked, "Is there a way we can delay the OTEL collector starting up?" In general it would be nice to be able to control when the collector starts if needed. Another reason is to be able to redirect the logs: we have a separate compliance requirement that all our logs go to Splunk, not CloudWatch, so the collector and the wrapper code in this layer logging to CloudWatch is problematic for us. That might be a separate issue we need to open, but it's another benefit of having some control over when the collector starts.

We were sending traces directly from our app to SignalFx. I definitely see the value in routing through the collector and having that be an asynchronous process. I think the original change that broke us is a good direction for this layer.

tsloughter-splunk commented 1 year ago

@dude0001 ah, the logs issue should be resolvable with a custom collector config. We can make this simpler in upcoming versions.

dude0001 commented 1 year ago

@tsloughter-splunk should I create a separate issue for the logs concern? Is there an example of using a custom collector config?

tsloughter-splunk commented 1 year ago

@dude0001 yes, another issue would be good for tracking this. Sadly, the custom collector config isn't actually going to work at this time; work is needed on the OpenTelemetry collector first. I've been trying to come up with a suggestion for the time being that doesn't hit CloudWatch, but there may not be a good one. Disabling CloudWatch would just lose the collector logs, and I doubt that is acceptable? So until there is a way to do this with the collector, a way to bypass it may be needed.

dude0001 commented 1 year ago

I created #132 for the logs issue.