Open dude0001 opened 1 year ago
I'm thinking something similar to what you were doing before but as a wrapper for starting the collector would be best. As in, if there is a way to have your wrapper run before the collector so it can set the access token. What do you think?
I agree with that idea. This crossed my mind and is why I asked "Is there a way we can delay the OTEL collector starting up?". I think in general it would be nice to be able to control when the collector starts up if needed. Another reason is to be able to redirect the logs. We have another compliance issue: all our logs should be going to Splunk, not CloudWatch. So the collector and the wrapper code in this Layer logging to CloudWatch is problematic for us. That might be a separate issue we need to open, but it's another benefit to having some control over when the collector starts.
We were sending traces directly from our app to SignalFX. I definitely see the value in routing through the collector and that being an async process. I think the original change that broke us is a good direction for this Layer.
@dude0001 ah, the logs issue should be resolvable with a custom collector config. We can make this simpler with following versions.
@tsloughter-splunk should I create a separate issue for the logs concern? Is there an example of using a custom collector config?
@dude0001 yes, another issue would be good for tracking this. Sadly the custom collector config isn't actually going to work (at this time). There is work on the OpenTelemetry collector needed first. I've been trying to come up with a suggestion for the time being that doesn't hit CloudWatch but there may not be a good one. Disabling CloudWatch will just lose the collector logs and I doubt that is acceptable? So it may be that until there is a way to do this with the collector a way to bypass it is needed.
I created #132 for the logs issue.
In our environment, we are asked to not put the ingestion token as plaintext in the SPLUNK_ACCESS_TOKEN environment variable, as anyone is able to read this in the AWS console or APIs when describing the Lambda. To work around this, we created our own Lambda layer which is a Lambda execution wrapper around the Splunk provided wrapper. Our own wrapper expects an AWS Secrets Manager ARN as an environment variable. It then fetches the secret, parses out the token and sets the SPLUNK_ACCESS_TOKEN environment variable. Our wrapper then calls the Splunk wrapper to continue as normal.
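For context, conceptually our wrapper does something like the following (a simplified Python sketch, not our actual code; the SPLUNK_ACCESS_TOKEN_SECRET_ARN variable, the "access_token" key, and the /opt/splunk-wrapper path are placeholder names, and it assumes a runtime where boto3 is available):

```python
# Simplified sketch of what our wrapper layer does (not our exact code).
# The env var name, secret key, and wrapper path below are placeholders.
import json
import os
import sys

import boto3


def main() -> None:
    # ARN of the Secrets Manager secret holding the ingest token, passed in
    # instead of exposing SPLUNK_ACCESS_TOKEN as plaintext on the function.
    secret_arn = os.environ["SPLUNK_ACCESS_TOKEN_SECRET_ARN"]

    secret = boto3.client("secretsmanager").get_secret_value(SecretId=secret_arn)
    token = json.loads(secret["SecretString"])["access_token"]

    # Export the token for the Splunk wrapper/collector, then hand off to the
    # Splunk-provided wrapper with the original arguments so everything else
    # continues as normal.
    os.environ["SPLUNK_ACCESS_TOKEN"] = token
    splunk_wrapper = "/opt/splunk-wrapper"  # placeholder path
    os.execv(splunk_wrapper, [splunk_wrapper] + sys.argv[1:])


if __name__ == "__main__":
    main()
```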
The change in #114 has broken this flow for us. It looks like the OTEL collector starts up before our own wrapper is able to execute and set up the environment variable.
Is there a way we can delay the OTEL collector starting up? Is there another way to keep the token secret and out of the AWS Lambda console as plaintext?
Or can a mechanism be added to the Lambda Layer that fetches the token from a secret whose ARN is passed in as an environment variable? The script could either use the plaintext value of the secret, or expect JSON and use syntax similar to what AWS ECS uses, where the secret value is JSON and the reference specifies which key to pull the token from, e.g.
arn:aws:secretsmanager:region:aws_account_id:secret:secret-name:json-key:version-stage:version-id
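Something along these lines is what I'm imagining for how the Layer could resolve such a reference (purely an illustrative sketch, not existing Layer code; the resolve_access_token function is made up):

```python
# Illustrative sketch only: split an ECS-style secret reference into the real
# Secrets Manager ARN plus optional json-key / version-stage qualifiers, fetch
# the secret, and pull the token out of it.
import json

import boto3


def resolve_access_token(secret_ref: str) -> str:
    parts = secret_ref.split(":")
    # A plain Secrets Manager ARN has 7 colon-separated parts; anything after
    # that follows the ECS convention: json-key, version-stage, version-id.
    base_arn = ":".join(parts[:7])
    json_key = parts[7] if len(parts) > 7 and parts[7] else None
    version_stage = parts[8] if len(parts) > 8 and parts[8] else None

    kwargs = {"SecretId": base_arn}
    if version_stage:
        kwargs["VersionStage"] = version_stage
    value = boto3.client("secretsmanager").get_secret_value(**kwargs)["SecretString"]

    # Either the secret is the raw token, or it is JSON and json-key names the
    # field that holds the token (version-id could be handled the same way).
    return json.loads(value)[json_key] if json_key else value
```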
This mechanism works with arn:aws:lambda:us-east-2:254067382080:layer:splunk-apm:222. Trying this in the latest version of the Lambda Layer, arn:aws:lambda:us-east-2:254067382080:layer:splunk-apm:365, this is what we see in the logs:

1. Lambda starts up.
2. We see this error, which we've always gotten and which doesn't seem to cause a problem, but it would be nice if we didn't see it.
3. The commit sha of the Splunk wrapper is logged.
4. The OTEL collector listening on localhost starts up successfully. The SPLUNK_ACCESS_TOKEN is not set yet in our case.
5. Our own wrapper starts executing, fetching the token from the input secret and setting the SPLUNK_ACCESS_TOKEN environment variable.
6. The Splunk extension begins executing, as called from our own wrapper. With the change in #114, this script unsets SPLUNK_ACCESS_TOKEN so traces are sent to the localhost collector, which should already be set up with the token.
7. We get a request. Ingesting traces through the localhost collector fails with 401 Unauthorized, and the retries eventually cause the Lambda to time out.