Feature request: Add no-op support for collector lambda layer

jerrytfleung commented 8 months ago

Is your feature request related to a problem? Please describe. If Config.Validate() of a component returns an error, the collector Lambda layer cannot start in AWS Lambda. As a result, the user's Lambda function is broken.

Describe the solution you'd like Depending on the component, an invalid component configuration may not need to fail the whole collector Lambda layer. We could let that component run as a no-op.
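A minimal sketch of the idea, not the collector's actual code: the types below (`exporterConfig`, `Component`, `noopComponent`, `newComponent`) are hypothetical stand-ins defined only for illustration. When validation fails, the constructor logs the error and returns a no-op stand-in instead of failing startup.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Component is a hypothetical stand-in, not the collector's real interface.
type Component interface {
	Start() error
	Name() string
}

// exporterConfig is a made-up config with one required field.
type exporterConfig struct {
	Endpoint string
}

// Validate reports an error when the required field is missing.
func (c exporterConfig) Validate() error {
	if c.Endpoint == "" {
		return errors.New("endpoint must not be empty")
	}
	return nil
}

// realExporter represents the component that does actual work.
type realExporter struct{ cfg exporterConfig }

func (e realExporter) Start() error { return nil }
func (e realExporter) Name() string { return "exporter" }

// noopComponent does nothing; it only lets the layer keep starting.
type noopComponent struct{ reason error }

func (n noopComponent) Start() error { return nil }
func (n noopComponent) Name() string { return "noop" }

// newComponent returns the real component when the config is valid and a
// no-op stand-in (plus a logged warning) when it is not.
func newComponent(cfg exporterConfig) Component {
	if err := cfg.Validate(); err != nil {
		log.Printf("invalid config, running as no-op: %v", err)
		return noopComponent{reason: err}
	}
	return realExporter{cfg: cfg}
}

func main() {
	fmt.Println(newComponent(exporterConfig{}).Name())
	fmt.Println(newComponent(exporterConfig{Endpoint: "localhost:4317"}).Name())
}
```

The trade-off is visible in the sketch: the function keeps running, but the only trace of the misconfiguration is a log line.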

Describe alternatives you've considered I tried removing all config validation logic from the component and moving it to the Start function, so that an invalid config just prints a message instead. However, the opentelemetry-collector-contrib code reviewer would like to check whether there is another way to go.

Additional context PR review comment The component PR

serkan-ozal commented 2 months ago

I am not sure whether switching to noop mode when the configuration is invalid is the correct approach. It might be confusing for users and, as far as I know, it doesn't align with how OTel configurations are handled.

Instead of noop, default values could be used for invalid configs, failing fast when an invalid config has no default value.
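A rough sketch of that alternative, under the assumption of a made-up `settings` struct in which one field has a sensible default and the other does not (none of these names come from the collector):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// settings is a hypothetical config used only for illustration.
type settings struct {
	Timeout  time.Duration // has a sensible default
	Endpoint string        // has no default
}

// defaultTimeout is an assumed default, chosen arbitrarily for the example.
const defaultTimeout = 5 * time.Second

// resolve replaces an invalid value with its default where one exists and
// returns an error (fail fast) where no default can be chosen.
func resolve(s settings) (settings, error) {
	if s.Timeout <= 0 {
		s.Timeout = defaultTimeout
	}
	if s.Endpoint == "" {
		return settings{}, errors.New("endpoint is required and has no default")
	}
	return s, nil
}

func main() {
	s, err := resolve(settings{Timeout: -1, Endpoint: "localhost:4317"})
	fmt.Println(s.Timeout, err)

	_, err = resolve(settings{Timeout: time.Second})
	fmt.Println(err)
}
```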

WDYT @tylerbenson?

cheempz commented 2 months ago

> I am not sure whether switching to noop mode when the configuration is invalid is the correct approach. It might be confusing for users and, as far as I know, it doesn't align with how OTel configurations are handled.
>
> Instead of noop, default values could be used for invalid configs, failing fast when an invalid config has no default value.
>
> WDYT @tylerbenson?

Adding some more context: it's reasonable for otelcol outside of Lambda to fail fast on invalid config, since the only consequence is that the collector doesn't run; it doesn't bring down the entire host. But in Lambda, the otelcol extension failing means the entire Lambda runtime crashes, kind of like crashing the entire VM because otelcol didn't start. To me this is a pretty terrible user experience.

tylerbenson commented 2 months ago

I can see both arguments here, though I'm leaning towards fail fast being the better option. Might be worth discussing in the SIG meeting.

Lambda versions are generally immutable, so it's nice to know immediately if you configured something wrong. If a deployment is urgent, the rollback can be as easy as removing the collector layer and redeploying.

serkan-ozal commented 2 months ago

> But in Lambda, the otelcol extension failing means the entire Lambda runtime crashes, kind of like crashing the entire VM because otelcol didn't start.

BTW, I am really not sure whether the entire Lambda environment crashes if/when an extension fails gracefully (by calling the /extension/init/error endpoint).

serkan-ozal commented 2 months ago

Also, AWS Lambda encourages extensions to fail fast: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html#runtimes-extensions-init-error

cheempz commented 2 months ago

That's good to know re: the /extension/init/error endpoint; it seems the otelcol extension is already using it. From a quick test of an otelcol extension with a misconfigured pipeline, it doesn't result in a crash but in an Extension.InitError:

Test Event Name
(unsaved) test event

Response
{
  "errorType": "Extension.InitError",
  "errorMessage": "RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Error: exit code 0"
}

Function Logs
TELEMETRY   Name: collector State: Subscribed   Types: [Platform]
{"level":"warn","ts":1725481081.1169627,"logger":"lifecycle.manager","msg":"Failed to start the extension","error":"invalid configuration: service::pipelines::logs: references receiver \"telemetryapi\" which is not configured"}
EXTENSION   Name: collector State: InitError    Events: [INVOKE, SHUTDOWN]
INIT_REPORT Init Duration: 439.08 ms    Phase: init Status: error   Error Type: Extension.InitError
TELEMETRY   Name: collector State: Already subscribed   Types: [Platform]
{"level":"warn","ts":1725481086.9609792,"logger":"lifecycle.manager","msg":"Failed to start the extension","error":"unable to start, otelcol state is Closed"}
EXTENSION   Name: collector State: InitError    Events: [INVOKE, SHUTDOWN]
INIT_REPORT Init Duration: 5971.95 ms   Phase: invoke   Status: error   Error Type: Extension.InitError
START RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Version: $LATEST
RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Error: exit code 0
Extension.InitError
END RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7
REPORT RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7  Duration: 6031.77 ms    Billed Duration: 6032 ms    Memory Size: 128 MB Max Memory Used: 75 MB

Request ID
32e01a48-1b65-4559-9ba6-ec4620f689d7

Still, the end result is that the application is unavailable, and I do think that's pretty disruptive even given the recourse available. It goes against the expectation that observability tools strive to cause as little disruption to the application as possible.

serkan-ozal commented 2 months ago

I still prefer failing fast when something is not configured properly. Pre-prod environments are there to catch such cases before they reach production. If it is silently ignored (even though there are error logs), I am pretty sure that most people and companies will not notice it until they find out, after some time, that they are missing traces.

I agree that both approaches have their own pros and cons, but IMO, being aware of issues earlier is more important than suppressing them.

open-telemetry / opentelemetry-lambda

Feature request: Add no-op support for collector lambda layer #1181