Enable Splunk logging for Layer0

diemonster commented 6 years ago

Several Layer0 users (e.g. @andycmaj and @mrjavaguy) have requested Splunk integration with Layer0, which would most likely entail replicating CloudWatch entries to Splunk via Lambda or a similar mechanism.

Splunk has a decent guide on replicating Cloudwatch logs to Splunk.

Not sure how we'd want to configure this for a given Layer0 instance. Per environment could potentially work well?

andycmaj commented 6 years ago

Per environment would be great. That way you don't spam Splunk with logs from dev environments unless you opt in.

Questions:

What are you thinking in terms of routing to a specific Splunk instance?
What would the events look like? It would be nice to get a preview of the event schema and iterate on that here.

andycmaj commented 6 years ago

Another thought... allow users to plug in a log provider of their choosing?

tribaljack commented 6 years ago

One ask: container metrics (CPU/Memory/Disk, etc.), aggregate by service and version.

sesh-kebab commented 6 years ago

Proposed design for replicating Layer0 logs, currently captured in CloudWatch, to Splunk

Ideally leverage the existing logging pipeline
Ideally wrap logs into structured log events with L0 environment, service & task details

Proposal A

The first and simpler proposal would allow the user to specify the logConfiguration attribute in a given task definition, when creating a Layer0 deploy.
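
As a rough illustration of Proposal A, the sketch below builds a container definition whose logConfiguration uses Docker's splunk log driver. The option keys (splunk-url, splunk-token, splunk-source, splunk-format, splunk-index) are the driver's own; the helper name, container name, image, URL, and token are hypothetical placeholders, and how Layer0 would pass this through a deploy is not settled here.

```python
import json


def splunk_log_configuration(hec_url, hec_token, source, index=None):
    """Build an ECS logConfiguration block for Docker's splunk log driver.

    Hypothetical helper for illustration only; not part of Layer0.
    """
    options = {
        "splunk-url": hec_url,      # Splunk HEC endpoint
        "splunk-token": hec_token,  # HEC key, stored in the task definition
        "splunk-source": source,
        "splunk-format": "json",
    }
    if index:
        options["splunk-index"] = index
    return {"logDriver": "splunk", "options": options}


# Placeholder container definition; every value here is an assumption.
container = {
    "name": "api",
    "image": "example/api:latest",
    "essential": True,
    "memory": 128,
    "logConfiguration": splunk_log_configuration(
        "https://splunk.example.com:8088",
        "REPLACE-WITH-HEC-TOKEN",
        source="l0-dev-api",
    ),
}

print(json.dumps({"containerDefinitions": [container]}, indent=2))
```

Note that the HEC token ends up in the task definition itself, which is the key-sharing concern listed in the cons.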

Pros:

Ultimate flexibility (can use any provider not just Splunk)
Minimal development effort to enable

Cons:

Logging cannot be toggled on/off once a Deploy has been created
The logs wouldn't be structured the same way as logs written via the ImsHealth.Logging library
Doesn't re-use the existing pipeline when writing directly to the Splunk HEC endpoint (not great for blue/green deployments). Writing directly to Splunk HEC also isn't desirable because of potential scaling issues.
Would require sharing Splunk HEC keys

Proposal B (log forwarding terraform module specific to our logging pipeline)

Current Logging pipeline

App/Service -> Kinesis Stream -> Lambda -> Splunk

Leverage the existing pipeline by forwarding the CloudWatch logs to one of the Kinesis streams (ppe, development, production, etc.).

             (existing pipeline) App/Service -> Kinesis Stream -> Lambda -> Splunk
                                                 ^
                                                 |
Layer0 Hosted -> Cloudwatch -> Lambda (L0 logs) -

New Standalone Log Forwarder Terraform module

Would take the L0 instance name, stream name, and AWS creds as inputs, and create a new Lambda and CloudWatch trigger that process and forward logs to a Kinesis stream.

Pros:

The L0 logs Lambda can augment the log event with additional metadata (L0 instance, environment, service, etc.) before it is forwarded to a stream
Re-use existing logging pipeline (scales better)

Cons:

Requires more development effort than Proposal A.
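
To give a feel for the Proposal B forwarder and a first cut at an event schema, here is a minimal sketch of the L0 logs Lambda. The CloudWatch Logs subscription payload format (base64-encoded, gzipped JSON under awslogs.data) and the Kinesis put_records call are standard AWS; everything else is an assumption: the log stream naming convention "l0/<environment>/<service>/<task>", the output field names, and the L0_INSTANCE and STREAM_NAME environment variables are all hypothetical.

```python
import base64
import gzip
import json


def decode_cloudwatch_event(event):
    """Unpack a CloudWatch Logs subscription event (base64 + gzip + JSON)."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def enrich(payload, l0_instance):
    """Wrap each raw log line in a structured event with L0 metadata.

    ASSUMPTION: log streams are named "l0/<environment>/<service>/<task>".
    Missing segments fall back to None.
    """
    parts = payload["logStream"].split("/")
    environment, service, task = (parts[1:4] + [None] * 3)[:3]
    return [
        {
            "l0_instance": l0_instance,
            "environment": environment,
            "service": service,
            "task": task,
            "log_group": payload["logGroup"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        }
        for log_event in payload["logEvents"]
    ]


def to_kinesis_records(events):
    """Convert enriched events into the record shape put_records expects."""
    return [
        {
            "Data": json.dumps(e).encode("utf-8"),
            "PartitionKey": e["service"] or "unknown",
        }
        for e in events
    ]


def handler(event, context):
    """Lambda entry point: decode, enrich, forward to the Kinesis stream."""
    import os
    import boto3  # imported lazily so the helpers above run without AWS deps

    payload = decode_cloudwatch_event(event)
    if payload.get("messageType") != "DATA_MESSAGE":
        return 0  # skip CloudWatch control messages
    records = to_kinesis_records(enrich(payload, os.environ["L0_INSTANCE"]))
    client = boto3.client("kinesis")
    # put_records accepts at most 500 records per call
    for i in range(0, len(records), 500):
        client.put_records(
            StreamName=os.environ["STREAM_NAME"], Records=records[i:i + 500]
        )
    return len(records)
```

A real implementation would also need to check FailedRecordCount in the put_records response and retry any failed records; that is left out here to keep the sketch short.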

diemonster commented 6 years ago

@sesh-kebab my vote is proposal B.

Reason being: The additional metadata would be helpful given that we're planning on getting rid of hashIDs for layer0 entities. But also, just having that metadata makes searching and such so much nicer (I'm imagining). Secondly, re-using our existing logging pipeline seems like the "right" thing to do in the longer term for many reasons, esp since blue/green deployments should be our default for safely updating services.

I'm sure someone could argue that fixing to Splunk specifically was short-sighted, but if proposal A doesn't require that much dev time, it seems like we could create another issue and address that when we need it.
