quintilesims / layer0

Build, Manage, and Deploy Your Applications
Apache License 2.0

Enable Splunk logging for Layer0 #419

diemonster closed this issue 6 years ago

diemonster commented 6 years ago

Several Layer0 users (e.g. @andycmaj and @mrjavaguy) have requested Splunk integration with Layer0, which would most likely entail replicating CloudWatch log entries to Splunk via Lambda or a similar mechanism.

Splunk has a decent guide on replicating CloudWatch logs to Splunk.

Not sure how we'd want to make this configurable for a given Layer0 instance. Configuring it per environment could work well?

andycmaj commented 6 years ago

Per environment would be great. That way you don't spam Splunk with logs from dev environments unless you opt in.

Questions:

andycmaj commented 6 years ago

Another thought... allow users to plug in a log provider of their choosing?

tribaljack commented 6 years ago

One ask: container metrics (CPU/Memory/Disk, etc.), aggregate by service and version.

sesh-kebab commented 6 years ago

Proposed design for replicating Layer0 logs, currently captured in CloudWatch, to Splunk

Proposal A

The first, simpler proposal would let the user specify the logConfiguration attribute in a task definition when creating a Layer0 deploy.
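To make Proposal A concrete, here is a minimal sketch of what setting logConfiguration on a task definition could look like. It assumes the ECS `splunk` log driver with Docker's standard `splunk-url`, `splunk-token`, `splunk-source`, and `splunk-index` options. The helper names, the URL, and the token are hypothetical placeholders and are not part of Layer0.

```python
import copy
import json


def splunk_log_configuration(url, token, source=None, index=None):
    """Build a logConfiguration dict for the ECS 'splunk' log driver.

    The option keys are Docker's Splunk logging driver options; the
    function itself is a hypothetical helper for illustration.
    """
    options = {"splunk-url": url, "splunk-token": token}
    if source:
        options["splunk-source"] = source
    if index:
        options["splunk-index"] = index
    return {"logDriver": "splunk", "options": options}


def with_splunk_logging(task_definition, log_configuration):
    """Return a copy of the task definition with logConfiguration set
    on every container definition; the input is left unmodified."""
    updated = copy.deepcopy(task_definition)
    for container in updated.get("containerDefinitions", []):
        container["logConfiguration"] = copy.deepcopy(log_configuration)
    return updated


if __name__ == "__main__":
    # Placeholder values: substitute a real HEC endpoint and token.
    config = splunk_log_configuration(
        "https://splunk.example.com:8088", "REPLACE-WITH-HEC-TOKEN", index="main"
    )
    task = {"containerDefinitions": [{"name": "web", "image": "nginx"}]}
    print(json.dumps(with_splunk_logging(task, config), indent=2))
```

With this approach the opt-in is per deploy: a task definition without logConfiguration would keep whatever logging Layer0 applies by default.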

Pros:

Cons:

Proposal B (log-forwarding Terraform module specific to our logging pipeline)

Current Logging pipeline

App/Service -> Kinesis Stream -> Lambda -> Splunk

Leverage the existing pipeline by forwarding the CloudWatch logs to one of the Kinesis streams (ppe, development, production, etc.).

             (existing pipeline) App/Service -> Kinesis Stream -> Lambda -> Splunk
                                                 ^
                                                 |
Layer0 Hosted -> CloudWatch -> Lambda (L0 logs) -
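The "Lambda (L0 logs)" step in the diagram above could be sketched roughly as follows. CloudWatch Logs delivers subscription events as base64-encoded, gzip-compressed JSON under `awslogs.data`, and Kinesis `PutRecords` accepts at most 500 records per call. Everything else here is an assumption for illustration: the `STREAM_NAME` environment variable, the function names, and the choice to attach the log group and log stream as metadata on each record.

```python
import base64
import gzip
import json

MAX_BATCH = 500  # Kinesis PutRecords limit per request


def decode_subscription_event(event):
    """Decode a CloudWatch Logs subscription event into a dict."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_kinesis_records(payload):
    """Turn each log event into a Kinesis record, carrying the log group
    and log stream along as metadata (an assumed record shape)."""
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # skip control messages sent by CloudWatch Logs
    records = []
    for log_event in payload["logEvents"]:
        body = {
            "logGroup": payload["logGroup"],
            "logStream": payload["logStream"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        }
        records.append({
            "Data": json.dumps(body).encode("utf-8"),
            "PartitionKey": payload["logStream"],
        })
    return records


def forward(event, kinesis_client, stream_name):
    """Send all log events to the stream in batches; returns the count.

    Retrying partially failed batches (FailedRecordCount) is omitted.
    """
    records = to_kinesis_records(decode_subscription_event(event))
    for start in range(0, len(records), MAX_BATCH):
        kinesis_client.put_records(
            StreamName=stream_name,
            Records=records[start:start + MAX_BATCH],
        )
    return len(records)


def handler(event, context):
    """Lambda entry point; STREAM_NAME is a hypothetical env variable."""
    import os
    import boto3  # imported lazily so the helpers above are testable offline
    return forward(event, boto3.client("kinesis"), os.environ["STREAM_NAME"])
```

Keeping the client as a parameter of `forward` means the decoding and batching logic can be exercised with a stub, and a Terraform module would only need to wire the subscription filter, the function, and the stream name together.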

New Standalone Log Forwarder Terraform module

Pros:

Cons:

diemonster commented 6 years ago

@sesh-kebab my vote is proposal B.

Two reasons. First, the additional metadata would be helpful given that we're planning to get rid of hash IDs for Layer0 entities, and I imagine having that metadata would also make searching much nicer. Second, reusing our existing logging pipeline seems like the "right" thing to do in the longer term for many reasons, especially since blue/green deployments should be our default for safely updating services.

I'm sure someone could argue that tying this to Splunk specifically is short-sighted, but if proposal A doesn't require that much dev time, it seems like we could create another issue and address it when we need it.