matanolabs / matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
https://matano.dev
Apache License 2.0

SQS Ingestion CDK Modifications #23

Closed kai-ten closed 1 year ago

kai-ten commented 1 year ago

The reason for the naming difference on line 155 of DPMainStack, versus the normal pattern on line 156, is worth noting, as it is vital to passing metadata to the Transformer lambda. Here's why:

Consider the following queue names that were generated in testing:

Notice that right after the "-", the log_source.name and a table from that log_source are joined together as lowercase characters. In the lambda code, we extract the substring between the "-" and the "M" in Matano to get the necessary metadata from an SQS message (see 1 below).
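
Roughly, the extraction looks like the sketch below. This is only an illustration, not the actual transformer code, and it assumes a hypothetical queue name shaped like `MatanoDPMainStack-<logsource><table>Matano...`:

```typescript
// Sketch of the substring extraction described above (illustrative only).
// Assumes a hypothetical queue name of the form
// "MatanoDPMainStack-<logsource><table>Matano...".
function parseQueueMetadata(eventSourceArn: string): string {
  // The queue name is the last ":"-delimited segment of the SQS event's eventSourceARN.
  const queueName = eventSourceArn.split(":").pop() ?? "";
  // Metadata starts right after the first "-" ...
  const start = queueName.indexOf("-") + 1;
  // ... and runs until the next capital "M" (the "M" in the following "Matano..."
  // token), since the metadata itself is all lowercase characters.
  const end = queueName.indexOf("M", start);
  return queueName.substring(start, end);
}
```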

The current limitations / workarounds:

  1. We are not able to match against any metadata in the SQS message other than event_source_arn that would let the Transformer lambda distinguish S3-sourced events from SQS-sourced events. This is because we cannot rely on current log producers (vector.dev, fluentd, file system forwarders, Datadog, etc.) to support SQS message_attributes.
  2. We must place the metadata as early as possible in the queue name, both for efficiency and because the maximum length of a queue name is 80 characters. Since MatanoDPMainStack- is 18 characters, the combined log source and table name for SQS is limited to 62 characters.
shaeqahmed commented 1 year ago

As you mentioned regarding the character limitations, I think relying on parsing a substring from the queue name is too hacky. Since we are using CDK, can you instead pass an environment variable that is a JSON map of queue name to Matano table, as static configuration that can be referenced in the Transformer? For example:

```
{
  "$queue-name-1-cdk-token": [<log_source1>, <table1>],
  "$queue-name-2-cdk-token": [<log_source2>, <table2>],
  ...
}
```

This would be a more stable solution.
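
A rough CDK-side sketch of the idea follows; the helper names and types here (`SqsSource`, `buildQueueToTableMap`, the `QUEUE_TO_TABLE` variable name) are illustrative assumptions, not existing Matano code:

```typescript
import * as cdk from "aws-cdk-lib";
import * as sqs from "aws-cdk-lib/aws-sqs";

// Illustrative shape for an SQS-ingested log source; not an existing Matano type.
interface SqsSource {
  queue: sqs.Queue;
  logSourceName: string;
  tableName: string;
}

// Build the { queueName: [log_source, table] } map as a JSON string that can be
// set as an environment variable on the transformer lambda. Queue names are
// unresolved CDK tokens at synth time; this sketch assumes stack.toJsonString
// resolves them correctly at deploy time.
function buildQueueToTableMap(stack: cdk.Stack, sources: SqsSource[]): string {
  const map: Record<string, [string, string]> = {};
  for (const s of sources) {
    map[s.queue.queueName] = [s.logSourceName, s.tableName];
  }
  return stack.toJsonString(map);
}

// e.g. transformerLambda.addEnvironment("QUEUE_TO_TABLE", buildQueueToTableMap(this, sqsSources));
```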

kai-ten commented 1 year ago

I like it; that's a much better idea than where this was going.

Is there any particular reason for passing the env variables to MatanoTable rather than into the Transformer lambda env vars? Just curious if there is anything deeper to your recommendation... asking because, at a glance, it may be more convenient to pass the queue names / table names into the environment variables of the lambda. I won't be able to dig deeper until later today, though.

shaeqahmed commented 1 year ago

We are on the same page here. That's exactly what I meant: pass a "map of the queue name to matano table", e.g. QUEUE_TO_TABLE = {...}, as an environment variable to the transformer lambda, where it can be used.
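
For illustration, the lookup on the transformer side could then look roughly like this. TypeScript is used only for the sketch, the `QUEUE_TO_TABLE` name follows the example above, and the actual transformer implementation may differ:

```typescript
// Illustrative runtime lookup in the transformer; not the actual transformer code.
type QueueToTable = Record<string, [string, string]>; // queue name -> [log_source, table]

// Parse the static map passed in by CDK as an environment variable.
const queueToTable: QueueToTable = JSON.parse(process.env.QUEUE_TO_TABLE ?? "{}");

// Resolve the Matano log source and table for an incoming SQS record.
function resolveTable(eventSourceArn: string): [string, string] | undefined {
  // The queue name is the last ":"-delimited segment of the eventSourceARN.
  const queueName = eventSourceArn.split(":").pop() ?? "";
  return queueToTable[queueName];
}
```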

kai-ten commented 1 year ago

Latest and greatest: sticking closer to the S3Sources pattern to simplify the dependency the Transformer lambda has on the SQSSources.

We should probably think about build speed optimizations over time, but I would rather move faster for the time being if there are no objections (and please excuse the rebase rather than merge, lol; I worked alone for too long and have developed some bad habits).

Samrose-Ahmed commented 1 year ago

Rebase and force push is fine, but I think you've jumbled the commit order in the rebase. The PR should just contain your commits, but it has some commits from the main branch interspersed with unmerged commits. Can you take a look?

kai-ten commented 1 year ago

Oh, that's interesting. I'll take a look and open up a fresh PR once I get back from my flight. Looks good otherwise?

Samrose-Ahmed commented 1 year ago

👍 LGTM otherwise.