matanolabs / matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
https://matano.dev
Apache License 2.0

Support lookup metadata from file/payload to enrich events for sources such as AWS ELB #91

Open shaeqahmed opened 1 year ago

shaeqahmed commented 1 year ago

Problem

AWS ELB does not include the AWS account ID in each event payload; this information appears only in the path, e.g. aws-elb-logs/<account-id>/.... As a user, I would like to be able to query my AWS ELB logs using an AWS account ID field to filter/narrow down events.

Ideas

To support this generically in our VRL transform without hurting performance (that is, without requiring thread synchronization via a mutex in the hot path), we would need to add a custom VRL function for looking up metadata (get_payload_metadata_field). We would also add a function like set_payload_metadata that could be called from the select_table_from_payload_metadata VRL expression to parse and populate file-level metadata that the corresponding events can look up later. For AWS ELB, this might look like the following (pseudoscript):

ingest:
    select_table_from_payload_metadata: |
        # inject additional metadata by parsing s3 key and extracting aws_account_id and adding it to the `__metadata` special field
        .__metadata |= parse_regex(.__metadata.s3.key, r'/AWSLogs/(?P<aws_account_id>\d{12})/elasticloadbalancing/') ?? {}

        # just return the default table
        "default"

Then from the transform we could use this info like:

transform: |
    .cloud.account.id = get_payload_metadata_field("aws_account_id")
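
To make the two-phase idea concrete, here is a minimal Python sketch (not Matano code, and not VRL). It assumes the file-level metadata is parsed once per S3 object and then handed to each event transform as a read-only mapping, which is one way to avoid locking in the hot path. The function names mirror the ones proposed above but are hypothetical; the S3 key in the usage example is made up.

```python
import copy
import re
from types import MappingProxyType

# Same pattern as in the pseudoscript above; note it expects a leading "/"
# before "AWSLogs", i.e. a key with some prefix.
ACCOUNT_ID_RE = re.compile(
    r"/AWSLogs/(?P<aws_account_id>\d{12})/elasticloadbalancing/"
)


def parse_payload_metadata(s3_key):
    """Run once per file: extract file-level metadata from the S3 key."""
    match = ACCOUNT_ID_RE.search(s3_key)
    fields = match.groupdict() if match else {}
    # Read-only view: workers can share it without a mutex.
    return MappingProxyType(fields)


def get_payload_metadata_field(metadata, name):
    """Hypothetical lookup helper; returns None when the field is absent."""
    return metadata.get(name)


def transform(event, metadata):
    """Run once per event: copy the event and set cloud.account.id."""
    out = copy.deepcopy(event)
    account = out.setdefault("cloud", {}).setdefault("account", {})
    account["id"] = get_payload_metadata_field(metadata, "aws_account_id")
    return out


if __name__ == "__main__":
    key = "my-prefix/AWSLogs/123456789012/elasticloadbalancing/us-east-1/x.log.gz"
    metadata = parse_payload_metadata(key)
    print(transform({"message": "GET /"}, metadata))
```

The regex runs once per file rather than once per event, so the per-event cost is a single dictionary lookup.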

This is a bit too complicated for my liking, though, and it is a pretty niche use case (the only current applications are AWS ELB and potentially Route53). Other sources generally do (and should) include important metadata in the event itself rather than relying on context bubbled up from the path, so I'd like to hold off on implementing a solution until it becomes clearer that it is worth it.

rams3sh commented 1 year ago

I was thinking of the following idea:

Matano provides context data (such as the object path, the S3 bucket, and anything else that adds significant value from a transformation standpoint; the set of fields can grow in the future) for every event passed to the transformer. This is not enabled by default: it is part of the log schema behind a flag, provide_matano_context: true, which has to be set explicitly, to avoid the overhead of Matano generating metadata for every event/log. The data can live under .matano.context or the __metadata field of the event being transformed (hoping that neither name is used in any log structure, so that it does not collide with genuine log fields). The user can then transform this raw context in their VRL code. This helps in cases like the one above, where some value has to be derived from contextual data provided by Matano. It also avoids tampering with or hacking on select_table_from_payload_metadata.
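
A rough Python sketch of this opt-in variant, under the assumption that the context is attached to each event under __metadata only when the flag is set, and that user transform code derives whatever it needs from the raw context. The function names and the shape of the context object are hypothetical and are not taken from Matano.

```python
import re

ACCOUNT_ID_RE = re.compile(r"/AWSLogs/(?P<aws_account_id>\d{12})/")


def attach_context(event, bucket, key, provide_matano_context=False):
    """Attach raw S3 context only when the schema opts in."""
    if not provide_matano_context:
        return event  # default: no per-event overhead
    enriched = dict(event)
    enriched["__metadata"] = {"s3": {"bucket": bucket, "key": key}}
    return enriched


def user_transform(event):
    """User-side logic: derive the account ID from the raw context."""
    out = {k: v for k, v in event.items() if k != "__metadata"}
    key = event.get("__metadata", {}).get("s3", {}).get("key", "")
    match = ACCOUNT_ID_RE.search(key)
    if match:
        # For brevity this overwrites any existing "cloud" field.
        out["cloud"] = {"account": {"id": match.group("aws_account_id")}}
    return out
```

The trade-off against the first idea is that the context is copied onto every event and the regex runs per event, which is the overhead the explicit flag is meant to contain.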