matanolabs / matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
https://matano.dev
Apache License 2.0
1.47k stars 100 forks

Error parsing compressed file containing Cloudwatch event #13

Closed marcin-kwasnicki closed 1 year ago

marcin-kwasnicki commented 2 years ago

Hello,

I ran into this issue while testing Matano on some sample log files. TransformerLambda fails with the message: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: stream did not contain valid UTF-8', transformer/src/main.rs:538:58

The file that I want to parse is delivered by Kinesis Firehose and contains CloudTrail logs streamed from CloudWatch to S3. It has no extension, and its content type is marked as 'application/octet-stream'. Inside is a JSON file representing a CloudWatch event. An important note on this type of file can be found here: https://docs.aws.amazon.com/firehose/latest/dev/writing-with-cloudwatch-logs.html. "CloudWatch log events are compressed with gzip level 6. If you want to specify OpenSearch Service or Splunk as the destination for the delivery stream, use a Lambda function to uncompress the records to UTF-8 and single-line JSON." I suspect that some additional parsing is required for this type of file.
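The panic is reproducible outside Matano: gzip output begins with the magic bytes 0x1f 0x8b, and 0x8b can never appear at that position in valid UTF-8, so decoding the compressed bytes as text must fail. A minimal Python sketch of the failure mode (illustrative only; Matano's transformer is written in Rust):

```python
import gzip

# Simulate a Firehose object: gzip-compressed JSON, delivered without
# any file extension or compression hint in the content type.
payload = gzip.compress(b'{"messageType": "DATA_MESSAGE"}')

try:
    payload.decode("utf-8")  # what a naive text reader effectively does
except UnicodeDecodeError as err:
    print("stream did not contain valid UTF-8:", err.reason)
```

Decompressing first (`gzip.decompress(payload)`) yields the original UTF-8 JSON, which is why the fix below is about detecting or declaring the compression before decoding.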

Samrose-Ahmed commented 2 years ago

We should support this. We can either make the parsing more advanced and figure out the compression from some magic bytes (I believe there's a crate for this), or simply allow the user to explicitly specify the compression for a log source in the configuration. @shaeqahmed will look into this.
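The magic-byte approach works because each compression format starts with a fixed signature. The crate mentioned above isn't named here; this Python sketch just illustrates the idea, with a small, illustrative subset of signatures:

```python
import gzip

# Leading signatures for some common compression formats (an
# illustrative subset, not the full list a real library would carry).
MAGIC_BYTES = {
    b"\x1f\x8b": "gzip",          # RFC 1952
    b"\x28\xb5\x2f\xfd": "zstd",  # Zstandard frame
    b"\x04\x22\x4d\x18": "lz4",   # LZ4 frame
}

def infer_compression(data: bytes):
    """Return a compression name if the buffer starts with known magic
    bytes, or None for plain/unknown data."""
    for magic, name in MAGIC_BYTES.items():
        if data.startswith(magic):
            return name
    return None

# A gzip payload is identified before any UTF-8 decoding is attempted.
print(infer_compression(gzip.compress(b"{}")))  # -> gzip
print(infer_compression(b'{"plain": "json"}'))  # -> None
```

Sniffing only needs the first few bytes of the object, so it adds no meaningful cost to ingestion, but an explicit per-source config option is still useful for formats with no distinctive header.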

Samrose-Ahmed commented 2 years ago

I have added https://github.com/matanolabs/matano/commit/c7e58d7e2e21d407bc997bc55e557b6b3a01309b .

So you can now add to your log_source.yml:

# log_source.yml
ingest:
  compression: "gzip"

and Matano will use that compression.

Leaving issue open if we want to add the more advanced auto compression inference.

shaeqahmed commented 2 years ago

This configuration has been replaced with compression auto-inference, so manually specifying the compression format in the log source is no longer necessary 💯

shaeqahmed commented 1 year ago

Confirmed that this issue has been fixed. I have also added an automated method to parse CloudWatch logs written to S3 from a subscription for line-by-line consumption in Matano, using the flag:

ingest:
  s3_source:
    is_from_cloudwatch_log_subscription: true
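Per the Firehose documentation linked above, a CloudWatch Logs subscription delivers gzip-compressed JSON envelopes whose `logEvents` array holds the individual log lines. A minimal Python sketch of unwrapping one such record (illustrative only, not Matano's actual Rust implementation):

```python
import gzip
import json

def iter_cloudwatch_log_events(raw: bytes):
    """Decompress a CloudWatch Logs subscription record and yield each
    contained log event's message for line-by-line consumption."""
    envelope = json.loads(gzip.decompress(raw).decode("utf-8"))
    if envelope.get("messageType") != "DATA_MESSAGE":
        return  # skip CONTROL_MESSAGE (e.g. subscription test) records
    for event in envelope.get("logEvents", []):
        yield event["message"]

# Synthetic record shaped like a subscription delivery of CloudTrail logs.
record = gzip.compress(json.dumps({
    "messageType": "DATA_MESSAGE",
    "logGroup": "/aws/cloudtrail",
    "logEvents": [
        {"id": "1", "timestamp": 0, "message": '{"eventName": "ConsoleLogin"}'},
    ],
}).encode("utf-8"))

for line in iter_cloudwatch_log_events(record):
    print(line)  # -> {"eventName": "ConsoleLogin"}
```

With the flag set, Matano handles both steps (gunzip plus envelope unwrapping) automatically, so each `logEvents` message lands as its own log line.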