Here's a very detailed tutorial on how to create and configure a Lambda function that reads objects from an S3 bucket, processes them, and stores them in another S3 bucket.
I'm currently making progress on the implementation of the Lambda function using our local Docker environment.
Access for a real AWS deployment has been requested in https://github.com/wazuh/internal-devel-requests/issues/1043
Local Lambda invocation has been automated through a script.
```bash
bash amazon-security-lake/src/invoke-lambda.sh ::file::
```
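For context, invoking a Lambda container locally boils down to POSTing an event to the Runtime Interface Emulator; here's a minimal sketch of what such a script does (the event payload, bucket and key names are placeholders, and the host port mapping of 9000 to the emulator's 8080 is an assumption):

```python
import json
import urllib.request

# Hypothetical S3 event payload; the real script passes the object key as ::file::.
event = {
    "Records": [
        {"s3": {"bucket": {"name": "wazuh-aux-bucket"}, "object": {"key": "sample.json"}}}
    ]
}

# The Lambda Runtime Interface Emulator exposes this endpoint inside the container
# (port 8080, commonly mapped to 9000 on the host).
url = "http://localhost:9000/2015-03-31/functions/function/invocations"
request = urllib.request.Request(url, data=json.dumps(event).encode(), method="POST")

with urllib.request.urlopen(request) as response:
    print(response.read().decode())
```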
The deployment zip package is larger than the 50 MB limit. We need to either upload the zip to an S3 bucket or split it into layers.
Uploaded to the aux S3 bucket
Once uploaded, load the zip into the Lambda by clicking on Upload from > Amazon S3 location.
It didn't work either. We'll try to reduce the zip size by removing unneeded libraries.
By removing `boto3` and `parquet-tools` from the `requirements.txt`, the zip size is down to 66 MB. `boto3` is included already in the Lambda Python runtime: https://gist.github.com/gene1wood/4a052f39490fae00e0c3#file-all_aws_lambda_modules_python3-9-txt
The zip file is still too big to be uploaded directly, but it can be uploaded to an S3 bucket and loaded into the Lambda from there. We can dive into making it even lighter by using layers.
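For reference, loading the package from S3 can also be scripted instead of going through the console; a minimal sketch using `boto3`, with placeholder function, bucket and key names:

```python
import boto3

lambda_client = boto3.client("lambda")

# Point the function at the zip previously uploaded to the auxiliary bucket.
# Function, bucket and key names below are placeholders.
response = lambda_client.update_function_code(
    FunctionName="wazuh-ocsf-transform",
    S3Bucket="wazuh-aux-bucket",
    S3Key="lambda/package.zip",
)
print(response["LastModified"])
```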
Added code to validate whether the destination S3 bucket name is set. Program exits if not, with appropriate logging.
```
[ERROR] 2024-04-18T15:38:10.063Z 50ab3aaf-77e5-4286-94d2-6506818ee9ad Destination bucket not set. Please, set the AWS_BUCKET environment variable with the name of the Amazon Security Lake dedicated S3 bucket.
18 Apr 2024 15:38:10,063
{
  "success": false
}
```
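A minimal sketch of that validation, assuming the handler reads the destination bucket from the `AWS_BUCKET` environment variable (everything other than the variable name and the log message is illustrative):

```python
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    # Abort early if the destination bucket is not configured.
    dst_bucket = os.environ.get("AWS_BUCKET")
    if not dst_bucket:
        logger.error(
            "Destination bucket not set. Please, set the AWS_BUCKET environment "
            "variable with the name of the Amazon Security Lake dedicated S3 bucket."
        )
        return {"success": False}
    # ... transform the source objects and write the parquet output here ...
    return {"success": True}
```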
After many tries, I managed to get it working on AWS.
Here's the output when the variable is not set: (screenshot)
And this one is when the execution succeeds: (screenshot)
The parquet file is written to the root of the S3 bucket. According to the Best Practices, objects should be partitioned by source location, AWS Region, AWS account, and date:
```
bucket-name/source-location/region=region/accountId=accountID/eventDay=YYYYMMDD
```
In order to do that, we'll need to add these environment variables:
- `SOURCE_LOCATION`
- `ACCOUNT_ID`
- `IAM_ROLE_ARN`: replaces `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in production.

Using PingOne's integration for reference: https://github.com/pingone-davinci/pingone-amazon-security-lake
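As a sketch of how the partitioned key could be built from those variables (the function name, filename parameter, and use of the current UTC date are illustrative, not the actual implementation):

```python
import os
from datetime import datetime, timezone


def build_object_key(filename: str) -> str:
    """Build the partitioned S3 key following the Security Lake best practices:
    source-location/region=.../accountId=.../eventDay=YYYYMMDD/filename
    """
    source_location = os.environ["SOURCE_LOCATION"]
    region = os.environ["AWS_REGION"]  # set automatically by the Lambda runtime
    account_id = os.environ["ACCOUNT_ID"]
    event_day = datetime.now(timezone.utc).strftime("%Y%m%d")
    return (
        f"{source_location}/region={region}/accountId={account_id}/"
        f"eventDay={event_day}/{filename}"
    )

# Example: build_object_key("events.parquet")
```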
Parquet files are now uploaded to the correct path.
Note: the execution environment was edited to use 512 MB of memory and a 30-second timeout.
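For reference, the same configuration change can be made programmatically; a sketch assuming a hypothetical function name:

```python
import boto3

lambda_client = boto3.client("lambda")

# Raise memory and timeout to match the values used above.
# The function name is a placeholder.
lambda_client.update_function_configuration(
    FunctionName="wazuh-ocsf-transform",
    MemorySize=512,  # MB
    Timeout=30,      # seconds
)
```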
Environment variables:
- `AWS_BUCKET`: The name of the Amazon S3 bucket in which Security Lake stores your custom source data.
- `AWS_REGION`: The AWS Region to which the data is written.
- `SOURCE_LOCATION`: The Source Location configured in Security Lake during the Custom Source creation.
- `ACCOUNT_ID`: The AWS account ID that the records in the source partition pertain to.
- `IAM_ROLE_ARN`: The ARN of the IAM Role with access to write to the Security Lake Custom Source S3 bucket.
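Since `IAM_ROLE_ARN` replaces static credentials in production, the function would assume that role before writing to the Security Lake bucket; a minimal sketch (the helper and session names are illustrative):

```python
import os

import boto3


def get_security_lake_s3_client():
    """Assume the Security Lake custom-source role and return an S3 client."""
    sts = boto3.client("sts")
    credentials = sts.assume_role(
        RoleArn=os.environ["IAM_ROLE_ARN"],
        RoleSessionName="wazuh-lambda",  # illustrative session name
    )["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
```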
Description
Our first approach to transforming the data to OCSF and Apache Parquet is to use a Lambda function that reads our data from an auxiliary S3 bucket fed by Logstash, and uploads it to the final Amazon Security Lake S3 bucket.
We think this approach is the fastest way to complete the integration, although it's the most expensive in terms of resources.
Functional requirements
Implement a Lambda function that:
Implementation restrictions