schlarpc / overengineered-cloudfront-s3-static-website

The objectively correct static website host with S3 + CloudFront

Parse logs to JSON before pushing to CloudWatch Logs #4

Open schlarpc opened 2 years ago

schlarpc commented 2 years ago

Currently, logs are parsed just enough to extract timestamps before they are pushed to CloudWatch Logs (CWL). Parsing them into JSON would enable more effective queries through CloudWatch Logs Insights.
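For context, a minimal sketch of what "just enough to extract timestamps" could look like for an S3 server access log line. This is not the repository's code: the function name `extract_timestamp_ms` and the regex are assumptions, and it only covers the bracketed `[06/Feb/2019:00:00:38 +0000]` style that S3 access logs use. CloudFront logs lay out the date and time differently.

```python
import re
from datetime import datetime

# S3 server access logs carry the request time in square brackets,
# e.g. [06/Feb/2019:00:00:38 +0000].
_TIMESTAMP_RE = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]"
)


def extract_timestamp_ms(line: str) -> int:
    """Return the first bracketed timestamp as epoch milliseconds.

    Hypothetical helper: CloudWatch Logs expects event timestamps in
    milliseconds since the Unix epoch, so that is what this returns.
    """
    match = _TIMESTAMP_RE.search(line)
    if match is None:
        raise ValueError("no bracketed timestamp found")
    # %b assumes English month abbreviations (the default C locale).
    parsed = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    return int(parsed.timestamp() * 1000)
```

Everything else in the line stays an opaque string, which is why Insights queries can only filter on the raw message.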

Things to think about:

schlarpc commented 2 years ago

S3 access logs are impossible to parse robustly.

For instance, if a request's User-Agent contains a double quote, it is written to the log unescaped and is indistinguishable from the field's closing quote:

... "aws-cli/1.20.31 Python/3.9.6 Linux/5.10.60.1-microsoft-standard-WSL2 exec-env/[bracket]"quote"'squote' botocore/1.21.31" ...
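To make the ambiguity concrete, here is a rough sketch using a naive tokenizer of my own (an assumption, not what any real parser is known to do) that treats a field as a `[bracketed]` run, a `"quoted"` run, or a bare word. Fed the User-Agent field above, it splits one field into three:

```python
import re
from typing import List

# Naive tokenizer (illustrative assumption): a field is a [bracketed] run,
# a "quoted" run, or a run of non-whitespace characters.
_FIELD_RE = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def naive_fields(fragment: str) -> List[str]:
    """Split a log fragment into fields using the naive rules above."""
    return _FIELD_RE.findall(fragment)


# A well-behaved User-Agent is one quoted field.
clean = '"aws-cli/1.20.31 Python/3.9.6 botocore/1.21.31"'

# The User-Agent from the log line above, with its embedded double quotes.
hostile = (
    '"aws-cli/1.20.31 Python/3.9.6 '
    'Linux/5.10.60.1-microsoft-standard-WSL2 '
    'exec-env/[bracket]"quote"\'squote\' botocore/1.21.31"'
)

for token in naive_fields(hostile):
    print(token)
```

The first embedded quote ends the field early, and the rest of the User-Agent comes out as two extra bare-word tokens. A smarter tokenizer can guess differently, but it is still guessing.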

This might be workable by anchoring on the more fixed elements of the log line, but the S3 docs advise writing a parser that can handle any number of trailing elements:

Occasionally we might extend the access log record format by adding new fields to the end of each line. Therefore, you should write any code that parses server access logs to handle trailing fields that it might not understand.

Since there is no functional escaping, an earlier element in a line can inject values that are read as later elements, pushing the (original) trailing elements to the end of the line, where they are interpreted as "extensions".
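A small sketch of that injection, under loud assumptions: the four-field schema in `FIELDS` is a made-up, heavily shortened stand-in for the real record format, `render` only mimics the lack of escaping, and the tokenizer is the same naive one a positional parser might plausibly use.

```python
import re
from typing import Dict, List

# Hypothetical, heavily shortened schema; the real format has many more fields.
FIELDS = ["referrer", "user_agent", "version_id", "host_id"]

# Naive tokenizer (assumption): [bracketed], "quoted", or bare-word fields.
_FIELD_RE = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def render(user_agent: str) -> str:
    """Mimic the missing escaping: the value is wrapped in quotes verbatim."""
    return f'"-" "{user_agent}" - hostid123'


def parse_positional(fragment: str) -> Dict[str, object]:
    """Assign tokens to FIELDS by position; leftovers become 'extra'."""
    tokens: List[str] = _FIELD_RE.findall(fragment)
    record: Dict[str, object] = dict(zip(FIELDS, tokens))
    # Tolerate unknown trailing fields, as the S3 docs advise.
    record["extra"] = tokens[len(FIELDS):]
    return record


honest = parse_positional(render("curl/7.79.1"))
forged = parse_positional(render('curl" spoofed-version "tail'))

print(honest)
print(forged)
```

In the forged record, `version_id` and `host_id` are both attacker-controlled, while the genuine values land in `extra` and look exactly like the future "extension" fields the parser was told to tolerate.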

CloudTrail data events (or EventBridge, which relies on CloudTrail data events) seem like the only robust method to get well-formed S3 access logs, but I'm hesitant to implement this for a few reasons: