logstash-plugins / logstash-patterns-core

Incorrect pattern for AWS CLOUDFRONT_ACCESS_LOG #232

Closed jpleger closed 3 years ago

jpleger commented 6 years ago

Not sure if it's a version change or a log format change on the AWS side, but a few fields in the CLOUDFRONT_ACCESS_LOG pattern currently capture the wrong values. This is due to the use of GREEDYDATA, which can match across the tab delimiters and so offsets the fields that follow. To address it, I will probably use [^\t\r\n] to delimit the fields.

2018-07-24 22:22:47 SEA19 557 196.52.43.106 GET d2lv5my8ejglq4.cloudfront.net / 301 - Mozilla/5.0%2520(compatible;%2520nsrbot/1.0;%2520&%2343;http://netsystemsresearch.com) - - Redirect IKaDjLqf5T8ptafxZk_HNJ49zZ1N4SuI8f_kdivoUvPNZFnzpuKhKA== jamespleger.com http 127 0.000 - - - Redirect HTTP/1.1 - -
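
The offset described above can be reproduced outside Logstash. The following is a minimal sketch using Python's `re` module as a stand-in for grok (which uses a different regex engine), assuming the record is tab-delimited as in the CloudFront documentation. The shortened field values and the three-field pattern are hypothetical and are not the shipped CLOUDFRONT_ACCESS_LOG definition.

```python
import re

# Hypothetical excerpt of a record: status, referrer, user agent,
# query string, cookies, edge result type, joined by tabs.
line = "\t".join(["301", "-", "Mozilla/5.0%2520(compatible)", "-", "-", "Redirect"])

# A pattern that only knows about three fields after the status,
# written with greedy wildcards (GREEDYDATA is ".*" in grok).
greedy = re.compile(
    r"(?P<status>\d+)\t(?P<referrer>.*)\t(?P<agent>.*)\t(?P<query>.*)"
)

# The same pattern with "anything except tab, CR or LF" for each field.
nottab = re.compile(
    r"(?P<status>\d+)\t(?P<referrer>[^\t\r\n]+)"
    r"\t(?P<agent>[^\t\r\n]+)\t(?P<query>[^\t\r\n]+)"
)

greedy_fields = greedy.match(line).groupdict()
nottab_fields = nottab.match(line).groupdict()

print(repr(greedy_fields["referrer"]))  # swallows several tab-separated fields
print(repr(nottab_fields["referrer"]))  # stays inside one field
```

With the greedy version, the first wildcard absorbs tabs, so the user agent ends up inside `referrer` and the later captures shift. The negated character class cannot cross a tab, so each capture stays aligned with its column.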


- Steps to Reproduce: add cloudfront logs.

Reference:
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html

jpleger commented 6 years ago

Will submit a PR shortly to fix this.

jpleger commented 6 years ago

Looking at the patterns in common use in the grok patterns, I think we should add a NOTTAB pattern, which can also help address the problem in the future if AWS adds new fields.

This should solve it:

# patterns/grok-patterns
NOTTAB [^\t\r\n]+

# patterns/aws
CLOUDFRONT_ACCESS_LOG (?<timestamp>%{YEAR}-%{MONTHNUM}-%{MONTHDAY}\t%{TIME})\t%{WORD:x_edge_location}\t(?:%{NUMBER:sc_bytes:int}|-)\t%{IPORHOST:clientip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status:int}\t%{NOTTAB:referrer}\t%{NOTTAB:agent}\t%{NOTTAB:cs_uri_query}\t%{NOTTAB:cookies}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes:int}\t%{NOTTAB:time_taken:float}\t%{NOTTAB:x_forwarded_for}\t%{NOTTAB:ssl_protocol}\t%{NOTTAB:ssl_cipher}\t%{NOTTAB:x_edge_response_result_type}\t%{NOTTAB:cs_protocol_version}(?:\t%{NOTTAB:fle_status}\t%{NOTTAB:fle_encrypted_fields})?
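
The optional group at the end of the pattern is what lets it accept records both with and without the two field-level-encryption columns. Here is a rough sketch of that idea in Python's `re` (not grok itself), covering only the tail of the record. The `tail` pattern and the sample strings are illustrative assumptions rather than part of the proposed change.

```python
import re

NOTTAB = r"[^\t\r\n]+"  # Python spelling of the proposed NOTTAB pattern

# Hypothetical tail of the record: one required field followed by an
# optional pair of tab-prefixed fields, mirroring "(?:\t...\t...)?" above.
tail = re.compile(
    rf"(?P<cs_protocol_version>{NOTTAB})"
    rf"(?:\t(?P<fle_status>{NOTTAB})\t(?P<fle_encrypted_fields>{NOTTAB}))?"
)

# Older record shape: the line ends after the protocol version.
old_style = tail.fullmatch("HTTP/1.1").groupdict()

# Newer record shape: two extra columns, both logged as "-".
new_style = tail.fullmatch("HTTP/1.1\t-\t-").groupdict()

print(old_style)
print(new_style)
```

In the older shape the two optional captures come back as `None`, and in the newer shape they hold the logged values, so one pattern covers both layouts.
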
jsvd commented 6 years ago

@jpleger I'm happy to merge such a PR if you get to create it (also, there are tons of examples of how to write a test for the pattern so please include that as well in the PR)

thepatrick commented 6 years ago

I've also started seeing x_edge_location not match on WORD - it can contain hyphens.
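
A small check of why WORD stops short on such values, sketched with Python's `re` as a stand-in for grok. `WORD` below is grok's `\b\w+\b` definition, and the hyphen-tolerant class is the one used in a later comment in this thread. The value `SEA19-C1` is a hypothetical example of a hyphenated edge location.

```python
import re

WORD = r"\b\w+\b"               # grok's WORD definition
EDGE_LOCATION = r"\b[\w\-]+\b"  # hyphen-tolerant alternative


def matches_field(pattern, value):
    """Return True if `pattern` consumes the whole tab-delimited field."""
    return re.fullmatch(pattern, value) is not None


samples = ["SEA19", "SEA19-C1"]  # the second value is hypothetical
word_results = [matches_field(WORD, s) for s in samples]
edge_results = [matches_field(EDGE_LOCATION, s) for s in samples]

print(word_results)
print(edge_results)
```

Because `\w` does not include the hyphen, WORD can only consume the part before it, and the tab expected right after the field is never reached.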

julien-c commented 4 years ago

Yes, I can confirm that this is broken. => _grokparsefailure

Also, the AWS doc on how to plug CloudFront logs into Logstash isn't correct either: https://aws.amazon.com/premiumsupport/knowledge-center/cloudfront-logs-elasticsearch/

(fails with new log fields)

tsacha commented 4 years ago

Hey,

I was working on parsing the new data fields.

For reference, the Amazon changelog is here: https://aws.amazon.com/about-aws/whats-new/2019/12/cloudfront-detailed-logs/

I'm using this pattern:

%{DATE_EU:date}\t%{TIME:time}\t(?<x_edge_location>\b[\w\-]+\b)\t(?:%{NUMBER:sc_bytes:int}|-)\t%{IPORHOST:c_ip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status:int}\t%{NOTTAB:referrer}\t%{NOTTAB:user_agent}\t%{NOTTAB:cs_uri_query}\t%{NOTTAB:cookie}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes:int}\t%{NUMBER:time_taken:float}\t%{NOTTAB:x_forwarded_for}\t%{NOTTAB:ssl_protocol}\t%{NOTTAB:ssl_cipher}\t%{NOTTAB:x_edge_response_result_type}\t%{NOTTAB:cs_protocol_version}\t%{NOTTAB:fle_status}\t%{NOTTAB:fle_encrypted_field}(\t%{INT:c_port:int}\t%{NUMBER:time_to_first_byte:float}\t%{NOTTAB:x_edge_detailed_result_type}\t%{NOTTAB:sc_content_type}\t(?:%{NUMBER:sc_content_len:int}|-)\t(?:%{NUMBER:sc_content_start:int}|-)\t(?:%{NUMBER:sc_content_end:int}|-))?

Together with the NOTTAB pattern mentioned by @jpleger:

NOTTAB [^\t\r\n]+
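
Several numeric fields in the pattern above are written as `(?:%{NUMBER:...:int}|-)` because CloudFront logs `-` when a value is absent. A minimal sketch of the same "number or dash" handling in Python follows. The helper name `parse_content_len` is hypothetical, and `\d+` is a simplification of grok's NUMBER, which also accepts signs and decimals.

```python
import re

# Simplified stand-in for (?:%{NUMBER:sc_content_len:int}|-):
# either a run of digits (captured) or a bare dash (not captured).
FIELD = re.compile(r"(?:(?P<sc_content_len>\d+)|-)")


def parse_content_len(raw):
    """Return the length as int, or None when the field was logged as '-'."""
    m = FIELD.fullmatch(raw)
    if m is None:
        raise ValueError(f"unexpected field value: {raw!r}")
    value = m.group("sc_content_len")
    return int(value) if value is not None else None


print(parse_content_len("557"))
print(parse_content_len("-"))
```

Keeping the dash outside the named capture means the field is simply left unset for missing values, so the integer conversion never sees a `-`.
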
kares commented 3 years ago

This is expected to be addressed by the updated ECS-compliant AWS pattern set from #287. The wrong behaviour of the legacy CLOUDFRONT_ACCESS_LOG pattern is spec-ed.