Closed jpleger closed 3 years ago
Will submit a PR shortly to fix this.
Looking at the patterns in common use in grok-patterns, I think we should add a NOTTAB pattern, which would also help in the future if AWS adds new fields.
This should solve it:
```
# patterns/grok-patterns
NOTTAB [^\t\r\n]+
# patterns/aws
CLOUDFRONT_ACCESS_LOG (?<timestamp>%{YEAR}-%{MONTHNUM}-%{MONTHDAY}\t%{TIME})\t%{WORD:x_edge_location}\t(?:%{NUMBER:sc_bytes:int}|-)\t%{IPORHOST:clientip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status:int}\t%{NOTTAB:referrer}\t%{NOTTAB:agent}\t%{NOTTAB:cs_uri_query}\t%{NOTTAB:cookies}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes:int}\t%{NOTTAB:time_taken:float}\t%{NOTTAB:x_forwarded_for}\t%{NOTTAB:ssl_protocol}\t%{NOTTAB:ssl_cipher}\t%{NOTTAB:x_edge_response_result_type}\t%{NOTTAB:cs_protocol_version}(?:\t%{NOTTAB:fle_status}\t%{NOTTAB:fle_encrypted_fields})?
```
@jpleger I'm happy to merge such a PR once you create it. There are plenty of examples of how to write a test for a pattern, so please include one in the PR as well.
I've also started seeing x_edge_location not match on WORD - it can contain hyphens.
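For anyone who wants to reproduce this outside Logstash, here is a rough Python sketch. It assumes grok's WORD is defined as `\b\w+\b` and uses Python's `re` rather than the Oniguruma engine grok runs on, so treat it as an approximation. `SEA19-C1` is a made-up value used only to show the hyphen case.

```python
import re

# Grok's WORD pattern is \b\w+\b, so it cannot span a hyphen.
WORD = r"\b\w+\b"
# A character class that also allows hyphens.
EDGE_LOCATION = r"\b[\w\-]+\b"

plain = "SEA19"          # value taken from the sample log line in this issue
hyphenated = "SEA19-C1"  # hypothetical value, for illustration only

# fullmatch requires the whole field to match, like a tab-delimited grok field.
word_matches_plain = re.fullmatch(WORD, plain) is not None
word_matches_hyphenated = re.fullmatch(WORD, hyphenated) is not None
class_matches_hyphenated = re.fullmatch(EDGE_LOCATION, hyphenated) is not None

print(word_matches_plain, word_matches_hyphenated, class_matches_hyphenated)
```

The second check is the one that fails, which is what shows up as a _grokparsefailure for the whole line.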
Yes, I can confirm that this is broken. => _grokparsefailure
Also, the AWS doc on how to plug CloudFront logs into Logstash isn't correct either (it fails with the new log fields): https://aws.amazon.com/premiumsupport/knowledge-center/cloudfront-logs-elasticsearch/
Hey,
I was working on parsing the new data fields.
For reference, the AWS changelog entry is here: https://aws.amazon.com/about-aws/whats-new/2019/12/cloudfront-detailed-logs/
I'm using this pattern:
%{DATE_EU:date}\t%{TIME:time}\t(?<x_edge_location>\b[\w\-]+\b)\t(?:%{NUMBER:sc_bytes:int}|-)\t%{IPORHOST:c_ip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status:int}\t%{NOTTAB:referrer}\t%{NOTTAB:user_agent}\t%{NOTTAB:cs_uri_query}\t%{NOTTAB:cookie}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes:int}\t%{NUMBER:time_taken:float}\t%{NOTTAB:x_forwarded_for}\t%{NOTTAB:ssl_protocol}\t%{NOTTAB:ssl_cipher}\t%{NOTTAB:x_edge_response_result_type}\t%{NOTTAB:cs_protocol_version}\t%{NOTTAB:fle_status}\t%{NOTTAB:fle_encrypted_field}(\t%{INT:c_port:int}\t%{NUMBER:time_to_first_byte:float}\t%{NOTTAB:x_edge_detailed_result_type}\t%{NOTTAB:sc_content_type}\t(?:%{NUMBER:sc_content_len:int}|-)\t(?:%{NUMBER:sc_content_start:int}|-)\t(?:%{NUMBER:sc_content_end:int}|-))?
It relies on the following pattern mentioned by @jpleger:
NOTTAB [^\t\r\n]+
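As a quick sanity check of the optional trailing group idea, here is a small Python sketch. It is heavily simplified: only the last two old fields and the first two of the new fields are modelled, the sample tails (including `443` and `0.001`) are invented for illustration, and Python's `re` stands in for grok, so this is not a test of the full pattern.

```python
import re

NOTTAB = r"[^\t\r\n]+"

# Last two "old" fields, followed by an optional group for newer fields.
TAIL = re.compile(
    rf"(?P<fle_status>{NOTTAB})\t(?P<fle_encrypted_fields>{NOTTAB})"
    rf"(?:\t(?P<c_port>\d+)\t(?P<time_to_first_byte>\d+(?:\.\d+)?))?$"
)

old_tail = "-\t-"               # line written before the new fields existed
new_tail = "-\t-\t443\t0.001"   # hypothetical values for the new fields

old = TAIL.match(old_tail)
new = TAIL.match(new_tail)

# Both formats match; the new fields are simply absent (None) for old lines.
print(old["c_port"], new["c_port"], new["time_to_first_byte"])
```

Because the new fields sit inside one optional group, old and new log lines can go through the same pattern without a second grok block.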
Not sure if it's a version change or a log format change on the AWS side, but a few fields in the CLOUDFRONT_ACCESS_LOG pattern currently capture the wrong values. This is caused by the use of GREEDYDATA, which can match across the tab delimiters and shift the fields that follow. To address this, I will probably use [^\t\r\n] to delimit the fields.
2018-07-24 22:22:47 SEA19 557 196.52.43.106 GET d2lv5my8ejglq4.cloudfront.net / 301 - Mozilla/5.0%2520(compatible;%2520nsrbot/1.0;%2520&%2343;http://netsystemsresearch.com) - - Redirect IKaDjLqf5T8ptafxZk_HNJ49zZ1N4SuI8f_kdivoUvPNZFnzpuKhKA== jamespleger.com http 127 0.000 - - - Redirect HTTP/1.1 - -
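The offset can be sketched in Python using the tail of the line above. Assumptions: the fields are tab-separated in the real log (the tabs are rebuilt here with `join`), GREEDYDATA is `.*`, and Python's `re` behaves like grok's regex engine for this simple case. Only the last few fields of the pattern are modelled.

```python
import re

# Tail of the sample line, starting at cs_bytes, with tabs restored.
tail = "\t".join(["127", "0.000", "-", "-", "-", "Redirect", "HTTP/1.1", "-", "-"])

NAMES = ["time_taken", "x_forwarded_for", "ssl_protocol",
         "ssl_cipher", "x_edge_response_result_type"]

def build(field_regex):
    # cs_bytes followed by five tab-delimited fields using the given sub-pattern.
    parts = [r"(?P<cs_bytes>\d+)"] + [f"(?P<{n}>{field_regex})" for n in NAMES]
    return re.compile(r"\t".join(parts))

greedy = build(r".*").match(tail)            # GREEDYDATA-style
strict = build(r"[^\t\r\n]+").match(tail)    # NOTTAB-style

# The greedy version swallows extra tab-separated fields into time_taken.
print(repr(greedy["time_taken"]))
print(repr(strict["time_taken"]), repr(strict["x_edge_response_result_type"]))
```

With `.*`, time_taken ends up holding several fields at once and everything after it is shifted, while the tab-excluding class keeps each capture inside its own column.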