vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Request for parse_aws_cloudfront_log or parse_w3c_extended_logfile function #16589

Open irvintim opened 1 year ago

irvintim commented 1 year ago

The documentation for parse_regex suggests opening a ticket to request a parse_* function be added for a log format that isn't already available.

My request is for AWS CloudFront logs, which use the W3C Extended Log File Format: tab-delimited log lines preceded by two "comment" lines at the top of the file, each prefixed with "#", one giving the version number and one listing the fields. https://www.w3.org/TR/WD-logfile.html

The format is further defined in this doc: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html

e.g.:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end
2023-02-25  06:06:53    LAX50-P3    429 129.146.75.29   GET xxxx.cloudfront.net /menu/guiw  200 -   Mozilla/5.0%20(Windows%20NT%2010.0)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/89.0.4389.114%20Safari/537.36   id=3&nsbrand=1&nsvpx=phpinfo&protocol=nonexistent.1337%22%3E    startupapp=st   LambdaGeneratedResponse O0woGPklvWn5Hx7ahYw1uQdUBF-CrSBEl8EK-kgi4u-Tacc1df1Skg==    foo.bar.com https   330 0.115   -   TLSv1.3 TLS_AES_128_GCM_SHA256  LambdaGeneratedResponse HTTP/1.1    -   -   35024   0.115   LambdaGeneratedResponse -   0   -   -
2023-02-25  06:06:54    LAX50-P3    434 129.146.75.29   GET xxxx.cloudfront.net /.//WEB-INF/web.xml 200 -   -   --  LambdaGeneratedResponse 0c6ffuHcpguIKuR1kOgR-fE0IRp2-qbuL_AqhtcrgIWSxHbKoN3jCA==    foo.bar.com https   100 0.113   -   TLSv1.3 TLS_AES_128_GCM_SHA256  LambdaGeneratedResponse HTTP/1.1    -   -   35032   0.113   LambdaGeneratedResponse-0
lens0021 commented 7 months ago

For what it's worth

parse_regex!(.message, r'(?P<date>[^\t]+)\t(?P<time>[^\t]+)\t(?P<x_edge_location>[^\t]+)\t(?P<sc_bytes>[^\t]+)\t(?P<c_ip>[^\t]+)\t(?P<cs_method>[^\t]+)\t(?P<cs_host>[^\t]+)\t(?P<cs_uri_stem>[^\t]+)\t(?P<sc_status>[^\t]+)\t(?P<cs_referer>[^\t]+)\t(?P<cs_user_agent>[^\t]+)\t(?P<cs_uri_query>[^\t]+)\t(?P<cs_cookie>[^\t]+)\t(?P<x_edge_result_type>[^\t]+)\t(?P<x_edge_request_id>[^\t]+)\t(?P<x_host_header>[^\t]+)\t(?P<cs_protocol>[^\t]+)\t(?P<cs_bytes>[^\t]+)\t(?P<time_taken>[^\t]+)\t(?P<x_forwarded_for>[^\t]+)\t(?P<ssl_protocol>[^\t]+)\t(?P<ssl_cipher>[^\t]+)\t(?P<x_edge_response_result_type>[^\t]+)\t(?P<cs_protocol_version>[^\t]+)\t(?P<fle_status>[^\t]+)\t(?P<fle_encrypted_fields>[^\t]+)\t(?P<c_port>[^\t]+)\t(?P<time_to_first_byte>[^\t]+)\t(?P<x_edge_detailed_result_type>[^\t]+)\t(?P<sc_content_type>[^\t]+)\t(?P<sc_content_len>[^\t]+)\t(?P<sc_range_start>[^\t]+)\t(?P<sc_range_end>[^\t]+)')


fdamstra commented 3 months ago

Parsing with a fixed regex only works if the format is consistent. The point of W3C logs is that they can be customized.

IIS logs also follow this format, and we see variations within logs from the same server: some have the customer IP in the third field, others the server IP.
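
To illustrate the point about customizable fields, here is a minimal Python sketch (not VRL, and not an existing Vector function) of a parser that takes its column names from the `#Fields:` directive instead of hard-coding them. The names `parse_w3c_lines` and `normalize` are hypothetical, the tab separator is an assumption that matches the CloudFront description above, and the sample lines are made up for the example.

```python
import re
from typing import Dict, Iterable, Iterator, List


def normalize(name: str) -> str:
    """Turn a W3C field name such as cs(User-Agent) into cs_user_agent."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def parse_w3c_lines(lines: Iterable[str], sep: str = "\t") -> Iterator[Dict[str, str]]:
    """Yield one dict per log line, keyed by the most recent #Fields directive."""
    fields: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Version: 1.0" or "#Fields: date time ..."
            directive, _, value = line[1:].partition(":")
            if directive.strip().lower() == "fields":
                fields = [normalize(f) for f in value.split()]
            continue
        if not fields:
            continue  # no #Fields directive seen yet, so columns cannot be named
        # zip() stops at the shorter sequence, so short lines yield fewer keys
        yield dict(zip(fields, line.split(sep)))


# Hypothetical input: the field list changes partway through the stream.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs(User-Agent)",
    "2023-02-25\t06:06:53\t129.146.75.29\tMozilla/5.0",
    "#Fields: date time s-ip",
    "2023-02-25\t06:06:54\t10.0.0.1",
]
records = list(parse_w3c_lines(sample))
```

Because the field list is re-read whenever a new `#Fields:` line appears, the same code handles the case where the third column is the client IP in one file and the server IP in another. A real `parse_w3c_extended_logfile` function would still need to decide how to carry the header across events, since Vector typically sees each line as a separate message.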