vectordotdev / vrl

Vector Remap Language
Mozilla Public License 2.0

parse_nginx_log fails on empty referer #643

Closed (mtekel closed this issue 6 months ago)

mtekel commented 8 months ago

Hello,

it seems the pattern definition for the nginx common log used by the parse_nginx_log function expects a non-empty referer: https://github.com/vectordotdev/vrl/blob/2b39353b3236e0aac26314ad47153238c52aa2ff/src/stdlib/log_util.rs#L136C1-L136C106

In practice, the referer can be empty, see https://stackoverflow.com/questions/6880659/in-what-cases-will-http-referer-be-empty. E.g. when the end user:

- entered the site URL in browser address bar itself.
- visited the site by a browser-maintained bookmark.
- visited the site as first page in a new window/tab/session, in some browsers.
- clicked a link on a page having restrictive tag.
- clicked a link on a page having restrictive Referrer-Policy header.
- clicked a link having rel="noreferrer".
- clicked a link in an external application (i.e. not a webbrowser, e.g. Flash).
- switched from a https URL to a http URL.
- has security software installed (antivirus/firewall/etc) which strips the referrer from all requests.
- is behind a proxy which strips the referrer from all requests.
- visited the site programmatically (like, curl) without setting the referrer header (bots!).

This means that any time we get a client request with an empty referer, Vector fails to parse the nginx log line. We get thousands of these failures each day.

Example code for the VRL playground (vrl 0.9.1, vector cebe6284):

Working: https://playground.vrl.dev/?state=eyJwcm9ncmFtIjoic3RydWN0dXJlZCA9IHBhcnNlX25naW54X2xvZyEoLm1lc3NhZ2UsXCJpbmdyZXNzX3Vwc3RyZWFtaW5mb1wiKVxuLiA9IG1lcmdlKC4sIHN0cnVjdHVyZWQpXG4iLCJldmVudCI6eyJtZXNzYWdlIjoiLSAtIC0gWzAzL09jdC8yMDIzOjE0OjIxOjM2ICswMDAwXSBcIlBPU1QgLyBIVFRQLzEuMVwiIDQ5OSAwIFwiLVwiIFwiLVwiIDExMjggMC4wMDMgW3NvbWUuYWRkcmVzcy5jb21dIFstXSBodHRwcyAwIDAuMDA0IDAwNSAxMC41My4xMzQuNDcifSwiaXNfanNvbmwiOmZhbHNlLCJlcnJvciI6bnVsbH0%3D

Broken: https://playground.vrl.dev/?state=eyJwcm9ncmFtIjoic3RydWN0dXJlZCA9IHBhcnNlX25naW54X2xvZyEoLm1lc3NhZ2UsXCJpbmdyZXNzX3Vwc3RyZWFtaW5mb1wiKVxuLiA9IG1lcmdlKC4sIHN0cnVjdHVyZWQpXG4iLCJldmVudCI6eyJtZXNzYWdlIjoiLSAtIC0gWzAzL09jdC8yMDIzOjE0OjIxOjM2ICswMDAwXSBcIlBPU1QgLyBIVFRQLzEuMVwiIDQ5OSAwIFwiXCIgXCItXCIgMTEyOCAwLjAwMyBbc29tZS5hZGRyZXNzLmNvbV0gWy1dIGh0dHBzIDAgMC4wMDQgMDA1IDEwLjUzLjEzNC40NyJ9LCJpc19qc29ubCI6ZmFsc2UsImVycm9yIjpudWxsfQ%3D%3D

drmason13 commented 6 months ago

Thanks for the report. I can take a look at this one. At first glance, it's simply a matter of swapping a + for a * in a regex.
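To illustrate the `+` versus `*` difference, here is a minimal sketch in Python. The two patterns are simplified, hypothetical stand-ins covering only the status, size, referer and user-agent fields; they are not the actual regex from log_util.rs, and the sample lines are shortened versions of the log lines from the playground links above.

```python
import re

# Hypothetical, simplified patterns (NOT the real VRL regex).
# STRICT uses `+` for the quoted fields: at least one character is required.
STRICT = re.compile(
    r'(?P<status>\d+) (?P<size>\d+) "(?P<referer>[^"]+)" "(?P<agent>[^"]+)"'
)
# LENIENT uses `*` for the quoted fields: zero or more characters, so "" is accepted.
LENIENT = re.compile(
    r'(?P<status>\d+) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Shortened sample lines: referer "-" (parses today) and referer "" (fails today).
working = '499 0 "-" "-"'
broken = '499 0 "" "-"'


def referer(pattern, line):
    """Return the captured referer, or None if the whole line does not match."""
    match = pattern.fullmatch(line)
    return None if match is None else match.group("referer")


if __name__ == "__main__":
    for name, pattern in (("strict", STRICT), ("lenient", LENIENT)):
        for line in (working, broken):
            print(name, repr(line), "->", repr(referer(pattern, line)))
```

With the `+` quantifier the line with `""` does not match at all, which mirrors the parse failure reported here. With `*` the same line matches and the referer is captured as an empty string.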