influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Logparser common log format error (nginx/apache) #1810

Closed tympanix closed 8 years ago

tympanix commented 8 years ago

Bug report

The logparser plugin fails to parse nginx access log lines for HTTP basic auth requests when the username contains digits or spaces.

This applies to both the COMMON_LOG_FORMAT and COMBINED_LOG_FORMAT grok patterns. The issue may be relevant for Apache logs as well.

Relevant telegraf.conf:

# Stream and parse log file(s).
[[inputs.logparser]]
  files = ["/var/log/nginx/access.log"]
  from_beginning = false

  [inputs.logparser.grok]
    patterns = ["%{COMMON_LOG_FORMAT}"]
    measurement = "nginx_access_log"

System info:

Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u2 (2016-01-02) x86_64 GNU/Linux
Telegraf - version 1.0.0
nginx version: nginx/1.6.2

Steps to reproduce:

  1. Set up telegraf.conf file as above
  2. Echo the examples to the logfile (see additional info)
  3. Telegraf will not match the grok pattern to the log

Expected behavior:

Telegraf matches the log lines using either the COMMON_LOG_FORMAT or the COMBINED_LOG_FORMAT pattern and passes them on to the outputs.

Actual behavior:

When the username contains digits, the log line is ignored. When it contains spaces, the words are parsed as other attributes (e.g. client_ip will be parsed as one of the words).

Additional info:

Here are some example log lines that cause the error:

Using numbers in the http basic auth username:

127.0.0.1 - username123 [25/Sep/2016:00:19:43 +0200] "GET / HTTP/1.1" 401 590 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"

Using spaces in the http basic auth username:

127.0.0.1 - my username here [25/Sep/2016:00:17:36 +0200] "GET / HTTP/1.1" 401 590 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
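
Both symptoms can be reproduced outside Telegraf with a plain regular expression. The following is a minimal Python sketch, not Telegraf's grok implementation: `USER_LETTERS_ONLY` is a hypothetical stand-in for a username pattern that lacks digits and spaces, `LINE_RE` is a simplified approximation of the start of the common log format, and the search is assumed to be unanchored. The sample lines are the two from above, trimmed after the byte count.

```python
import re

# Hypothetical stand-in for a username pattern without digits or spaces.
# Assumption for illustration only, not the pattern shipped with Telegraf.
USER_LETTERS_ONLY = r"[a-zA-Z.@\-+_%]+"

# Simplified, unanchored log prefix: client, ident, auth, timestamp.
LINE_RE = re.compile(
    r"(?P<client_ip>\S+) "
    rf"(?P<ident>{USER_LETTERS_ONLY}) "
    rf"(?P<auth>{USER_LETTERS_ONLY}) "
    r"\[(?P<ts>[^\]]+)\]"
)

SAMPLES = {
    "digits": '127.0.0.1 - username123 [25/Sep/2016:00:19:43 +0200] "GET / HTTP/1.1" 401 590',
    "spaces": '127.0.0.1 - my username here [25/Sep/2016:00:17:36 +0200] "GET / HTTP/1.1" 401 590',
}


def parse(line):
    """Return the named groups of the first match, or None if nothing matches."""
    m = LINE_RE.search(line)
    return m.groupdict() if m else None


for name, line in SAMPLES.items():
    print(name, parse(line))
```

Under these assumptions the line with digits does not match at all, while the line with spaces matches at the wrong offset, so a word of the username ends up in `client_ip`.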

andrecrt commented 8 years ago

@Tympanix thanks for finding the bug! I was going nuts trying to understand why some requests weren't appearing in Influx. With that knowledge, I updated my logparser config to skip parsing both ident and auth (see the first two %{DATA}), and now all requests seem to be logged properly!

[[inputs.logparser]]
  ## files to tail.
  files = ["/var/log/nginx/access.log"]
  ## Read file from beginning.
  from_beginning = true
  ## Override the default measurement name, which would be "logparser_grok"
  name_override = "nginx_access_log"
  ## For parsing logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{CUSTOM_LOG}"]
    custom_patterns = '''
      CUSTOM_LOG %{CLIENT:client_ip} %{DATA} %{DATA} \[%{HTTPDATE:ts:ts-httpd}\] "(?:%{WORD:verb:tag} %{NOTSPACE:request}(?: HTTP/%{NUMBER:http_version:float})?|%{DATA})" %{NUMBER:resp_code:tag} (?:%{NUMBER:resp_bytes:int}|-)
    '''

tympanix commented 8 years ago

Great solution. I've done something similar by adding digits (0-9) and spaces to the NGUSER pattern to overcome this issue. There is a potential problem when both the ident and auth contain multiple words, though: you wouldn't be able to tell them apart. I've never seen this in practice.
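
To illustrate the ambiguity, here is a minimal Python sketch. `USER_WIDE` is a hypothetical character class with digits and spaces added; it is an assumption for illustration, not the actual NGUSER definition. Once spaces are allowed in both fields, the regex engine has to guess where ident ends and auth begins.

```python
import re

# Hypothetical widened username class: letters, digits, a few symbols, spaces.
# Assumption for illustration only, not Telegraf's NGUSER pattern.
USER_WIDE = r"[a-zA-Z0-9.@\-+_% ]+"

# Simplified log prefix in which both ident and auth may contain spaces.
LINE_RE = re.compile(
    r"^(?P<client_ip>\S+) "
    rf"(?P<ident>{USER_WIDE}) "
    rf"(?P<auth>{USER_WIDE}) "
    r"\["
)

line = '127.0.0.1 - my username here [25/Sep/2016:00:17:36 +0200] "GET / HTTP/1.1" 401 590'

m = LINE_RE.match(line)
# Greedy matching gives ident as many words as it can and leaves auth the rest.
print(repr(m.group("ident")), repr(m.group("auth")))
```

The line now matches, but the split is arbitrary: the greedy ident group takes the dash plus the first two words, and only the last word is left for auth.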

sparrc commented 8 years ago

@Tympanix are http ident and auth allowed to have spaces in them? I'm not sure there's anything we can do if so. I will definitely fix the case of numbers in the ident & auth for release 1.1.

tympanix commented 8 years ago

I have tested this on my own nginx server, and seemingly the http basic module does not complain when using spaces. The following is logged when using "my username here" as the username:

127.0.0.1 - my username here [25/Sep/2016:00:17:36 +0200] "GET / HTTP/1.1" 401 590 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"

I would point out that this is an edge case. Also the password (which I assume to be the first dash) doesn't show regardless of the input. If that is always the case then we should be able to parse the log unambiguously. I don't know if this is related to apache as well.
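
A minimal Python sketch of that idea, assuming the field before the username is always a literal dash. The regex and the `parse_prefix` helper are hypothetical and only cover the start of the line; a username that itself contains " [" would still break it.

```python
import re

# Assumption: the second field is always a literal "-", so everything between
# it and the opening "[" of the timestamp is the username.
LINE_RE = re.compile(r"^(?P<client_ip>\S+) - (?P<auth>.+?) \[(?P<ts>[^\]]+)\]")


def parse_prefix(line):
    """Return client_ip, auth and ts as a dict, or None if the line does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


with_spaces = '127.0.0.1 - my username here [25/Sep/2016:00:17:36 +0200] "GET / HTTP/1.1" 401 590'
with_digits = '127.0.0.1 - username123 [25/Sep/2016:00:19:43 +0200] "GET / HTTP/1.1" 401 590'

print(parse_prefix(with_spaces))
print(parse_prefix(with_digits))
```

With the dash fixed, the username is captured whole in both cases, whether it contains spaces or digits.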

Thank you for the commit 👍