influxdata / go-syslog

Blazing fast syslog parser
MIT License

RFC3164: square brackets in the message screw parsing of ProcID #31

Open dimonomid opened 4 years ago

dimonomid commented 4 years ago

Parser created as follows:

        p := rfc3164.NewParser(
            rfc3164.WithYear(rfc3164.Year{YYYY: 2020}),
            rfc3164.WithRFC3339(),
        )

Example input:

<0>Mar  1 09:38:48 myhost myapp[12345]: foo [bar] baz

Output (take a look at ProcID):

{
  "Facility": 0,
  "Severity": 0,
  "Priority": 0,
  "Timestamp": "2020-03-01T09:38:48Z",
  "Hostname": "myhost",
  "Appname": "myapp",
  "ProcID": "foo [bar",
  "MsgID": null,
  "Message": "foo [bar] baz"
}

russorat commented 4 years ago

@dimonomid thanks again!

z3bra commented 1 year ago

I'm also hitting a variant of this issue (within Telegraf) with the following message:

Aug  2 11:54:06 hostname e Random message

Which is parsed by telegraf as:

{
  "name": "syslog",
  "fields": {
    "facility_code": 3,
    "message": "Random message",
    "procid": "servicename[48737",
    "severity_code": 5,
    "timestamp": 1690979105000000000
  },
  "tags": {
    "appname": "catsitd",
    "facility": "daemon",
    "host": "localhost",
    "hostname": "myhost",
    "severity": "notice",
    "source": "X.X.X.X"
  },
  "timestamp": 1690971905
}

Notice how the message is truncated: it should be servicename[48737]: Random message. Either the procid pattern is too greedy (unlikely, since the match would then still include the original PID), or the parsing code takes the rightmost match as the ProcID.

The relevant code seems to be in rfc3164/machine.go.rl:

# The first not alphanumeric character starts the content (usually containing a PID) part of the message part
contentval = !alnum @err(err_contentstart) >mark print* %set_content @err(err_content);

content = '[' contentval ']'; # todo(leodido) > support ':' and ' ' too. Also they have to match?

Here, contentval (print*) seems to match any printable character, including separators and further brackets, so it can extend past the first [ ] pair. I'm not familiar with this lexer, though, so I have no idea how to fix it.
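
A minimal sketch of that suspicion, outside of Ragel and using only the Go standard library. This is not go-syslog code: greedyPID, boundedPID and firstGroup are hypothetical stand-ins, one pattern allowing any printable character between the brackets (similar in spirit to print*), the other excluding ']'. It will not reproduce the exact ProcID values reported above, since those also depend on where the Ragel mark is placed, but it shows how a value pattern that accepts ']' runs on to the last closing bracket.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stand-ins for the Ragel rule, not taken from go-syslog.
// greedyPID accepts any printable ASCII character as part of the value, ']' included.
var greedyPID = regexp.MustCompile(`\[([[:print:]]*)\]`)

// boundedPID excludes ']' from the value, so it stops at the first closing bracket.
var boundedPID = regexp.MustCompile(`\[([^\]]*)\]`)

// firstGroup returns the first capture group of re in s, or "" if nothing matches.
func firstGroup(re *regexp.Regexp, s string) string {
	m := re.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	in := "myapp[12345]: foo [bar] baz"
	fmt.Printf("greedy:  %q\n", firstGroup(greedyPID, in))  // runs to the last ']'
	fmt.Printf("bounded: %q\n", firstGroup(boundedPID, in)) // stops at the first ']'
}
```

With the input from the original report, the greedy pattern should capture everything up to the last ']', while the bounded one yields only 12345.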

z3bra commented 1 year ago

I've quickly read up on Ragel state machine parsing and came up with a fix that doesn't break any existing tests. I've also added a test case based on my previous example to confirm it works.

In the code, the content field is apparently only used for parsing the procid. You can tell from the rule content = '[' contentval ']', which limits matching to what's inside the square brackets. RFC 3164 states that the CONTENT is everything that immediately follows the TAG, so the ProcID is only a small part of the CONTENT, which had me confused when I first read this parser.
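
To make the TAG / CONTENT distinction concrete, here is a rough Go sketch of the split as RFC 3164 describes it: the TAG is a run of at most 32 alphanumeric characters, and the first non-alphanumeric character starts the CONTENT. splitMsg and isAlnum are hypothetical helpers written for illustration, not part of go-syslog and not the fix described above. The PID is treated as an optional bracketed value at the very start of the CONTENT, closed by the first ']'.

```go
package main

import (
	"fmt"
	"strings"
)

// isAlnum reports whether b is an ASCII letter or digit.
func isAlnum(b byte) bool {
	return (b >= '0' && b <= '9') || (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// splitMsg is a hypothetical helper, not part of go-syslog. It splits the MSG
// part of an RFC 3164 line into TAG and CONTENT. If CONTENT starts with
// "[...]", the bracketed value is also returned as the PID.
func splitMsg(msg string) (tag, pid, content string) {
	i := 0
	// TAG: alphanumeric characters only, at most 32 of them.
	for i < len(msg) && i < 32 && isAlnum(msg[i]) {
		i++
	}
	tag, content = msg[:i], msg[i:]
	if strings.HasPrefix(content, "[") {
		// Stop at the first ']' so later brackets in the text are left alone.
		if end := strings.IndexByte(content, ']'); end > 0 {
			pid = content[1:end]
		}
	}
	return tag, pid, content
}

func main() {
	tag, pid, content := splitMsg("myapp[12345]: foo [bar] baz")
	fmt.Printf("tag=%q pid=%q content=%q\n", tag, pid, content)
}
```

For the input from the original report this gives myapp as the TAG and 12345 as the PID, while the CONTENT keeps the full "[12345]: foo [bar] baz" text, later brackets included.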