Open dimonomid opened 4 years ago
@dimonomid thanks again!
I also encounter a variant of this issue (within telegraf) with the following message:
Aug 2 11:54:06 hostname e Random message
Which is parsed by telegraf as:
{
  "name": "syslog",
  "fields": {
    "facility_code": 3,
    "message": "Random message",
    "procid": "servicename[48737",
    "severity_code": 5,
    "timestamp": 1690979105000000000
  },
  "tags": {
    "appname": "catsitd",
    "facility": "daemon",
    "host": "localhost",
    "hostname": "myhost",
    "severity": "notice",
    "source": "X.X.X.X"
  },
  "timestamp": 1690971905
}
Notice how the message is trimmed. It should be servicename[48737]: Random message. Either the procid field pattern is too greedy (unlikely, as it would still include the original PID otherwise), or the parsing code takes the rightmost match as the ProcID.
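To make the "too greedy" hypothesis concrete, here is a small Python sketch that uses regular expressions as a stand-in for the Ragel machine. It is not the go-syslog grammar, and the input line is hypothetical: it assumes the real message had a PID right after the appname, followed by a second bracketed token. Under that assumption, a greedy pattern would capture everything up to the last closing bracket, original PID included, which is not what the telegraf output above shows.

```python
import re

# Hypothetical MSG part (an assumption, not taken from the real log):
# appname with its own PID, then a second bracketed token in the text.
msg = "catsitd[123]: servicename[48737]: Random message"

# Greedy: ".*" runs up to the LAST closing bracket in the line.
greedy = re.match(r"(\w+)\[(.*)\]", msg)

# Bounded: "[^\]]*" stops at the FIRST closing bracket.
bounded = re.match(r"(\w+)\[([^\]]*)\]", msg)

greedy_procid = greedy.group(2)
bounded_procid = bounded.group(2)

print(greedy_procid)
print(bounded_procid)
```

The greedy capture still begins with the original PID, so a purely greedy procid pattern would not by itself produce a value like servicename[48737.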
The relevant code seems to be in rfc3164/machine.go.rl:
# The first not alphanumeric character starts the content (usually containing a PID) part of the message part
contentval = !alnum @err(err_contentstart) >mark print* %set_content @err(err_content);
content = '[' contentval ']'; # todo(leodido) > support ':' and ' ' too. Also they have to match?
Here, contentval seems to also match any separator that would fit between the [ ] pair. I'm not familiar with this lexer though, so I have no idea how to fix it.
I've since read up on Ragel state machine parsing and came up with a solution that doesn't break any existing test. I've also added a test case based on my previous example to confirm that it works.
In the code, the content field is apparently only used for parsing the procid. You can tell from the rule content = '[' contentval ']', which limits matching to what's inside the square brackets. RFC 3164 states that the CONTENT should be everything that immediately follows the TAG. In this case the ProcID is only a small part of the CONTENT, which had me confused when I first read this parser.
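As a rough illustration of that distinction, the following Python sketch splits an RFC 3164 MSG part into TAG, ProcID, and CONTENT. It is not the fix applied to the Ragel grammar. The function name split_msg and the sample input are hypothetical, and the 32-character TAG limit from the RFC is not enforced here.

```python
import re

def split_msg(msg):
    """Split an RFC 3164 MSG part into (tag, procid, content)."""
    # TAG is the leading run of alphanumerics; the first
    # non-alphanumeric character is where CONTENT starts.
    m = re.match(r"[A-Za-z0-9]*", msg)
    tag, content = msg[:m.end()], msg[m.end():]

    # ProcID is only a "[...]" group at the very start of CONTENT,
    # closed by the FIRST "]" so later brackets are left alone.
    p = re.match(r"\[([^\]]*)\]", content)
    procid = p.group(1) if p else None
    return tag, procid, content

# Hypothetical input with a second bracketed token in the message text.
tag, procid, content = split_msg("catsitd[123]: servicename[48737]: Random message")
print(tag, procid, content, sep=" | ")

# A message without a PID: CONTENT starts at the colon.
tag2, procid2, content2 = split_msg("su: 'su root' failed")
print(tag2, procid2, content2, sep=" | ")
```

In this sketch the ProcID is derived from CONTENT rather than replacing it, so the rest of the message, including any later brackets, stays intact.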
Parser created as follows:
Example input:
Output (take a look at ProcID):