rsyslog / liblognorm

a fast samples-based log normalization library
http://www.liblognorm.com
GNU Lesser General Public License v2.1
99 stars 64 forks source link

CEF parser issues #59

Open marnao opened 9 years ago

marnao commented 9 years ago

First and foremost thank you for taking the time to develop this feature. This will make parsing CEF a whole lot more efficient for us. I love where this project is going as a whole and appreciate the responsiveness.

I'm still experimenting with the new CEF parser but I wanted to report a couple issues I've seen so far.

1) This one is relatively straight forward. The first character after the last pipe in the mandatory fields is ignored. In your example you have a space there, so it was immediately obvious why it was happening on our logs.

$ echo "CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3" | lognormalizer -e json -r cef.db | jq .cef.Extensions { "a": "field1", "bb": "this is a value", "cc": "field 3" }

2) This one is a little strange to me. We usually put our syslog template as a prefix for all of our rules. It took me a while to narrow down why the CEF parser wasn't working here, but I narrowed it down to the date field. Example below:

This fails with our standard prefix: $ cat cef.db prefix=%facility:word% %priority:char-to:\x3A%: %date:date-rfc5424% %sensor_host:word% %tag:char-to:\x3A%:\x20 rule=:%cef:cef%

$ echo "local0 info: 2015-05-05T03:31:04.364405+00:00 sysloghost syslogtag: CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3" | lognormalizer -e json -r cef.db | jq . { "originalmsg": "local0 info: 2015-05-05T03:31:04.364405+00:00 sysloghost syslogtag: CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3", "unparsed-data": "CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3" }

It works fine when I use rest instead of CEF for the type: $ echo "local0 info: 2015-05-05T03:31:04.364405+00:00 sysloghost syslogtag: CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3" | lognormalizer -e json -r cef.db | jq . { "cef": "CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3", "tag": "syslogtag", "sensor_host": "sysloghost", "date": "2015-05-05T03:31:04.364405+00:00", "priority": "info", "facility": "local0" }

Removing the date from the parser and the input resolves the issue. Note that I tried changing date to another type like "word" and it had no effect.

$ cat cef.db prefix=%facility:word% %priority:char-to:\x3A%: %sysloghost:word% %syslogtag:char-to:\x3A%:\x20 rule=:%cef:cef%

$ echo "local0 info: sysloghost syslogtag: CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3" | lognormalizer -e json -r cef.db | jq . { "cef": { "DeviceVendor": "Vendor", "DeviceProduct": "Product", "DeviceVersion": "Version", "SignatureID": "Signature ID", "Name": "some name", "Severity": "Severity", "Extensions": { "a": "field1", "bb": "this is a value", "cc": "field 3" } }, "syslogtag": "syslogtag", "sysloghost": "sysloghost", "priority": "info", "facility": "local0" }

davidelang commented 9 years ago

when running into issues like this, run it with -v to get details

In this case, I see:

liblognorm: 46:trying parser for field 'sensor_host': 0x7fda6332e870 liblognorm: parser returns 0, parsed 10 liblognorm: potential hit, trying subtree liblognorm: 56: prefix compare ' ', ' ' liblognorm: 57: prefix compare succeeded, still valid liblognorm: 57:trying parser for field 'tag': 0x7fda6332ea00 liblognorm: parser returns 0, parsed 9 liblognorm: potential hit, trying subtree liblognorm: 66: prefix compare ':', ':' liblognorm: 67: prefix compare ' ', ' ' liblognorm: 68: prefix compare succeeded, still valid liblognorm: 68:trying parser for field 'cef': 0x7fda63331870 liblognorm: parser returns 0, parsed 18446744073709551554 liblognorm: potential hit, trying subtree liblognorm: 6: prefix compare succeeded, still valid liblognorm: 6 no field, trying subtree char ' ': (nil) liblognorm: 6 returns 163

it looks like the parser is running off the end of the string ("parsed 18446744073709551554") and then for some reason it's finding a space and comparing it to the end of the pattern

adding %rest:rest% to cef.db returns the following

echo "local0 info: 2015-05-05T03:31:04.364405+00:00 sysloghost syslogtag: CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3 " | /usr/lib/lognorm/lognormalizer -e json -r cef.db |jq . { "rest": " info: 2015-05-05T03:31:04.364405+00:00 sysloghost syslogtag: CEF:0|Vendor|Product|Version|Signature ID|some name|Severity|aa=field1 bb=this is a value cc=field 3 ", "cef": { "DeviceVendor": "Vendor", "DeviceProduct": "Product", "DeviceVersion": "Version", "SignatureID": "Signature ID", "Name": "some name", "Severity": "Severity", "Extensions": { "a": "field1", "bb": "this is a value", "cc": "field 3 " } }, "tag": "syslogtag", "sensor_host": "sysloghost", "date": "2015-05-05T03:31:04.364405+00:00", "priority": "info", "facility": "local0" }

so it looks like the cef parser is moving the wrong direction through the string at some point. from the lengths of things, it looks like it's moving a pointer in the wrong direction as it's parsing everything before the "Extensions" section (62 characters from CEF through | and the space before info is 63 characters back)

vdesabou commented 9 years ago

Hi,

I've noticed another parsing issue when the CEF message ends with a name which has not value (example: "test=")

./lognormalizer -r /home/vdesabou/cef.rb -e json < /home/vdesabou/test.log
{ "originalmsg": " CEF:0|DeviceVendor|DeviceProduct|DeviceVersion|SignatureID|Name|Severity| _cefVer=1 agentZoneURI=AgentZoneURI test=", "unparsed-data": "CEF:0|DeviceVendor|DeviceProduct|DeviceVersion|SignatureID|Name|Severity| _cefVer=1 agentZoneURI=AgentZoneURI test=" }
ontholerian commented 8 years ago

I have submitted a PR (#181) that addresses many of these issue in the v2 version of the parser engine.

@marnao: Your first point is fixed in my PR. There was an additional counter increase after the header was parsed, supposedly to jump the last pipe, but the pipes were already jumped in the header field parser. Your second point was already fixed sometime prior to my taking this thread on, maybe by @davidelang after his above posting.

@vdesabou: Your point is also fixed in my PR. There was a check for at least one char after the = sign, which doesn't exist if the last argument is empty. I removed that check, as checks for EOL were still in place and I believe still honored. This appears to have fixed this issue.

Additionally, I had a case where a field was double-quoted and contained equal signs within, fooling the CEF parser into thinking it came across a new arg prematurely. I added some more complex logic to grab all of the chars within two unescaped double-quotes as the value, and to not include the double-quotes in the JSON value result. I assume that any double-quotes only appear at the beginning and ending of the value, if this is not the case than that logic should be double-checked and fixed.

All: please review and verify the fixes.

janmejay commented 8 years ago

It'll be really helpful if you can add some tests to cover the behavior that is changed or fixed. It'll also ensure we don't accidentally break it again in the future, while functionality documenting in the PR the behavioral changes.

Regards, Janmejay

PS: Please blame the typos in this mail on my phone's uncivilized soft keyboard sporting it's not-so-smart-assist technology.

On Dec 2, 2015 7:22 AM, "ontholerian" notifications@github.com wrote:

I have submitted a PR (#181 https://github.com/rsyslog/liblognorm/pull/181) that addresses many of these issue in the v2 version of the parser engine.

@marnao https://github.com/marnao: Your first point is fixed in my PR. There was an additional counter increase after the header was parsed, supposedly to jump the last pipe, but the pipes were already jumped in the header field parser. Your second point was already fixed sometime prior to my taking this thread on, maybe by @davidelang https://github.com/davidelang after his above posting.

@vdesabou https://github.com/vdesabou: Your point is also fixed in my PR. There was a check for at least one char after the = sign, which doesn't exist if the last argument is empty. I removed that check, as checks for EOL were still in place and I believe still honored. This appears to have fixed this issue.

Additionally, I had a case where a field was double-quoted and contained equal signs within, fooling the CEF parser into thinking it came across a new arg prematurely. I added some more complex logic to grab all of the chars within two unescaped double-quotes as the value, and to not include the double-quotes in the JSON value result. I assume that any double-quotes only appear at the beginning and ending of the value, if this is not the case than that logic should be double-checked and fixed.

All: please review and verify the fixes.

— Reply to this email directly or view it on GitHub https://github.com/rsyslog/liblognorm/issues/59#issuecomment-161153927.

ontholerian commented 8 years ago

I am unfamiliar with your testing suite or how to use it, so I do not know how to add tests and make sure they work. Any pointers?

janmejay commented 8 years ago

Have a look at one of the existing tests. It's pretty straightforward. We setup a few rules in rulebase, then pass some input to be parsed, and assert upon result produced.

New tests can be added by introducing a new test definition under tests directory and adding ref to automake file(TESTS).

Regards, Janmejay

PS: Please blame the typos in this mail on my phone's uncivilized soft keyboard sporting it's not-so-smart-assist technology.

On Dec 2, 2015 7:22 PM, "ontholerian" notifications@github.com wrote:

I am unfamiliar with your testing suite or how to use it, so I do not know how to add tests and make sure they work. Any pointers?

— Reply to this email directly or view it on GitHub https://github.com/rsyslog/liblognorm/issues/59#issuecomment-161297501.

ontholerian commented 8 years ago

I have since learned that quoted fields is not a part of CEF, but something one of our applications was using. Therefore, I have removed that handling from the parser.

@janmejay: I still don't get how to run the testsuite (I haven't played too much in automake), but I have taken a try at adding a test to the existing CEF parser test script.

janmejay commented 8 years ago

If you run 'make check', all tests should run. Thanks for putting in the additional effort to do this, tests really help.

Regards, Janmejay

PS: Please blame the typos in this mail on my phone's uncivilized soft keyboard sporting it's not-so-smart-assist technology.

On Dec 4, 2015 7:19 AM, "ontholerian" notifications@github.com wrote:

I have since learned that quoted fields is not a part of CEF, but something one of our applications was using. Therefore, I have removed that handling from the parser.

@janmejay https://github.com/janmejay: I still don't get how to run the testsuite (I haven't played too much in automake), but I have taken a try at adding a test to the existing CEF parser test script.

— Reply to this email directly or view it on GitHub https://github.com/rsyslog/liblognorm/issues/59#issuecomment-161848169.