thekrakken / java-grok

Simple API that allows you to easily parse logs and other files
http://grok.nflabs.com/
Other
358 stars 152 forks source link

Dropped support for named group captures with underscore #108

Open kmerz opened 5 years ago

kmerz commented 5 years ago

With #53 we lost the ability to have named group captures with underscore like (?<test_field>test).

java-grok had the support as long as we used the com.google.code.regexp.Pattern. Now with java.util.regex.Pattern we use the java regex engine which does not support underscores:

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#groupname

A capturing group can also be assigned a "name", a named-capturing group, and then be back- referenced later by the "name". Group names are composed of the following characters. The first character must be a letter. The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'), The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'), The digits '0' through '9' ('\u0030' through '\u0039'),

This broke backward compatibility with already stored patterns.

Preferable fix was to bring back com.google.code.regexp.Pattern.

kmerz commented 5 years ago

Here a diff you can apply which tests the missing underscore support:

diff --git a/src/main/resources/patterns/patterns b/src/main/resources/patterns/patterns
index 52235d8..2eb7233 100644
--- a/src/main/resources/patterns/patterns
+++ b/src/main/resources/patterns/patterns
@@ -106,3 +106,6 @@ COMMONAPACHELOG_DATATYPED %{IPORHOST:clientip} %{USER:ident;boolean} %{USER:auth

 # Log Levels
 LOGLEVEL ([A|a]lert|ALERT|[T|t]race|TRACE|[D|d]ebug|DEBUG|[N|n]otice|NOTICE|[I|i]nfo|INFO|[W|w]arn?(?:ing)?|WARN?(?:ING)?|[E|e]rr?(?:or)?|ERR?(?:OR)?|[C|c]rit?(?:ical)?|CRIT?(?:ICAL)?|[F|f]atal|FATAL|[S|s]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
+
+# NamedGroup with underscore
+NAMEDGROUPWITHUNDERSCORE (?<test_field>test)
diff --git a/src/test/java/io/krakens/grok/api/GrokTest.java b/src/test/java/io/krakens/grok/api/GrokTest.java
index de4d714..3bff5a8 100644
--- a/src/test/java/io/krakens/grok/api/GrokTest.java
+++ b/src/test/java/io/krakens/grok/api/GrokTest.java
@@ -672,4 +672,13 @@ public class GrokTest {
     instant = (Instant) grok.match(dateWithTimeZone).capture().get("timestamp");
     assertEquals(ZonedDateTime.parse(dateWithTimeZone, dtf.withZone(ZoneOffset.ofHours(8))).toInstant(), instant);
   }
+
+  @Test
+  public void testNamedGroupWithUnderscore() {
+    String grokPatternName = "NAMEDGROUPWITHUNDERSCORE";
+    String testString = "test";
+    Grok grok = compiler.compile("%{" + grokPatternName + "}");
+    String result = (String) grok.match(testString).capture().get(grokPatternName);
+    assertEquals("test", result);
+  }
 }
ottobackwards commented 5 years ago

The fact that this broke _ support was known at the time of the change and referenced in #53, so this was on purpose and accepted as far as I can tell. I don't think you have to convince anyone this is an issue.

What you would have to convince people of is that the original PR ( 2016?) was not worth dropping this support.

kmerz commented 5 years ago

For us in Graylog it is simply the fact that we have users with Grok Patterns which do contain named groups with underscores. And dropping the support means a lot of work for the users, manually rewriting there Grok Patterns.

ottobackwards commented 5 years ago

So, from what I can tell that support was lost because the library that provides it is for java 5/6, so it just can't be 'added back'. Maybe you can submit a PR? But there might be other reasons.....

@anthonycorbacho ?

kmerz commented 5 years ago

I would have a PR ready when you want. But as I read in #53 performance with named-regexp was an issue?

ottobackwards commented 5 years ago

So we really need @anthonycorbacho to comment. I was not involved at that time, and I am not a committer now, although I have submitted a couple of PR etc. Switching the regex engine, as I believe this change did was and is a big deal, and as I said they knew they were making this breaking change and accepted it.

I don't think there is any going back.

I would imagine if you wrote something that say did mapping between and some allowable character on the fly ( so that the caller can use but inside it is replaced ) that would be the way to go.

But @anthonycorbacho leads the project.