microsoft / Kusto-Query-Language

Kusto Query Language is a simple and productive language for querying Big Data.

`parse` regex mode failing on nested capture group patterns #144

Open jared-koiter opened 2 weeks ago

jared-koiter commented 2 weeks ago

I am trying to write a KQL query that parses raw log data into columns for an Azure Log Analytics workspace table. However, I cannot get the parse operator in regex mode to handle the nested capture groups I am trying to use.

The raw logs I am working with have a few optional fields, so the output format is not always consistent. Here are two example lines:

 a=avalue,c={x=xvalue,y=yvalue,z=zvalue}
 a=avalue,b=bvalue,c={x=xvalue,y=yvalue,z=zvalue}

My goal is to write a KQL query that will parse this example data with a regex into a table something like this:

acol    bcol    ccol
avalue          {x=xvalue,y=yvalue,z=zvalue}
avalue  bvalue  {x=xvalue,y=yvalue,z=zvalue}

The regex I wanted to try and use was the following:

^a=(?P<acol>.*?)(?:,b=(?P<bcol>.*?))?,c=(?P<ccol>.*?)$
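
As a quick check of the pattern above on the two example lines, here is a minimal sketch using Python's `re` module. Python is only a stand-in engine here (an assumption for illustration; it is not RE2 and not the parse operator), and the `parse_line` helper is a hypothetical name.

```python
import re

# The pattern from the post, unchanged. Python's `re` accepts the same
# (?P<name>...) named-group syntax.
PATTERN = re.compile(r"^a=(?P<acol>.*?)(?:,b=(?P<bcol>.*?))?,c=(?P<ccol>.*?)$")

# The two example log lines from the post.
LINES = [
    "a=avalue,c={x=xvalue,y=yvalue,z=zvalue}",
    "a=avalue,b=bvalue,c={x=xvalue,y=yvalue,z=zvalue}",
]


def parse_line(line):
    """Return the acol/bcol/ccol captures, or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # default="" reports the unmatched optional bcol group as an empty string.
    return match.groupdict(default="")


rows = [parse_line(line) for line in LINES]
for row in rows:
    print(row)
```

In this engine, the first line yields an empty `bcol` and the second yields `bvalue`, which matches the target table.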

This regex works fine for me when I use it in other regex engines, and even appears to work fine with the RE2 engine when used outside of the parse operator. I tried to convert the regex into the format expected by the parse operator with the following full query:

let example = datatable(input: string) [
    'a=avalue,c={x=xvalue,y=yvalue,z=zvalue}',
    'a=avalue,b=bvalue,c={x=xvalue,y=yvalue,z=zvalue}'
];
example
| parse kind = regex flags = U input with "^a=" acol: string "(?:,b=" bcol: string ")?,c=" ccol: string "$";

When I run this query in my test Log Analytics workspace, I get the following error:

parse: failed to analyze the pattern: Invalid regex pattern: "(?:,b="
Request id: <guid>

If I replace the parse line with

| parse kind = regex flags = U input with "^a=" acol: string "(?:,b=(?P<bcol>.*?))?,c=" ccol: string "$";

then the query runs, but it does not extract the bcol value that I need.

It's possible I'm misunderstanding how the parse operator translates the query into a full regex, but as far as I can tell this query is valid syntax. It looks as though the translator expects each individual string fragment in the query to be a valid regex in its own right, even though the fragments only make sense once they are concatenated. I'm not sure whether this is a bug in the parse operator itself or in its implementation within the Log Analytics workspace, and I'm hoping to get that clarified by opening this ticket. Thank you to anyone who can shed some light on what is going wrong in my example query.

mattwar commented 2 weeks ago

It is validating each fragment separately as a complete regex, which it should not be doing. Someone is working on it.
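
The failure mode described above can be sketched outside Kusto. The example below uses Python's `re` module as a stand-in and assumes that each string column is replaced by a lazy capture group `(.*?)` when the fragments are joined. The actual translation and validation code of the parse operator is not shown in this thread, so `COLUMN_GROUP` and `is_valid_regex` are hypothetical names for illustration only.

```python
import re

# Regex fragments exactly as they appear between the column declarations
# in the failing parse statement.
FRAGMENTS = ["^a=", "(?:,b=", ")?,c=", "$"]

# Assumption: each string column is stood in for by a lazy capture group.
COLUMN_GROUP = "(.*?)"


def is_valid_regex(pattern):
    """Return True if the pattern compiles on its own."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Validating fragment by fragment rejects the two halves of the optional group.
invalid_alone = [f for f in FRAGMENTS if not is_valid_regex(f)]

# Validating the concatenated pattern succeeds, because the parentheses
# opened in one fragment are closed in the next.
combined = COLUMN_GROUP.join(FRAGMENTS)
combined_ok = is_valid_regex(combined)

print(invalid_alone)  # ['(?:,b=', ')?,c=']
print(combined)       # ^a=(.*?)(?:,b=(.*?))?,c=(.*?)$
print(combined_ok)    # True
```

Under these assumptions, only the per-fragment check fails. The joined pattern is well-formed, which is consistent with validating the whole regex once after concatenation.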