Discuss: Syntax of calculation expression

guyboertje commented 6 years ago

I am collecting opinions on the syntax of the expression. Current: filter { math { calculate => [ [ "-", "[var1]", "[var2]", "[result]" ] ] } } Proposals:

filter { math { calculate => [ [ "[result]", "[var1]", "-", "[var2]" ] ] } } This switches the order to look more like written arithmetic.
filter { math { calculate => [ "[result] = [var1] - [var2]" ] } } This defines the expression as one string which is parsed into arrays as above.
filter { math { calculate => [ "[result1] = %{[var1]} - %{[var2]}", "[result1] = %{[result1]} / 100" ] } } This defines the expression as one string but uses interpolation and is parsed into arrays like the one above.

Currently the operands [var1] and [var2] must be fields but there is no reason why these can't be numeric values.

jsvd commented 6 years ago

minor correction, the current syntax uses words for the operation: math { calculate => [ [ "add", "a_field1", "a_field2", "a_target" ] ] }

because of this, I think I prefer the current syntax to the proposal 1), because the first element being the operation it gives a better sense of what is happening: "ok this is adding var1 to var2 and placing the result in var3"

my concern with the string expressions (2 and 3) is providing good feedback when the user tries something insane :)

guyboertje commented 6 years ago

The string should parse a regex during register (or better still a custom validator so it parses at config validation) I am ambivalent about the order of the multi element array version.

guyboertje commented 6 years ago

This sort of regex

/^(?<result>[^ =]+)\s*?=\s*?(?<arg1>\S+)\s*?(?<op>\+|-|\*{1,2}|\/|\%|fdiv)\s*?(?<arg2>\S+)\s*?$/

with results

Match 1
Full match  0-22  `[result1] = [var1] - 6`
Group `result`  0-9 `[result1]`
Group `arg1`  12-18 `[var1]`
Group `op`  19-20 `-`
Group `arg2`  21-22 `6`
Match 2
Full match  23-47 `[result1] = var1 ** var2`
Group `result`  23-32 `[result1]`
Group `arg1`  35-39 `var1`
Group `op`  40-42 `**`
Group `arg2`  43-47 `var2`
Match 3
Full match  48-71 `[result1] = var1 / var2`
Group `result`  48-57 `[result1]`
Group `arg1`  60-64 `var1`
Group `op`  65-66 `/`
Group `arg2`  67-71 `var2`
Match 4
Full match  72-100  `[result1] = [var1] fdiv var2`
Group `result`  72-81 `[result1]`
Group `arg1`  84-90 `[var1]`
Group `op`  91-95 `fdiv`
Group `arg2`  96-100  `var2`
Match 5
Full match  101-128 `[result1] = [var1] % [var2]`
Group `result`  101-110 `[result1]`
Group `arg1`  113-119 `[var1]`
Group `op`  120-121 `%`
Group `arg2`  122-128 `[var2]`
Match 6
Full match  129-151 `result1] = var1 * var2`
Group `result`  129-137 `result1]`
Group `arg1`  140-144 `var1`
Group `op`  145-146 `*`
Group `arg2`  147-151 `var2`
Match 7
Full match  152-171 `foo = [var1] + var2`
Group `result`  152-155 `foo`
Group `arg1`  158-164 `[var1]`
Group `op`  165-166 `+`
Group `arg2`  167-171 `var2`
Match 8
Full match  172-189 `foo = fdiv fdiv 5`
Group `result`  172-175 `foo`
Group `arg1`  178-182 `fdiv`
Group `op`  183-187 `fdiv`
Group `arg2`  188-189 `5`
Match 9
Full match  190-205 `bar  = 1 / foo `
Group `result`  190-193 `bar`
Group `arg1`  197-198 `1`
Group `op`  199-200 `/`
Group `arg2`  201-204 `foo`

jsvd commented 6 years ago

The string should parse a regex during register (or better still a custom validator so it parses at config validation)

Right, so any mistake made by the user would result in an error saying "invalid format for math operation", the regex failure will be very opaque. IMO I don't see a clear gain in UX in the proposals, but let's collect feedback from more folks 👍

guyboertje commented 6 years ago

We are discussing implementation details and not the merits of the syntaxes.

Better validation is possible. We can use successive regexes from left to right to better mine the errors to present to the user. This would happen once. For a string of bar = 1 / foo we

apply /^(?<result>[^ =]+)/
- if match extract match[:result] move to the end of the match, string is 1 / foo
- if no match fail validation target field not found, expected format C = a + b; got #{string}
apply ^\s*+=\s*+ extract and move or fail validation no equals sign found, expected format c = a + b; got #{string}
apply ^(?<arg1>\S+)\s*+(\+|-|\*{1,2}|\/|\%|fdiv)\s*+ extract and move or fail validation first operand not found, expected format c = A + b; got #{string}
apply ^\s*+(?<op>\+|-|\*{1,2}|\/|\%|fdiv)\s*+ extract and move or fail validation operator not found, expected format c = a OPERATOR b; got #{string} or invalid operator found, got #{operator}; expected one of #{OPERATORS.keys.join(', ')}
apply ^(?<arg2>\S+)\s*+$ extract or fail validation second operand not found, expected format c = a + B; got #{string}

guyboertje commented 6 years ago

For me no. 2 is easier to read.

jsvd commented 6 years ago

my concern about the implementation maps to a concern of UX; if we support "[result] = [var1] - [var2]" as a string, I think it creates an expectation that the expression is extremely flexible. I'd be tempted to try:

filter { math { calculate => [ "[result] = [var1] + [var2] + [var3]" ] } } filter { math { calculate => [ "[result] = [var1] - [var2] * 23" ] } } filter { math { calculate => [ "[result] = ([var1] - [var2]) / [var3]" ] } } filter { math { calculate => [ "[result] = [var1]^2" ] } }

and so on. If the plan is to support more complex arithmetic then we can go forward with this approach, but for the current functionality I still think the current syntax provides a reasonable UX, i.e I prefer:

math { calculate => [ [ "sub", "[var1]", "[var2]", "[result]" ] ] }
# to:
math { calculate => [ [ "[result]", "[var1]", "-", "[var2]" ] ] }

colinsurprenant commented 6 years ago

different idea: why not use an RPN compatible notation so complex math would be possible and the interpretation would be simpler to implement?

guyboertje commented 6 years ago

@jsvd

I'd be tempted to try...

However, your expectation of a flexible syntax will have come from seeing an example in, say, discuss or a SO post and not from reading the docs or an Elastic blog post. You would get a validation error and then have to read the docs or seek help on discuss.

But I do get your point. The current syntax reigns in the expectation better.

However, this completely sets expectations:

math {
  calculate => [ 
    {
      operator => "sub" # or "-"
      lhs_field => "[var1]", # or left_field
      rhs_field => "[var2]",
      target => "[result]" 
    },
    {
      target => "[result]"
      lhs_field => "[var3]"
      operator => "/" # or "div"
      rhs_literal => 100
    }
  ]
}

guyboertje commented 6 years ago

@colinsurprenant

😄 Interesting idea (I considered it, no really, I did) but I fear very few users understand RPN. [btw, the smiley is a ref to stuff we were talking about earlier] I for one, very much do - having a) fixed HP scientific calculators way back in the Dark Ages and b) learned some form of programming with Forth which is exclusively a RPN + stack based language. an example I presume...

filter { math { calculate => [ "[var1] [var2] - [var3] / '[result]' =" ] } }

means

put [var1] and [var2] on the stack
execute - operator by taking two elements off of the stack, subtracting them and put the result on the stack
put [var3] on the stack
execute / operator by taking two elements off of the stack, dividing them and put the result on the stack
put the literal string [result] on the stack
execute = by taking two elements off of the stack and setting the deeper as value to the shallower as field i.e.event.set(stack[0], stack[-1])

@jsvd & @colinsurprenant For me the crucial judgement call now is that, if we go with a collection based expression now, how hard will it be to migrate code and users to a string based expression (if and) when its asked for? Co-question, can (should) we support both syntaxes?

colinsurprenant commented 6 years ago

@guyboertje there is a slight learning curve but maybe the usage flexibility and implementation simplicity is worth it? maybe more like this?

filter { math { target => "[result]", calculate => "[var1] [var2] - [var3] /" } }

so whatever is at the top of the stack is put in target.

guyboertje commented 6 years ago

Variation the not RPN string expression: filter { math { target => "[result]", calculate => "[var1] - [var2]" } }

Complex RPN example (what is the calculation): filter { math { target => "[result]", calculate => "[var1] [var2] add [var3] [var4] subtract multiply" } } Would we support stack operations like swap? Getting the reciprocal of a calculated value. "1 [var1] [var2] add [var3] [var4] subtract multiply divide" vs "[var1] [var2] add [var3] [var4] subtract multiply 1 swap divide"

jsvd commented 6 years ago

To sum up my opinion:

I think RPN is hard for a human to parse, it was made for computers to evaluate expressions with a stack.
The infix notation w/ parenthesis support is the best if the plugin is going to support nested expressions and more complex math, e.g. "([field1]+[field2])*2-[field3]^2". Humans are used to reading this.
If we only support a single operation then I suggest the current prefix (polish) notation with an array and named operations.

colinsurprenant commented 6 years ago

I think RPN is hard for a human to parse, it was made for computers to evaluate expressions with a stack.

I digress but wanted to note that some of the best calculators, operated by humans, where HPs using RPN 😁

jakelandis commented 6 years ago

How about this syntax ?

filter { 
    math { 
        add =>  {
          "result1" => ["%{var1}", "%{var2}"]
                  "result2" => ["%{result1}", "%{var3}"]
        }
        subtract => {
          "result3 => ["%{result2}", 99]
        }

We could allow only 2 elements in the array, or just define the order of operations left to right. It's basically the original syntax moved around some.

webmat commented 6 years ago

My assumption here is that people are looking to do relatively straightforward math in their logging pipeline. Perhaps adjust a value by a certain factor, or derive a new value based on two fields. Therefore I don't think we need to support arbitrary complexity at all (they can use ruby for that). This is what guides my opinion below.

I'll start with those I'm not sure about:

I also think RPN will confuse people, and opens the door to arbitrarily complex expressions.
The single string approaches: dragons lie that way. As @jsvd mentioned, people will expect all kinds of arbitrary complexity to be supported. And we'll be stuck implementing a parser one way or another.
Current approach with the leading operator, inside the same array as the operands, I'm lukewarm but that may work. Especially if we cap the array to 3 values: operator & 2 operands. Otherwise people will expect arbitrary complexity and we need to implement a parser there too (operator precedence).

In any case, I like the idea of having target => "[field_name]" over having the target field name as part of the expression (array or string), whether or not it's interpolated. target is almost always how we define the destination field of a filter. Following this convention is a big plus here, IMO.

With all of this in mind, I think my favourite options are:

either go with @jakelandis' suggestion as is.
- It doesn't use the target convention, but the notation is relatively compact and straightforward.
- We can also have add, multiply and subtract supporting arbitrary length arrays. No need to cap them those to 2 operands.
If we want to stick with the target convention
- we can either tweak Jake's approach (but there target is a key in the hash, not a filter attribute, which may lead to confusion
- Or we can go with a strict 3 element arrays and a target attribute. Readable and straightforward.

guyboertje commented 6 years ago

@jakelandis I have been bitten before with using values, e.g. add and result1 as keys because they can't be defined twice. The simplicity of the current array of arrays is that it allows for left to right sequenced calculation where the later calcs use results from the earlier ones. In your example, doing a sequenced add divide add is not possible, however I do understand the calculation might be transformable to suit.

@webmat For the sequencing reasons given above, a single target setting might be problematic. It would mean that we need to introduce the idea of registers that hold temporary results. In my mind, the register is an array that is the same size as the calculation array where index 0 stores the result of calculation 0 and so on with the target field being filled with the value of the last element. Users refer to temporary results in other calculations using (I suggest) a $1 -> index 0 etc. Registers allow for a better error message, e.g. user refers to a register with a nil value as a opposed to (in using the event as a temporary store) a field with a nil value, any field could have a nil value, we would need to track field names used for temporary results in order to give the same kind of error message, "Register $3 is empty, did you mean $0, $1 or $2? These registers have values: 0, -9 and 42".

But this only applies when calculations are not independent. If you want to do several separate calculations on different fields then a target is needed for each separate calculation.

But these sequences and separates with registers can be catered for:

filter {
  math {
    calculate => [
      { 
        calculation => [
          ["+", "[var1]", "[var2]"], ["-", "[var3]", "[var4]"], ["*", "$1", "$2"]
        ],
        target => "[result1]" # <- register[-1]
      },
      {
        calculation => [
          ["-", "[ts_job_end]", "[ts_job_start]"]
        ],
        target => "[job_duration]" # <- register[-1]
      }
    ]
  }
}

guyboertje commented 6 years ago

OK I have definitely gone off the string expression style.

We could support a layout of prefix | infix | postfix for users preferred style of expression - it tells us where the operator is to be found.

guyboertje commented 6 years ago

Also, what is the consensus on enforcing bracket field references?

Doing so means we differentiate between "[$1]" being a field reference and "$1" being a register reference without ambiguity.

webmat commented 6 years ago

Once again, I'm all for conventions. I have been confused in the past with fields being referenced sometimes in brackets and sometimes not. I would go for the brackets here.

guyboertje commented 6 years ago

I also can't see any merit in supporting interpolation in operand strings e.g. ["+", "%{[var1]}", "%{[var2]}"]

colinsurprenant commented 6 years ago

I also can't see any merit in supporting interpolation in operand strings e.g. ["+", "%{[var1]}", "%{[var2]}"]

well, since values can only be numeric I don't see how using string interpolation would be useful? are string operands possible??

OTOH just want to point out that in general there is no enforcement for using bracketed field reference (for root fields obviously) so I believe it would be the only place where this is strictly required? I don't really see a problem with that but could be confusing to some users that won't catch the note about being it necessary.

guyboertje commented 6 years ago

To clarify, requiring brackets for field names is really only needed if we introduce registers.

However, if a naming convention for register addresses can be devised that is intuitive and has a low collision potential with a users event field names (which are usually descriptive and not cryptic) then we don't need to enforce bracketed field names.

Something like one of these R[1], R(9), R{0}, ]1[, $[1], $(1), ${1}, %[0], &[1], &(0) springs to mind. Basically, some prefix with an integer surrounded by brackets of some description. Perhaps a separator based approach works too , R|0, R\0, R^0, although these are not so easy on the eye. Another question, is zero based register index numbers intuitive to users that are not developers?

guyboertje commented 6 years ago

Conclusion:

I an going to stick with the current array based syntax but add registers.

logstash-plugins / logstash-filter-math

Discuss: Syntax of calculation expression #3