magnusbaeck / logstash-filter-verifier


Not possible to inject @metadata using fields section of config #142


jgough commented 3 years ago

Related to https://github.com/magnusbaeck/logstash-filter-verifier/issues/126

With the fix for the above now merged into master, I'm building from master and trying to inject nested data into @metadata that would normally be added by the s3 input plugin (which sets [@metadata][s3][key]).

If I try to inject @metadata, it does not get set correctly. For example, a test with:

fields:
  "[a][b]": "foo"
  "[@metadata][c][d]": "bar"
export_metadata: true

outputs:

{
  "a": {
    "b": "foo"
  },
  "d": "bar"
}

But I expected:

{
  "a": {
    "b": "foo"
  },
  "@metadata": {
    "c": {
      "d": "bar"
    }
  }
}

Or should I instead be looking to mock the input plugin?

breml commented 3 years ago

@jgough I will have a look at this. It looks resolvable, because passing @metadata in the testcase fields works for me when I use nested objects instead of Logstash field-reference syntax, like this:

{
  "export_metadata": true,
  "testcases": [
    {
      "fields": {
        "@metadata": {
          "key": "value"
        }
      }
    }
  ]
}
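For reference, the same workaround in the YAML format used in the original report would look like this (a sketch under the same testcase schema, not verified against LFV):

export_metadata: true
testcases:
  - fields:
      "@metadata":
        key: value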
breml commented 3 years ago

I looked into this a little bit more and it looks like we have two issues:

  1. addressing @metadata fields with Logstash config syntax ([@metadata][field]) does not work.
  2. Nested objects in global fields and per-test-case fields are not merged.

For the first, I think I can come up with a fix. For the second, the question is whether we should support deep merging or not. If we decide not to support deep merging, then I think we just need to document this. Otherwise, another fix is necessary to support it.

@magnusbaeck What do you think about deep-merging?

jgough commented 3 years ago

May I add my thoughts on deep merging? Interesting problem.

I personally feel that there are use cases for setting [@metadata][field1] and [@metadata][field2] separately in the global and per-test fields. But anything set on a per-test basis should overwrite the global fields. I also feel that a top-level key in the testcase fields (whether in Logstash or JSON/YAML format) should completely overwrite anything from the global fields, since the intent can be read as setting specifically that key (with anything underneath it being data for that key).

So I feel the following would make most sense:

Case 1:

{
  "fields": {
    "[data][field1]": "foo"
  },
  "testcases": [
    {
      "fields": {
        "[data][field1]": "bar",
        "[data][field2]": "baz"
      }
    }
  ]
}

Should result in

"[data][field1]": "bar"
"[data][field2]": "baz"

Case 2:

{
  "fields": {
    "data": {
      "field1": "foo"
    }
  },
  "testcases": [
    {
      "fields": {
        "[data][field2]": "baz"
      }
    }
  ]
}

Result:

"[data][field1]": "foo"
"[data][field2]": "baz"

Case 3:

{
  "fields": {
    "[data][field1]": "foo"
  },
  "testcases": [
    {
      "fields": {
        "data": {
          "field2": "baz"
        }
      }
    }
  ]
}

Result:

"[data][field2]": "baz"

(data is overwritten entirely by testcase fields)

Case 4:

{
  "fields": {
    "data": {
      "field1": "foo"
    }
  },
  "testcases": [
    {
      "fields": {
        "data": {
          "field2": "baz"
        }
      }
    }
  ]
}

Result:

"[data][field2]": "baz"

(data is overwritten entirely by testcase fields)

magnusbaeck commented 3 years ago

There are legitimate use cases for both behaviors (deep merging and shallow merging) so we'll probably upset someone whatever we choose. The [field][subfield] syntax could result in deep merging while specifying a full object would overwrite the whole inherited value. That makes intuitive sense to me. Doing so should make everyone happy except that the two forms of specifying a field would no longer be equivalent. I suspect it would also be harder to implement. What do you think?

breml commented 3 years ago

OK, if I got this right, the proposal from @jgough is exactly what @magnusbaeck summarized in:

The [field][subfield] syntax could result in deep merging while specifying a full object would overwrite the whole inherited value.

I will have a look at the current implementation in order to figure out the effort needed to achieve this. Of course, this distinction between JSON objects and Logstash syntax would need to be prominently highlighted in the documentation.

magnusbaeck commented 3 years ago

Yes, you're right! I apparently didn't fully internalize the examples.

But yes, this behavior would indeed have to be well-documented. Technically it would also be a backwards-incompatible change, so if we don't get it in for v2 we'd have to wait for v3.

breml commented 3 years ago

I think the generalized algorithm for this would go like this (a rough Go sketch follows the list):

  1. Start with an empty target object.
  2. For all "non Logstash syntax" keys in the global fields object, set the key in the target object to the respective value.
  3. For all "Logstash syntax" keys in the global fields object, evaluate the field reference and replace that key in the target object with the respective value.
  4. For all "non Logstash syntax" keys in the testcase fields object, set the key in the target object to the respective value.
  5. For all "Logstash syntax" keys in the testcase fields object, evaluate the field reference and replace that key in the target object with the respective value.
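For illustration, a minimal Go sketch of those five steps, assuming the fields objects have been decoded into map[string]interface{}. The helper names are hypothetical and this is not the actual verifier code, just a sketch of the specification above:

package main

import (
	"fmt"
	"strings"
)

// isFieldRef reports whether a key uses Logstash field-reference
// syntax, e.g. "[data][field1]".
func isFieldRef(key string) bool {
	return strings.HasPrefix(key, "[") && strings.HasSuffix(key, "]")
}

// splitFieldRef turns "[data][field1]" into ["data", "field1"].
func splitFieldRef(key string) []string {
	return strings.Split(strings.TrimSuffix(strings.TrimPrefix(key, "["), "]"), "][")
}

// setPath walks (and creates, where needed) nested maps in target and
// replaces the final path element with value (steps 3 and 5).
func setPath(target map[string]interface{}, path []string, value interface{}) {
	for _, p := range path[:len(path)-1] {
		child, ok := target[p].(map[string]interface{})
		if !ok {
			child = map[string]interface{}{}
			target[p] = child
		}
		target = child
	}
	target[path[len(path)-1]] = value
}

// mergeFields applies the global and testcase fields in the order
// described above: plain keys first (wholesale replace at the top
// level), then field-reference keys (deep set).
func mergeFields(global, testcase map[string]interface{}) map[string]interface{} {
	target := map[string]interface{}{} // step 1
	for _, fields := range []map[string]interface{}{global, testcase} {
		for k, v := range fields { // steps 2 and 4
			if !isFieldRef(k) {
				target[k] = v
			}
		}
		for k, v := range fields { // steps 3 and 5
			if isFieldRef(k) {
				setPath(target, splitFieldRef(k), v)
			}
		}
	}
	return target
}

func main() {
	// Case 1 from the discussion above.
	global := map[string]interface{}{"[data][field1]": "foo"}
	testcase := map[string]interface{}{
		"[data][field1]": "bar",
		"[data][field2]": "baz",
	}
	fmt.Println(mergeFields(global, testcase)) // map[data:map[field1:bar field2:baz]]
}

Note that a real implementation would also have to deep-copy the global values before mutating them (setPath can write into maps shared with the global fields object) and handle the edge cases discussed further below.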

I think this algorithm would handle all the test cases listed above by @jgough correctly. It is worth mentioning that a value can be a "basic value" (e.g. a string or a number) or again an object (or an array). So it would be possible to do something like this:

{
  "fields": {
    "data": {
      "field1": "global1",
      "parent1": {
        "child1": "globalchild1"
      },
      "parent2": {
        "child2": "globalchild2"
      }
    },
    "[data][field2]": "global2"
  },
  "testcases": [
    {
      "fields": {
        "[data][parent1]": "parent1",
        "[data][parent2][child3]": "child3",
        "[data][parent3]": {
          "child4": "child4"
        },
        "[data][field2]": "field2"
      }
    }
  ]
}

which would result in

"[data][field1]": "global1"
"[data][parent1]": "parent1"
"[data][parent2][child2]": "globalchild2"
"[data][parent2][child3]": "child3"
"[data][parent3][child4]": "child4"
"[data][field2]": "field2"

or again in JSON, the result would look like this:

"data": {
  "field1": "global1",
  "parent1": "parent1",
  "parent2": {
    "child2": "globalchild2",
    "child3": "child3"
  },
  "parent3": {
    "child4": "child4"
  },
  "field2": "field2"
}

What do you think?

breml commented 3 years ago

I did a first iteration to implement the above specification, but it is not as straightforward as I initially thought. There are a lot of edge cases which we did not yet discuss, e.g. what if Logstash syntax keys are nested inside regular objects?

For example:

{
  "fields": {
    "data": {
      "[field1]": "value",
      "[parent1][child]": "globalchild1"
    },
    "[data][field2]": "global2"
  },
  ...
}

Currently I have the feeling that this could end up in a bigger refactoring of the test case processing logic, and I see the risk that we would need to duplicate this code in order to preserve backwards compatibility with the standalone (V1) mode.

jgough commented 3 years ago

I've had some further thoughts on this. Maybe it would be best to keep it simple and forbid mixing the two formats? Use one or the other, but not both. I was unaware of the non-Logstash-syntax way of doing it until recently, and there's no reason I can see that anyone should need to mix them.
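A minimal sketch of what such a guard could look like, again in Go with hypothetical names (this is not an existing LFV API), rejecting a fields object whose top-level keys mix plain syntax with field-reference syntax:

package fields

import (
	"errors"
	"strings"
)

// ErrMixedSyntax is a hypothetical error returned when a fields object
// mixes plain JSON/YAML keys with Logstash field-reference keys.
var ErrMixedSyntax = errors.New("fields: mixing plain keys and [field][subfield] keys is not supported")

// CheckNoMixedSyntax returns ErrMixedSyntax if the given fields object
// contains both kinds of top-level keys.
func CheckNoMixedSyntax(fields map[string]interface{}) error {
	var plain, refs bool
	for key := range fields {
		if strings.HasPrefix(key, "[") {
			refs = true
		} else {
			plain = true
		}
	}
	if plain && refs {
		return ErrMixedSyntax
	}
	return nil
}

Recursive checking of nested objects would additionally be needed if bracket-syntax keys inside regular objects (as in the edge case above) should be rejected as well.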