open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[pkg/ottl] RFC - Direct XML manipulation functions #35281

Open djaglowski opened 1 day ago

djaglowski commented 1 day ago

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

XML is frequently used in traditional logging frameworks, but within the collector and downstream tools it is often difficult to manipulate.

Before going further, I believe it would be helpful to define a term: "JSON-equivalent". Basically, data is JSON-equivalent if it can be losslessly converted to or from JSON (or YAML, or some other formats), as a plog.LogRecord's body or attributes can.

Notably, XML is not JSON-equivalent, at least not generally. However, it is possible to define a subset of XML which is JSON-equivalent, which we could call "JSON-equivalent XML". (More on this below.)

We currently have a ParseXML function, but because XML is not generally JSON-equivalent, it produces an encoding of XML rather than a direct representation. The encoding is necessarily JSON-equivalent, but it is overly verbose, and OTTL is not well suited to manipulating it in ways that respect the encoding. As a result, our current strategy for parsing XML has very limited value: users find the output difficult to work with in OTTL and in at least some backends.

Describe the solution you'd like

In order to better support XML, I believe we should provide the following:

  1. Some functions which work directly with XML documents, without forcing the user to work with a JSON-equivalent encoding of XML. These could lean on XML native technologies such as XPath to navigate and manipulate the documents directly.
  2. A recommended (or maybe automated) pathway for using these functions to migrate XML documents into a JSON-equivalent state.
  3. A new "JSON-equivalent XML" parser, which produces a much more useful output.

Example

Suppose we have the following XML document:

<Data foo="bar" hello="world">
    Some text
    <One>With text</One>
    <Two>
        <Three>3</Three>
        <Four>4</Four>
        <Three note="again">3</Three>
    </Two>
</Data>

In order to make this JSON-equivalent, we can't have both attributes and child elements. We also can't have raw values at the same level as child elements. A JSON-equivalent version might look something like this:

<Data>
    <foo>bar</foo>
    <hello>world</hello>
    <value>Some text</value>
    <One>With text</One>
    <Two>
        <Three>3</Three>
        <Four>4</Four>
        <Three>
            <note>again</note>
            <value>3</value>
        </Three>
    </Two>
</Data>

This can then be converted directly into a useful object:

Data:
  foo: bar
  hello: world
  value: Some text
  One: With text
  Two:
    - Three: 3
    - Four: 4
    - Three:
        note: again
        value: 3
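
As a rough illustration of that conversion, here is a minimal Python sketch using only the standard library. It is not the collector's implementation (pkg/ottl is written in Go), and the function names are hypothetical. It assumes the input has no attributes and no mixed content, keeps all values as strings, and falls back to a list of single-key maps when sibling tags repeat, which is one way to arrive at the shape shown above.

```python
import xml.etree.ElementTree as ET


def parse_simplified(elem):
    """Convert an element with no attributes and no mixed content to plain data."""
    children = list(elem)
    if not children:
        # Leaf element: its text is the value (kept as a string in this sketch).
        return (elem.text or "").strip()
    tags = [child.tag for child in children]
    if len(set(tags)) == len(tags):
        # Unique child tags map cleanly onto a dictionary.
        return {child.tag: parse_simplified(child) for child in children}
    # Repeated tags: preserve document order as a list of single-key maps.
    return [{child.tag: parse_simplified(child)} for child in children]


def parse_simplified_xml(xml_text):
    """Hypothetical stand-in for a JSON-equivalent XML parser."""
    root = ET.fromstring(xml_text)
    return {root.tag: parse_simplified(root)}


SIMPLIFIED = """
<Data>
    <foo>bar</foo>
    <value>Some text</value>
    <Two>
        <Three>3</Three>
        <Four>4</Four>
        <Three><note>again</note><value>3</value></Three>
    </Two>
</Data>
"""

result = parse_simplified_xml(SIMPLIFIED)
```

Whether repeated tags should become a list of maps, a map of lists, or something else is exactly the kind of subjective choice a real parser would have to settle.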

In order to accomplish this migration, we need some functionality:

Notably, there is a reasonable amount of subjectivity here. In the example there are two instances of the Three tag, but they end up in different formats because of the presence of an attribute on one of them. This may be problematic for the user and there are likely many similar situations. I believe a general solution will require offering a set of composable functions that allow the user to make their own decisions about how to manipulate the representation into a JSON-equivalent format that meets their needs.

Describe alternatives you've considered

No response

Additional context

No response

github-actions[bot] commented 1 day ago

Pinging code owners:

TylerHelmuth commented 1 day ago

A function that can manipulate the xml string in place seems useful. That feels simpler than doing:

- set(cache["xmlMap"], ParseXML(body))
... # manipulate cache["xmlMap"]
- set(body, MarshalXML(cache["xmlMap"]))

I am not experienced enough with XML to propose what kind of functions we'd need for that. Some OTTL guidelines that may be helpful when brainstorming ideas:

djaglowski commented 6 hours ago

Thanks for your thoughts on this @TylerHelmuth. I'm thinking we could mostly rely on Converters here. They would take a target parameter, which would need to be an XML-formatted string. Other parameters would be things like XPath strings, names of tags to create, etc. The cache could be useful if someone wants to work on a backup of the original value, but I think they could also just incrementally overwrite the target. It might help to add more detail to the above example.

Starting from the same xml document (and assuming this is the body):

set(body, DeleteXML(body, "*//@note")) takes an XPath parameter and deletes any "note" attributes

 <Data foo="bar" hello="world">
     Some text
     <One>With text</One>
     <Two>
         <Three>3</Three>
         <Four>4</Four>
-        <Three note="again">3</Three>
+        <Three>3</Three>
     </Two>
 </Data>
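
To make the intended behavior concrete, here is a small Python sketch of the attribute-deletion step. It is only an approximation: the standard library's ElementTree cannot select attributes with an XPath such as "*//@note", so this hypothetical helper takes an attribute name instead of a full XPath and walks every element. A real converter would presumably use a proper XPath engine.

```python
import xml.etree.ElementTree as ET


def delete_attribute(xml_text, name):
    """Return xml_text with every attribute called `name` removed.

    Hypothetical helper: takes an attribute name, not an XPath.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # iter() visits the root and all descendants
        # pop() with a default is a no-op when the attribute is absent.
        elem.attrib.pop(name, None)
    return ET.tostring(root, encoding="unicode")


doc = '<Data foo="bar"><Two><Three>3</Three><Three note="again">3</Three></Two></Data>'
cleaned = delete_attribute(doc, "note")
```

Attributes with other names, such as foo on the root element, are left untouched.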

set(body, ConvertXMLAttributes(body)) converts any remaining attributes into child elements.

-<Data foo="bar" hello="world">
+<Data>
+    <foo>bar</foo>
+    <hello>world</hello>
     Some text
     <One>With text</One>
     <Two>
         <Three>3</Three>
         <Four>4</Four>
         <Three>3</Three>
     </Two>
 </Data>
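
A possible shape for this step, sketched in Python with the standard library. The function name is hypothetical and this is not the proposed Go converter. The sketch assumes each attribute becomes a leading child element of the same name, and that the element's own text should follow the new children, as in the diff above.

```python
import xml.etree.ElementTree as ET


def convert_attributes(xml_text):
    """Turn every attribute into a leading child element of the same name."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first, since children are inserted while looping.
    for elem in list(root.iter()):
        if not elem.attrib:
            continue
        new_children = []
        for index, (name, value) in enumerate(list(elem.attrib.items())):
            child = ET.Element(name)
            child.text = value
            elem.insert(index, child)  # keep attribute order, ahead of existing children
            new_children.append(child)
        # Move the element's leading text so it follows the new children.
        new_children[-1].tail = elem.text
        elem.text = None
        elem.attrib.clear()
    return ET.tostring(root, encoding="unicode")


doc = '<Data foo="bar" hello="world">Some text<One>With text</One></Data>'
converted = convert_attributes(doc)
```

Note that the original text is now a floating value between child elements, which is what the next step addresses.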

set(body, WrapFloatingXMLValues(body, "value")) finds instances where values exist at the same level as elements, and wraps them in a tag with the specified name

 <Data>
     <foo>bar</foo>
     <hello>world</hello>
-    Some text
+    <value>Some text</value>
     <One>With text</One>
     <Two>
         <Three>3</Three>
         <Four>4</Four>
         <Three>3</Three>
     </Two>
 </Data>
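
One way this wrapping could work, again as a hedged Python sketch with hypothetical names rather than the actual converter. It assumes a "floating" value is any non-whitespace text that sits directly inside an element that also has child elements. Text in leaf elements is left alone.

```python
import xml.etree.ElementTree as ET


def _wrap(text, tag):
    """Build a new element `tag` holding the trimmed text."""
    wrapper = ET.Element(tag)
    wrapper.text = text.strip()
    return wrapper


def wrap_floating_values(xml_text, wrapper_tag):
    """Wrap text that sits beside child elements in a `wrapper_tag` element."""
    root = ET.fromstring(xml_text)
    for elem in list(root.iter()):
        children = list(elem)
        if not children:
            continue  # leaf elements keep their text as-is
        rebuilt = []
        if elem.text and elem.text.strip():
            # Text before the first child.
            rebuilt.append(_wrap(elem.text, wrapper_tag))
            elem.text = None
        for child in children:
            rebuilt.append(child)
            if child.tail and child.tail.strip():
                # Text between this child and the next sibling (or the end tag).
                rebuilt.append(_wrap(child.tail, wrapper_tag))
                child.tail = None
        elem[:] = rebuilt  # replace the child list in document order
    return ET.tostring(root, encoding="unicode")


doc = "<Data><foo>bar</foo>Some text<One>With text</One></Data>"
wrapped = wrap_floating_values(doc, "value")
```

If several separate text fragments float inside the same element, this sketch produces several wrapper elements with the same tag, so a real converter might need a rule for merging or numbering them.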

Then finally set(body, ParseSimplifiedXML(body)) just converts the simplified (JSON-equivalent) xml string into an attributes map.

If I'm not mistaken, the user could compose these inline, but it's not clear to me if there's much benefit to this. Personally I would just use separate statements:

set(body, ParseSimplifiedXML(WrapFloatingXMLValues(ConvertXMLAttributes(DeleteXML(body, "*//@note")), "value")))

Either way, I'm not necessarily proposing the exact Converters in this example, but I think these are pretty close to what we'd need in the short term. Just wanted to articulate better how I imagine the user would incrementally convert their xml into a JSON-equivalent format, and ultimately to a clean attributes map.

crobert-1 commented 2 hours ago

Removing the "needs triage" label, as a code owner has responded approving the idea.

djaglowski commented 1 hour ago

I've opened https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35301 with the first concrete direct-xml manipulation converter as described above. If this looks good, I'll add a few more in the coming days and start work on the JSON-equivalent XML parser.