Consistent config interface for filtering external datasets

dmitryax commented 1 year ago

Problem

We use many different filtering interfaces in the collector configuration where we need to provide users with an option to limit an external set of data. The interfaces differ pretty significantly, making it confusing for the end user. Moreover, developers may find it unclear which interface to adopt for their components moving forward.

Here we are talking only about filtering external sets, not pdata objects which are supposed to be fully handled by OTTL.

There are two main use cases:

1. Just filter external data sets prior to further processing.

This scenario arises when users need to limit the external dataset before the collector proceeds with specific processing.

1.1 List of glob file paths:

Example: pkg/stanza/fileconsumer/matcher.Include/Exclude taking slices of glob strings to match against files to watch

include:
  - /var/log/pods/*/*/*.log
exclude:
  - /var/log/pods/*/otel-collector/*.log

While effective for file paths, this method might not be universally applicable to other data sources.

1.2 Filtering by strict/regex rules provided explicitly

This method is widely employed in the filtering processor, soon to be substituted by OTTL. Additionally, it has been implemented in the hostmetrics receiver with this structure:

<include_devices|exclude_devices>:
  devices: [ <device name>, ... ]
  match_type: <strict|regexp>

It was introduced in https://github.com/open-telemetry/opentelemetry-collector/pull/522 not to break the interfaces that were already built by that time. This will be fully replaced by OTTL in the filter processor and by https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/25134 in hostmetrics receiver.

1.3 Dynamically deducting regexp/glob/strict type from the string

This approach is used in dockerstats receiver exclude_images

excluded_images:
  - undesired-container
  - /.*undesired.*/
  - another-*-container

Strings wrapped with / represent regex, strings prefixed with ! represent negation, glob strings are automatically deducted if any the *?[]{}! characters are found in the string.

This approach is the least verbose but requires adding some additional escaping rules, e.g. if we want to do a strict match against /some-string/ or !string. Also, the unclear distinction between glob and regular strings can cause confusion.

2. Convert external maps into a new set of attributes

This functionality is needed when users desire a configuration interface to map external maps into resource attributes. Examples include:

A subset of Kubernetes pod/namespace annotations and labels mapped to resource attributes
A subset of container labels and environment variables mapped to resource attributes

The filtering part of this is the same as in the use case (1), but we also need to provide a way for users to remap keys from the original map to the attribute keys. This cannot be done in the downstream component because different sources (like pod annotations and labels) can have the same key, so users need a way to avoid keys from one map being overridden by another map. The mapping part can also be solved as a separate configuration option, but it's nice to do it at once, especially when we need to map regex-mapped groups. For example, k8sattributes processor currently provides the following interface for this:

extract:
  labels:
    - key_regex: (service-.*)
      from: pod
      tag_name: "k8s.pod.labels.$1"

Goal

Resolving this issue does not require changing all the existing filtering interfaces. The goal is to establish a prescribed way to define the filtering interfaces for external datasets for any future use cases. This issue would also be a prerequisite for resolving the following ones:

Possible solutions

Option A

Keep using approach 1.3 "Dynamically deducting regexp/glob/strict type from the string".

Pros

It's concise. We can support both use cases (1) and (2), where, for the use case (1), we provide a config interface as a simple list:

filter:
  - exact-string
  - /service-.*/

For the use case (2), we cannot use a map because we need to preserve the order of the filtering/mapping rules evaluation, so we need to keep it as a list of structs with two fields (filter_and_map is just a placeholder):

filter_and_map:
  - source_key: source-string
    target_key: target-string
  - source_key: /service-(.*)/
    target_key: $1

Cons

Require a way to escape of wrapping / for regexp and the ! negation prefix. Potentially, we can avoid supporting the negation ! sign because it can always be handled by a regexp. Also, we can recommend using regex for strings wrapped with /'s, e.g. /\/...\//.
Unclear distinction between strict and glob modes, e.g. users can add a single ? sign as part of the string they want to match, but it'll match any character. Potentially we can remove glob expression support and keep only /regex/ and strict-string matching

Option B

Make every filtering item a struct that would include a matching type as a field name: strict, glob regexp. Setting more than one of them shouldn't pass validation. To support use case (2), we can simply add another optional field, target_key, for example (note that filter and filter_and_map are just placeholders, they can be anything else like include_container_values):

filter:
  - strict: my-service
  - regexp: service-v1.(.*)

filter_and_map:
  # match only "source-key" key and assign its value to the "target-key" key in the target map
  - strict: source-key
    target_key: target-key
  # identify keys that match the regular expression 'service-(.*)' and assign 
  # their corresponding values to keys named after the matched group names in the target map
  - regexp: service-(.*)
    target_key: $1
  # identify keys matching the regular expression 'app-.*' 
  # and set the corresponding value under the same keys in the target map
  - regexp: app-.*

Pros

Explicit about a matching method

Cons

More verbose than Option A

Option C

Utilize OTTLP for the external datasets additionally to OTLP data. This would require introducing additional identifiers for the external items. It needs more discussion to determine its pros and cons

TylerHelmuth commented 1 year ago

OTTL should definitely be the de facto solution when making decisions on data that has already be turned into pdata, but for situations where the filter configuration is modifying how the receiver collects data in the first place it gets trickier.

If we want a truly uniform filtering experience across all components making decision at any moment I believe the best solution is OTTL. It will be the most robust, allowing users to build any type of condition they need. If we use OTTL we won't need to worry about feature parity between 2 condition systems.

We have added a lot of plumbing to OTTL that will treat pcommon.Slice and []any as the same and pcommon.Map and map[string]any as the same, so I believe the primary work would be a new, extremely simple, "filter-specific" context and plumbing in internal/filterottl. I think it is possible that the context is a noop - there situations shouldn't actually transform and save any telemetry, we just need OTTL Conditions to help us make a decision.

But taking a step back, I believe we can separate the configuration API itself from the underlying implementation details. I am not interested in changing the glob pattern matching that filelogreceiver uses - it is a very natural and appropriate form of identifying file paths to collect. Also, I don't fully comprehend situation 2. I don't think receivers should be using a filter configuration to determine what data to grab and how to modify it at the same time. I'd rather the modification of any data be handled in the transformprocessor.

@dmitryax I think this would be a good topic for the SIG meeting.

dmitryax commented 1 year ago

I don't think receivers should be using a filter configuration to determine what data to grab and how to modify it at the same time. I'd rather the modification of any data be handled in the transformprocessor.

Sometimes it needs to be done in the same component. For example, we need to grab key/value pairs from k8s pod annotations AND labels making sure that key collisions are resolved. k8sattributes processor sets them with defferent prefies: k8s.pod.labels. and k8s.pod.annotations. by default, but it's configurable with regex groups. Potentially, we can simplify it and have just prefixes to be configurable.

TylerHelmuth commented 1 year ago

Reading through the k8sattributes config I now see what you mean. It make sense why it allows setting tag_name based on the regex key capture groups, but my preference would be incorporating OTTL somehow, but not if it would overcomplicate things. If feels reasonable to rewrite a configuration like

extract:
   labels:
     - tag_name: $$1
        key_regex: kubernetes.io/(.*)

as something like

extract:
   labels:
     # IDK what the actual syntax would look like for this as we havent written an extract function
     # We are adding named parameters tho which ensure clarity
     - extract(attributes, "$$1", "kubernetes.io/(.*)")

But again, I don't want to add overcomplexity where it isn't needed. Introducing OTTL functions and conditions gives a lot of flexibility, but I don't want the added complexity if it isn't needed.

Behind the scenes I still think it would be good to use OTTL, that way we aren't managing multiple filter frameworks.

TylerHelmuth commented 1 year ago

Considering non-OTTL options I lean towards option B, with the addition of a glob field.

kentquirk commented 1 year ago

As said in the meeting, please keep glob available for filenames; they're too useful and well-understood.

I favor option B as well.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley

See Adding Labels via Comments if you do not have permissions to add labels yourself.

povilasv commented 11 months ago

I also like option B), it's pretty convenient and readable, you don't need to learn new syntax of / means regex, etc.

Though how would we support excluding elements? Two different fields with the same config structure, named include /exclude?

For example including certain resource attribute values:

resource_attributes:
  string.enum.resource.attr:
    description: Resource attribute with a known set of string values.
    type: string
    enum: [one, two]
    enabled: true
    include:
      - strict: one
      - regex: one|two

Example of excluding certain attribute values:

resource_attributes:
  string.enum.resource.attr:
    description: Resource attribute with a known set of string values.
    type: string
    enum: [one, two]
    enabled: true
    exclude:
      - strict: two

Another interesting complication from filtering resource attributes, is that they have a type: String, Slice, Map, Bytes. Would this work with maps / slice?

dmitryax commented 11 months ago

Though how would we support excluding elements? Two different fields with the same config structure, named include /exclude?

Sounds good to me. We should clearly document the precedence logic that is applied between them.

Another interesting complication from filtering resource attributes, is that they have a type: String, Slice, Map, Bytes. Would this work with maps/slice?

I think for other types, we should only support the strict mode.

andrzej-stencel commented 7 months ago

I also like option B. It's concise - brief enough, but comprehensible and unequivocal at the same time.

I would change the term strict to exact, seems this is the term usually used - "exact match", "regex match", etc.

open-telemetry / opentelemetry-collector-contrib