open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
4.11k stars 1.37k forks source link

[confmap] Stricter types on configuration resolution #9532

Closed mx-psi closed 1 week ago

mx-psi commented 5 months ago

We enable the WeaklyTypedInput option from mapstructure. This option implies the following implicit conversions from the internal confmap's map[string]any to our Config structs:

  1. bools to string (true = "1", false = "0")
  2. numbers to string (base 10)
  3. bools to int/uint (true = 1, false = 0)
  4. strings to int/uint (using strconv.ParseInt)
  5. int to bool (true if value != 0)
  6. string to bool (accepts: 1, t, T, TRUE, true, True, 0, f, F, FALSE, false, False. Anything else is an error)
  7. empty array = empty map and vice versa
  8. negative numbers to overflowed uint values (base 10)
  9. slice of maps to a merged map
  10. single values are converted to slices if required. Each element is weakly decoded. For example: "4" can become []int{4} if the target type is an int slice.

We also do implicit casting to string when expanding a ${provider:URI} variable: https://github.com/open-telemetry/opentelemetry-collector/blob/cf51a35663302c133ee6caba1946fcb7c2421ca4/confmap/expand.go#L124-L139

I think this is too lax and commits us to supporting too many weird edge cases (in particular all but 2 and 4 seem like clearly things we should remove, and I would argue even 2 and 4 should be removed eventually). This conflicts with the changes I proposed on #9515, so I think we should resolve this one first.

djaglowski commented 5 months ago

How do we assess the impact of this? Doesn't this effectively mean a very large number of breaking changes for users?

commits us to supporting too many weird edge cases

Can you give an example? Is the problem that custom unmarshalers may preempt these input type conversions?

mx-psi commented 5 months ago

@djaglowski Here is an example configuration that is accepted by the OpenTelemetry Collector today and has all the edge cases listed above for WeaklyTypedInput:

receivers:
  otlp:
    protocols:
      grpc:
        max_concurrent_streams: -1 # Wraps around to math.MaxUint32 - 1 (8)

exporters:
  otlp:
    endpoint: endpoint.local
    headers:
      - truthy: true # Sets header to truthy: 1  because of implicit casting to string (1)
        number: 0123 # Sets header to number: 83 because of int (base 8) -> string -> int (base 10) (2)
      - this-is-actually-a-slice: yikes # The two maps get merged into one with three entries (9)
    read_buffer_size: true # Sets read_buffer_size to 1 (3)
    write_buffer_size: "1234" # sets write_buffer_size to 1234 (4)
    keepalive:
      permit_without_stream: 23 # sets permit_without_stream to true (5)
    tls:
      insecure: t # sets insecure to true (6)

service:
  pipelines:
    metrics:
      receivers: otlp # <-- Implicit casting of value to slice (10)
      processors: {} # <-- Implicit casting of empty map to empty slice (7)
      exporters: [otlp]

7-10 are specially confusing and franly seem nonsensical to me, and 1-6 deviate us from the YAML spec which I think leads to confusion.

mx-psi commented 5 months ago

How do we assess the impact of this? Doesn't this effectively mean a very large number of breaking changes for users?

As an (incomplete) start we can take a look at open-telemetry/opentelemetry-collector-contrib/pull/31188. There are four tests that need changes here for rule (2), all of them are of the form header1: 234 which I think is not super frequent.

To be honest (and again, this is my personal opinion and we should think about how to back with data) I believe the changes related to #8510 are going to be much more impactful in terms of people having to change their configurations than a change like this.

songy23 commented 5 months ago

From today's SIG: @songy23 to research on the difference in the type conversion rules between YAML library and mapstructure. We probably want to preserve the conversion rules overlapping with YAML library.

andrzej-stencel commented 5 months ago

Looking at all the implicit conversions listed by Pablo, I'm of the opinion that they should be removed, meaning the WeaklyTypedInput flag should be set to false.

I suppose it's virtually impossible to assess the impact this change would have. We don't know if users have configurations that rely on any of these implicit conversions. Speaking from my experience, I think the only place I might have seen one of those conversions used is in the Filelog receiver's include option:

receivers:
  filelog:
    include: logs.log

To make transition as painless for users as possible, we could prepare a migration plan like:

  1. Start warning users of configuration errors while still accepting the erroneous config
  2. Add a feature gate to guard the change

Anything else we can do?

mx-psi commented 4 months ago

Warnings could be done by doing unmarshaling twice, once with WeaklyTypedInput set to false and another with it set to true, and reporting any differences.

mx-psi commented 3 months ago

Note: when we address this we need to remove the hook added in #9169