open-telemetry / opentelemetry-collector

OpenTelemetry Collector

[exporterhelper] User interface for controlling queue measurement #9462

Open dmitryax opened 8 months ago

dmitryax commented 8 months ago

Following the discussion https://github.com/open-telemetry/opentelemetry-collector/pull/8853#discussion_r1435671202

As part of https://github.com/open-telemetry/opentelemetry-collector/issues/8122, the batching capability is being moved to the exporter side. Batch sizes are configured in terms of items by default, while the current queue can be configured only in terms of requests. The mismatch between these measures can be confusing to the end user. At a minimum, we need to provide a way to configure the queue in terms of items so that it is consistent with the batching.

This issue is intended to answer the following questions:

  1. How should the user interface be defined for the end user to configure different queue measurements?
  2. Do we want to migrate the default measure for the queue from requests to items to be consistent with the batching configuration?

Currently, the sending_queue config group in the exporter configuration has a single field controlling the queue capacity, queue_size, which sets the maximum number of requests the queue can hold. The queue_ prefix can potentially be dropped as redundant during the migration to the new interface.
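
For reference, here is a minimal sketch of today's configuration, assuming the standard exporterhelper fields (the otlp exporter and the values are only illustrative):

```yaml
exporters:
  otlp:
    endpoint: otelcol:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 1000   # capacity, measured in number of requests
```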

The new exporter queue can additionally support the following measures. Options 1 and 2 are already implemented, just not exposed to the user until this issue is resolved. Option 3 can be added in the future.

  1. Number of basic items (log records, spans, data points) - this should be the default measure going forward, to be consistent with the batching config.
  2. Number of incoming requests - this is the current measure used by the queue_size field. We should keep it as an option.
  3. Size in MiB - would be nice to have in the future. We have similar options in the memory limiter: limit_mib and spike_limit_mib (see the sketch below).
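
For comparison, a minimal sketch of the existing memory limiter configuration, which already expresses its limits in MiB (values are illustrative):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000        # hard memory limit, in MiB
    spike_limit_mib: 800   # extra headroom for short-lived spikes, in MiB
```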

We have the following options for how the different queue measurements can be configured (illustrative config sketches follow the list):

A. Provide a separate field for each measure and validate that only one is set. The new fields would be added under the sending_queue config group in the exporter configuration.

B. Keep queue_size (or choose another name like size) and introduce another field for the queue_measure (or measure) enum, which can be items, requests, or MiB. (vote with ❤️)

C. Keep queue_size (or choose another name like size) and support unit suffixes like 10requests, 10items, 10[B|KB|MB...]. (vote with 🚀)
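
To make the three options concrete, here are hedged YAML sketches of what each could look like under sending_queue. Every field name other than queue_size/size is hypothetical and only meant to convey the shape of the configuration:

```yaml
# Option A (hypothetical field names): one field per measure; validation rejects setting more than one.
sending_queue:
  size_items: 10000        # or size_requests: 1000, or size_mib: 512

---
# Option B (hypothetical field names): a single size plus a measure enum.
sending_queue:
  size: 10000
  measure: items           # items | requests | mib

---
# Option C (hypothetical syntax): a single size field with a unit suffix.
sending_queue:
  size: 10000items         # or 1000requests, or 512MiB
```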

Keep in mind that the batching interface has to be aligned with what we come up with. https://github.com/open-telemetry/opentelemetry-collector/pull/8685 currently suggests the following options: min_size_items, max_size_items. So it's aligned with option A at the moment. Potentially, we can provide different measures for batching as well, but initially, the batching will be items-based only.
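
A hedged sketch of that batching proposal, assuming the batcher config group sits next to sending_queue under the exporter (exact naming and placement may still change):

```yaml
exporters:
  otlp:
    batcher:
      min_size_items: 8192    # flush once at least this many items have accumulated
      max_size_items: 16384   # split outgoing requests larger than this
```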

If we decide to migrate from requests to items by default in the sending_queue group, we need a migration plan. This cannot be done under the hood since it'll affect users. Option A requires deprecating queue_size and moving to another config option, so we can align that change with this migration and mention the default measurement change in the field deprecation warning. However, options B and C don't require a config option change if we keep using queue_size, so it's better to rename it to something else like size instead to provide the same migration experience.

Emojis for voting:

  👍 Yes, migrate the default queue measure to items.
  👎 Keep requests as the default measure, even if it's not consistent with the batcher.
  👀 Hold off on the migration until we have a byte-count measure, which can eventually become the new default.

It's important to mention that any decision we make for this issue won't take effect immediately. It'll be applied only after https://github.com/open-telemetry/opentelemetry-collector/issues/8122 is ready and stable and components have moved to the new exporter helper.

@open-telemetry/collector-approvers please share your thoughts.

dmitryax commented 8 months ago

I added emojis for voting, but feel free to suggest other options as well.

OverOrion commented 8 months ago

I think that having the queue size in MiB would be awesome and would make it easier to understand (especially for first-time users). The operator could also use it to automatically create Persistent Volumes based on these values.

jmacd commented 8 months ago

No emoji option was given for my preference; I don't like either of the options listed:

👍 Yes, migrate the default queue measure to items 👎 Keep the requests as measure by default, even if it's not consistent with the batcher.

The third option, not listed (it would have to be 👀), is to wait for measure option 3 and migrate the default queue measure to size_limit_mib. This behavior is the most consistent with the batch processor, since both items and requests can vary in size. For the queue to have a fixed-size memory footprint, which is my preference, we should migrate to a byte-count measure.

dmitryax commented 8 months ago

Thanks @jmacd, added

quentinmit commented 6 months ago

I think a unit smaller than MiB should be used for batch sizes; most common export APIs have limits near the 1 MB mark. For instance, CloudWatch's max batch size is 1 MiB. Google Cloud Logging's max batch size is 10 MiB. Promtail limits requests to 4 MiB. Splunk has a max batch size of 2 MiB. DataDog's limit is 3.2 MiB.

I would suggest using KiB or even just bytes.

This is also something that should probably be controllable by the individual exporter; I would expect the max batch size to be limited to an exporter-controlled max byte size, since there's no point in trying to send 20 MiB batches to CloudWatch.
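
Purely as a hypothetical illustration of that idea (neither the max_size_bytes field nor an exporter-enforced byte cap exists today), a byte-based cap aligned with a backend limit might look like:

```yaml
exporters:
  awscloudwatchlogs:
    batcher:
      max_size_bytes: 1048576   # hypothetical: keep batches under CloudWatch's ~1 MiB request limit
```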

jpkrohling commented 6 months ago

Agree with @jmacd and @quentinmit: size_limit_mib (or rather, size_limit_bytes) would be preferable from a UX perspective. Users usually have no idea how big their batches or spans are, and they are more comfortable making decisions when there's a clear correlation with cost: "I need X GiB for this queue, which means I need a machine that costs me $y."

matthewmodestino commented 2 months ago

> I think a unit smaller than MiB should be used for batch sizes;

Agree with this 100%. Smaller batches generally scale better as well. For Splunk Cloud, we try to encourage 256 KB as a starting point, well below the 1 MB soft limit we recommend.