
OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Scraper feedback, and what to do next? #11238

Open bogdandrutu opened 3 hours ago

bogdandrutu commented 3 hours ago

Current state:

While https://github.com/open-telemetry/opentelemetry-collector/issues/9473 was the right call in terms of the strangeness of constructing an ID from a string, we should not pass a "type": the idea was that a "receiver" can have multiple scrapers, but if that is true, all of them would share the same receiver/component type, correct?

We have probably made a mistake (a big part of the problem is the author of this issue) with the design of the scrapers. Here is a set of questions we need to answer; based on the answers, we can change the design of the scraper package:

  1. Should the "scraperhelper" be a receiver that supports multiple "scrapers", or a receiver that supports a single Scraper, with a "multi source" scraper available for cases where a receiver really needs to scrape multiple sources?
  2. Should we have a standalone "scraper" type instead of making this a receiver? Or maybe we have only one "scraperreceiver" that allows users to configure multiple scrapers?
  3. Should we have a special receiver that allows configuring scrapers?

During the discussion and decision, it should be considered that we currently have a big problem: each scraper/receiver creates its own timer, so we lose the opportunity to batch at the source. Other feedback we have received about the current scraper-as-a-receiver design is that users have to configure the scrape interval individually for each receiver.

I like option 3. A proposed approach is to start from how a user should configure this, and then see how we can achieve it while making sure we "batch" all these scrapes as much as possible:

```yaml
receivers:
  scraper/30sec:
    collection_interval: 30s
    initial_delay: 10s
    timeout: 10s
    sources:
      sql:
        ... # things specific to sql, including enable/disable metrics
      nginx:
        ... # things specific to nginx, including enable/disable metrics
```

Later we can add the capability to do "source discovery", or build a different receiver like the current "receivercreator", which can be adapted to follow the same pattern/structure.

dmitryax commented 3 hours ago

I like the idea. We can have just one scraper receiver eventually. I see a few opportunities with this approach:

  1. It'll remove the confusion between "push" and "pull" receivers. All receivers will be "push" except for the scraper, and we won't need kafkametrics in addition to the kafka receiver.

  2. The batching, as you mentioned, with consistent timestamps. However, it can lead to huge payloads that potentially have to be split on the exporter side. So, batching can probably be a configurable behavior. Also, I think we should still initiate separate workers for each source scraping the data concurrently + one main worker batching them together if it's enabled.

  3. Built-in discovery would work perfectly with this new receiver. Each source can expose an optional Discoverable interface in addition to Start(...) and Scrape(...). Each source could have default built-in discovery rules for each observer with an option to override them by the user. In that case, we won't need the receiver_creator anymore.

  4. We can have additional global (applied to all sources) filtering and aggregation capabilities. Currently, each scraping receiver provides an option to disable metrics (and potentially aggregation), but it has to be configured for every metric separately, which is not always convenient. For example, I may want to disable all metrics starting with system.processes. but don't want to list all of them individually. We tried to add wildcard support to the existing metrics config group. It would be much better to keep the scraper.sources.<scraper>.metrics config schema strict and move the filtering somewhere else, e.g., scraper.filter. cc @braydonk
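The optional Discoverable interface from point 3 maps naturally onto Go's type-assertion pattern for optional interfaces. A hedged sketch, where `Scraper`, `Discoverable`, and both scraper types are hypothetical names, not an agreed API:

```go
package main

import "fmt"

// Scraper is the base contract every source implements (hypothetical).
type Scraper interface {
	Scrape() []string
}

// Discoverable is the optional interface a source may additionally
// expose so the receiver can find endpoints before scraping them.
type Discoverable interface {
	Discover() []string // returns discovered endpoints
}

// nginxScraper opts into discovery.
type nginxScraper struct{}

func (nginxScraper) Scrape() []string   { return []string{"nginx.requests"} }
func (nginxScraper) Discover() []string { return []string{"10.0.0.5:80"} }

// staticScraper does not implement Discoverable.
type staticScraper struct{}

func (staticScraper) Scrape() []string { return []string{"system.cpu.time"} }

// endpoints uses a type assertion to honor discovery only when a
// source opts in, the usual Go pattern for optional interfaces.
func endpoints(s Scraper) []string {
	if d, ok := s.(Discoverable); ok {
		return d.Discover()
	}
	return nil
}

func main() {
	fmt.Println(endpoints(nginxScraper{}))
	fmt.Println(endpoints(staticScraper{}))
}
```

Sources without discovery would simply be scraped at their configured endpoints, with per-observer default rules layered on top for those that opt in.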
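The global wildcard filtering from point 4 could be sketched with standard-library glob matching. The function and the `scraper.filter`-style exclude list are assumptions for illustration, not an agreed config schema:

```go
package main

import (
	"fmt"
	"path"
)

// filterMetrics drops any metric whose name matches one of the exclude
// patterns, sketching what a global scraper.filter section might do.
// path.Match supports '*' wildcards, e.g. "system.processes.*".
func filterMetrics(names, excludePatterns []string) []string {
	var kept []string
	for _, name := range names {
		excluded := false
		for _, p := range excludePatterns {
			if ok, _ := path.Match(p, name); ok {
				excluded = true
				break
			}
		}
		if !excluded {
			kept = append(kept, name)
		}
	}
	return kept
}

func main() {
	names := []string{
		"system.processes.count",
		"system.processes.created",
		"system.cpu.time",
	}
	// Exclude everything under system.processes. with one pattern
	// instead of listing each metric individually.
	fmt.Println(filterMetrics(names, []string{"system.processes.*"}))
}
```

Keeping this in one global filter stage leaves each source's `metrics` schema strict while still giving users wildcard convenience.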

braydonk commented 2 hours ago

Re 4:

It would be much better to keep the scraper.sources.<scraper>.metrics config schema strict and move the filtering somewhere else, e.g., scraper.filter

Perhaps, I think it's a cleaner config interface for sure. Provided I'm understanding this idea vs the current state of things, configuring enable/disable at the start has one big benefit: scrapers can recognize that certain metrics aren't required at scrape time, and can subsequently opt to ignore them. I have a hostmetrics PR where skipping collection for a particular metric when it's disabled yielded major performance benefits.
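The scrape-time benefit described above can be shown in a small sketch: when the scraper can see that a metric is disabled, it skips the expensive collection entirely rather than collecting data that will be dropped. Names here (`metricConfig`, `scrapeProcesses`) are hypothetical, not the hostmetrics code:

```go
package main

import "fmt"

// metricConfig mirrors the per-metric enable/disable toggle scrapers
// see today via mdatagen-style config.
type metricConfig struct {
	Enabled bool
}

// expensiveCollect stands in for a costly syscall or /proc walk.
func expensiveCollect() int { return 42 }

// scrapeProcesses checks the config at scrape time and skips the
// expensive work for disabled metrics, which is where the performance
// win comes from.
func scrapeProcesses(cfg map[string]metricConfig) map[string]int {
	out := map[string]int{}
	if cfg["system.processes.count"].Enabled {
		out["system.processes.count"] = expensiveCollect()
	}
	return out
}

func main() {
	disabled := map[string]metricConfig{
		"system.processes.count": {Enabled: false},
	}
	// Nothing collected, and expensiveCollect never runs.
	fmt.Println(len(scrapeProcesses(disabled)))
}
```

Whatever filtering layer is chosen, the scraper would still need visibility into the effective enabled set to keep this optimization.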

FWIW, the spec I wrote for that wildcard matching [1] has the potential to be applied in a way other than the current PR. I would happily welcome a better place for it to live than as generated mdatagen code like in the PR.

[1]: As of writing this is due for updates, since Dmitrii and I have already agreed on a major simplification I can make. But the spirit will remain.