[RFC] Replacing reducer with stream-processing framework Arroyo, POC

Describe the issue you're reporting

The reducer processes collector telemetry to time series. In recent years, stream processing frameworks have emerged that can perform a similar task, with lower implementation complexity and standardized interfaces. Using a stream processing framework for the reducer should reduce the maintenance load and help more easily evolve the project.

One promising stream processing framework is Arroyo, which seems dual-licensed MIT/Apache2.0. Arroyo takes in pipeline definitions in SQL, and configures streaming workers to materialize the results. @mwylde gave a talk at Data Council'24: YouTube link and after discussing with him and @jacksonrnewhouse, it seems like it could be a good fit for the reducer.

How could such an implementation look?

Collectors create and update "intake" tables for sockets, processes, containers, Kubernetes Deployments etc.
A first stage streaming pipeline joins the "intake" tables together to a socket table, which contains per-socket rows, enriched with available metadata: the socket 5-tuple with process, node, Kubernetes, and cloud metadata on both sides.
An aggregation table computes aggregations from the socket granularity table to what is currently on offer from the reducer (service-service, az-az, etc.)
The stream processor outputs aggregated rows at regular intervals using a tumbling windowed query, e.g., every 5 seconds.
Resulting rows flow to an OpenTelemetry sink that encodes into OTLP metric format and sends to the Otel collector.

To get more insight of how good a fit this is, and the amount of effort required, we should build an initial proof-of-concept using raw telemetry dumps from collectors to reducer, to Arroyo and demonstrate joins and aggregations. A minimal demo might stream 30 seconds of telemetry from a single kernel collector, and show output of total throughput by container.

Does this require a lot of development? It seems we have most of the tooling to make this work. The collectors already have the ability to dump the binary stream that would go to the reducer ("intake" data) to a file, see IntakeConfig and kernel_collector_test (config and instantiation), and it seems possible to control dumps via an environment variable. We have existing tooling to translate the binary data to JSON.

Arroyo accepts several source connectors. The POC might stream JSON to Fluvio(Apache-2.0) which Arroyo accepts. The Fluvio CLI accepts records from files or stdin, so while this adds another components to the POC, it appears it requires no code modifications, just configuration.

How will this look eventually? There are more steps to get there, but to provide a general idea if this isn't clear from above:

The collectors could output JSON
Bundle the fluvio CLI with the collectors, and stream JSON to it
The reducer will comprise a Fluvio Stream Processing Unit (SPU), an Arroyo system (worker+controller)
We contribute an OpenTelemetry metric sink to Arroyo, so Arroyo sends metrics directly to the Otel collector.

cc @open-telemetry/network-maintainers @open-telemetry/network-approvers @open-telemetry/network-triagers

open-telemetry / opentelemetry-network

[RFC] Replacing reducer with stream-processing framework Arroyo, POC #263

Describe the issue you're reporting