jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

[jaeger-v2] Jaeger v1 vs. v2 Benchmarking #5768

Open jkowall opened 1 month ago

jkowall commented 1 month ago

Background

Jaeger is an open-source, end-to-end distributed tracing system that helps monitor and troubleshoot transactions in complex, microservices-based environments. Jaeger v2 is a major new version where we rebase all Jaeger backend components (agent, collector, ingester, and query) on top of the OpenTelemetry Collector, bringing significant improvements and changes to the platform.

The transition from v1 to v2 introduces significant architectural changes, particularly in the collector component. As part of this transition, it's crucial to understand the performance implications of these changes through comprehensive benchmarking.

Relevant links:

Project Objective

The goal of this project is to develop a comprehensive benchmarking suite that compares the performance of Jaeger v1 and v2, with a primary focus on the collector component. This benchmarking will provide valuable insights into the efficiency, scalability, and resource utilization of both versions, helping the community understand the benefits and potential trade-offs of migrating to Jaeger v2. The CNCF will provide compute resources for this project if needed; please coordinate with the mentors.

Key Features and Implementation

  1. Benchmarking Environment Setup

    • Develop a reproducible environment for running benchmarks, using tools like Docker (a launcher sketch follows this list).
    • Ensure consistent hardware and software configurations for fair comparisons.
    • Create scripts to automate the deployment of Jaeger v1 and v2 components in isolation.
    • Support multiple backends for benchmarking (Elasticsearch, OpenSearch, Cassandra).
  2. Workload Generation

    • Utilize cmd/tracegen as a workload generator that can simulate various real-world scenarios.
    • Develop mechanisms to control the rate and volume of span ingestion (see the tracegen driver sketch after this list).
  3. Performance Metrics Collection

    • Implement collection of key performance indicators, including:
      • Throughput (spans processed per second)
      • Latency (processing time per span)
      • Resource utilization (CPU, memory, network, disk I/O)
      • Dropped span rate under high load
    • Utilize Prometheus for metrics collection and storage (a query sketch follows this list).
    • Utilize Grafana for dashboards and reporting.
  4. Storage Backend Integration

    • Evaluate collector performance with different storage backends (Elasticsearch, Cassandra, OpenSearch); a backend-selection sketch follows this list.
    • Measure the impact of different storage configurations on collector performance.
  5. Data Processing and Analysis

    • Generate comprehensive dashboards and reports comparing v1 and v2 performance across different scenarios (a report-formatting sketch follows this list).
  6. Documentation and Reproducibility

    • Prepare a blog post summarizing the results.
    • Create detailed documentation of the benchmarking methodology, environment setup, and test scenarios.
    • Develop a guide for running the benchmarks, allowing community members to reproduce and verify results.
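
A minimal sketch of how the environment launcher could look. The per-version Compose file names (docker-compose-v1.yml, docker-compose-v2.yml) and the JAEGER_VERSION variable are hypothetical placeholders for this sketch; the docker compose CLI calls are standard.

```go
// benchenv.go: bring up an isolated Jaeger stack for one benchmark run.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
)

// composeUp starts the stack described by the given Compose file and
// blocks until its containers report healthy (--wait).
func composeUp(file string) error {
	cmd := exec.Command("docker", "compose", "-f", file, "up", "-d", "--wait")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	version := os.Getenv("JAEGER_VERSION") // "v1" or "v2"; hypothetical knob
	if version == "" {
		version = "v1"
	}
	// docker-compose-v1.yml / docker-compose-v2.yml are placeholder names.
	file := fmt.Sprintf("docker-compose-%s.yml", version)
	if err := composeUp(file); err != nil {
		log.Fatalf("starting %s stack: %v", file, err)
	}
	log.Printf("Jaeger %s stack is up", version)
}
```

Pinning image tags and CPU/memory limits inside the Compose files is what keeps runs comparable between v1 and v2.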
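For workload generation, cmd/tracegen already ships with the repository. Below is a sketch of a driver that runs it at fixed concurrency for a fixed window. The flags shown (-workers, -duration, -service) exist in tracegen, but flag sets change between releases, so verify them against the binary under test; the service name is a placeholder.

```go
// loadgen.go: drive tracegen for one load step of a benchmark run.
package main

import (
	"log"
	"os"
	"os/exec"
	"strconv"
	"time"
)

func main() {
	workers := 4                // concurrent span producers
	duration := 2 * time.Minute // length of this load step

	// Check these flags against the tracegen build being benchmarked.
	cmd := exec.Command("tracegen",
		"-workers", strconv.Itoa(workers),
		"-duration", duration.String(),
		"-service", "benchmark-load", // hypothetical service name
	)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("tracegen run failed: %v", err)
	}
}
```

Stepping the worker count up between runs gives the ingestion-rate control called for above.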
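After each run, throughput numbers can be pulled out of Prometheus via its HTTP API. The sketch below assumes Prometheus is reachable at localhost:9090 and uses the metric names the two collectors are expected to export (jaeger_collector_spans_received_total for v1, otelcol_receiver_accepted_spans for the OTel-based v2); confirm both on the collectors' /metrics endpoints before trusting the numbers.

```go
// metrics.go: pull a throughput figure out of Prometheus after a load run.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

// promResult mirrors only the fields we need from the Prometheus
// /api/v1/query response body.
type promResult struct {
	Data struct {
		Result []struct {
			Value [2]interface{} `json:"value"` // [timestamp, "value"]
		} `json:"result"`
	} `json:"data"`
}

// instantQuery runs a PromQL instant query against promURL.
func instantQuery(promURL, query string) (string, error) {
	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var r promResult
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		return "", err
	}
	if len(r.Data.Result) == 0 {
		return "", fmt.Errorf("no samples for %q", query)
	}
	return fmt.Sprintf("%v", r.Data.Result[0].Value[1]), nil
}

func main() {
	// Metric names differ between versions; verify them on /metrics.
	queries := map[string]string{
		"v1 spans/sec": `rate(jaeger_collector_spans_received_total[1m])`,
		"v2 spans/sec": `rate(otelcol_receiver_accepted_spans[1m])`,
	}
	for label, q := range queries {
		v, err := instantQuery("http://localhost:9090", q)
		if err != nil {
			log.Printf("%s: %v", label, err)
			continue
		}
		fmt.Printf("%s: %s\n", label, v)
	}
}
```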
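Backend selection can be a single parameter of the harness. In the sketch below, SPAN_STORAGE_TYPE, ES_SERVER_URLS, and CASSANDRA_SERVERS are real Jaeger v1 collector settings, while the hostnames, the BENCH_BACKEND variable, and the use of the latest image tag are placeholders; v2 configures storage through its OTel-style config file instead, so the equivalent there is swapping config files.

```go
// backend.go: start a v1 collector parameterized by storage backend.
package main

import (
	"log"
	"os"
	"os/exec"
)

// backendEnv maps a backend name to the environment the Jaeger v1
// collector reads; endpoints are placeholders for this sketch.
var backendEnv = map[string][]string{
	"elasticsearch": {"SPAN_STORAGE_TYPE=elasticsearch", "ES_SERVER_URLS=http://elasticsearch:9200"},
	"opensearch":    {"SPAN_STORAGE_TYPE=opensearch", "ES_SERVER_URLS=http://opensearch:9200"},
	"cassandra":     {"SPAN_STORAGE_TYPE=cassandra", "CASSANDRA_SERVERS=cassandra"},
}

func main() {
	backend := os.Getenv("BENCH_BACKEND") // hypothetical harness knob
	env, ok := backendEnv[backend]
	if !ok {
		log.Fatalf("unknown backend %q", backend)
	}
	args := []string{"run", "--rm", "-d", "--name", "jaeger-collector"}
	for _, e := range env {
		args = append(args, "-e", e)
	}
	args = append(args, "jaegertracing/jaeger-collector:latest")
	cmd := exec.Command("docker", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("starting collector with %s backend: %v", backend, err)
	}
}
```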
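Beyond Grafana dashboards, a small amount of post-processing can turn paired measurements into the written report. A sketch of the formatting step; the numbers are dummies purely to show the report shape, and the real suite would substitute values collected from Prometheus.

```go
// report.go: emit a markdown comparison row for the v1 vs. v2 report.
package main

import "fmt"

// delta returns the percent change from v1 to v2.
func delta(v1, v2 float64) float64 {
	return (v2 - v1) / v1 * 100
}

func main() {
	// Dummy values only; replace with measured results.
	v1Throughput, v2Throughput := 10000.0, 11000.0
	fmt.Println("| metric | v1 | v2 | change |")
	fmt.Println("|---|---|---|---|")
	fmt.Printf("| throughput (spans/s) | %.0f | %.0f | %+.1f%% |\n",
		v1Throughput, v2Throughput, delta(v1Throughput, v2Throughput))
}
```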

Expected Outcome

By the end of this project, we aim to have:

  • A reproducible benchmarking suite, with automation scripts and documentation, that community members can run to verify results.
  • Comparative measurements of Jaeger v1 and v2 collector performance (throughput, latency, resource utilization, dropped spans) across the supported storage backends.
  • Dashboards, a written analysis, and a blog post summarizing the findings.

Proposal

No response

Open questions

No response

yurishkuro commented 1 month ago

Previous ticket https://github.com/jaegertracing/jaeger/issues/4869

Previous PR https://github.com/jaegertracing/jaeger/pull/5214

yurishkuro commented 1 month ago

Earlier attempt: https://github.com/jaegertracing/jaeger/pull/5214/files