jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

[jaeger-v2] Storage backend integration tests #5254

Closed james-ryans closed 3 months ago

james-ryans commented 6 months ago

Requirement

Since the Jaeger storage extension for Jaeger-v2 is going to fully support Jaeger-v1's storage backends, unit tests on each storage backend are not enough. We need to conduct end-to-end tests of the OpenTelemetry Collector pipeline against the targeted database.

Problem

There are still no integration tests that verify the traces actually stored in the database by the V2 Jaeger storage extension.

Proposal

Fortunately, the OpenTelemetry Collector already has a testbed framework to help us conduct end-to-end tests.

Testbed is a controlled environment and set of tools for conducting end-to-end tests of the OpenTelemetry Collector, including reproducible short-term benchmarks, correctness tests, long-running stability tests, and maximum-load stress tests. However, we will only use the correctness tests from the testbed: they generate and send traces covering every combination of trace attributes, and match each one against the traces received at the other end.
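The correctness-test idea (generate a trace per attribute combination, then match every generated trace ID at the receiving end) can be sketched in a self-contained way. The types below are illustrative stand-ins, not the real testbed API:

```go
package main

import "fmt"

// Illustrative model of a correctness test: generate one trace per
// combination of the given attribute values, "send" them through a
// pipeline, and verify every generated trace ID comes out the other end.

type trace struct {
	ID    string
	Attrs map[string]string
}

// generateTraces builds one trace per combination of the given attribute values.
func generateTraces(statuses, kinds []string) []trace {
	var out []trace
	for i, s := range statuses {
		for j, k := range kinds {
			out = append(out, trace{
				ID:    fmt.Sprintf("trace-%d-%d", i, j),
				Attrs: map[string]string{"status": s, "span.kind": k},
			})
		}
	}
	return out
}

// matchReceived reports IDs that were sent but never received.
func matchReceived(sent, received []trace) []string {
	seen := make(map[string]bool, len(received))
	for _, t := range received {
		seen[t.ID] = true
	}
	var missing []string
	for _, t := range sent {
		if !seen[t.ID] {
			missing = append(missing, t.ID)
		}
	}
	return missing
}

func main() {
	sent := generateTraces([]string{"ok", "error"}, []string{"client", "server"})
	received := sent[:3] // simulate one trace lost in the pipeline
	fmt.Println(len(sent), matchReceived(sent, received)) // prints "4 [trace-1-1]"
}
```

As yurishkuro notes later in the thread, this only checks that the data gets through by ID; it does not exercise query semantics.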

Architecture of the integration test

Here's the architecture we will use to test the OpenTelemetry Collector pipeline end-to-end with the designated storage backends (diagram: jaeger-v2-testbed). Testbed components:

Plan

The execution of integration tests will be done incrementally, one by one, on every supported storage backend:

Open questions

No response

yurishkuro commented 6 months ago

@james-ryans as I was reviewing the PRs that follow from this issue, I am starting to have some concerns with this approach. Here is the set of requirements that I think we need to meet:

  1. we need to exercise the full pipeline to write data externally and verify that it makes it to the storage
    • (1b) we need to write data in different formats, not just OTLP
  2. we then also need to exercise the querying API
  3. we need to exercise archiving capability
  4. we need to validate that the config files we're providing in cmd/jaeger are valid by doing an e2e smoke test
    • (4b) in v1 we also had some docker-compose files that need to be tested
  5. we need to generate code coverage for some parts of the code that do not get exercised in unit tests (usually related to initializing the storage drivers)
  6. we need to provide a capability for external plugin providers (implementing the gRPC Storage API, such as Quickwit or a Postgres plugin) to also run e2e tests for writing and querying, as a way of certifying compatibility with Jaeger

In the current state:

I think we can solve all 6 requirements by building on our existing integration tests rather than on the OTEL testbed. Perhaps we can also find a way to utilize the testbed's data-generation ability and incorporate it as a step in the overall integration, but on its own I don't see how it can solve all the requirements.

Achieving this will streamline our integration tests by converging onto a single framework, instead of using 3 different ones for bits and pieces. This is probably a large task, so I would like to find a path of incremental improvements that lead us to the overall goal. Let's give it some thought.

james-ryans commented 5 months ago

There are some points that are still ambiguous to me, and I want to clarify them. Right now, I just want to focus on the first three points of your vision and intention:

yurishkuro commented 5 months ago

Yes, that is all correct. For instance, with ES, in the unit test model the test will instantiate es.SpanWriter and when it calls writer.WriteSpan() it's an in-process call to the es storage implementation that writes directly to ES. But in e2e mode, a different SpanWriter will be instantiated that executes an OTLP-RPC request to the running collector, where it will be accepted by the receiver and written to storage by the exporter.
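The two modes described above can be sketched as a single interface with two implementations, so the test code stays identical in both modes. All type names below are illustrative, not the actual Jaeger interfaces:

```go
package main

import "fmt"

// Illustrative sketch of the two test modes; the types here are
// hypothetical stand-ins, not the real Jaeger storage interfaces.

type Span struct{ TraceID string }

// SpanWriter is what the integration test writes through in both modes.
type SpanWriter interface {
	WriteSpan(s Span) error
}

// inProcessWriter calls the storage implementation directly (unit test mode).
type inProcessWriter struct {
	storage map[string]Span
}

func (w *inProcessWriter) WriteSpan(s Span) error {
	w.storage[s.TraceID] = s
	return nil
}

// rpcWriter stands in for a writer that sends an OTLP request to a running
// collector, which then writes to storage via its exporter (e2e mode).
type rpcWriter struct {
	send func(Span) error // an OTLP gRPC call in the real setup
}

func (w *rpcWriter) WriteSpan(s Span) error { return w.send(s) }

func main() {
	backend := map[string]Span{}

	var writer SpanWriter = &inProcessWriter{storage: backend}
	writer.WriteSpan(Span{TraceID: "t1"})

	// In e2e mode only the writer changes; the test code stays the same.
	writer = &rpcWriter{send: func(s Span) error {
		backend[s.TraceID] = s // pretend the collector pipeline did this
		return nil
	}}
	writer.WriteSpan(Span{TraceID: "t2"})

	fmt.Println(len(backend)) // prints 2: both spans reached the backend
}
```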

yurishkuro commented 5 months ago

unit test mode

```mermaid
flowchart LR
    Test -->|writeSpan| SpanWriter
    SpanWriter --> B(StorageBackend)
    Test -->|readSpan| SpanReader
    SpanReader --> B

    subgraph Integration Test Executable
        Test
        SpanWriter
        SpanReader
    end
```

e2e test mode

```mermaid
flowchart LR
    Test -->|writeSpan| SpanWriter
    SpanWriter --> RPCW[RPC_client]
    RPCW --> Receiver
    Receiver --> Exporter
    Exporter --> B(StorageBackend)
    Test -->|readSpan| SpanReader
    SpanReader --> RPCR[RPC_client]
    RPCR --> jaeger_query
    jaeger_query --> B

    subgraph Integration Test Executable
        Test
        SpanWriter
        SpanReader
        RPCW
        RPCR
    end

    subgraph jaeger-v2
        Receiver
        Exporter
        jaeger_query
    end
```

james-ryans commented 5 months ago

I have created an action plan to provide us with a clear, structured pathway so we can execute this in parallel. Some thoughts are welcome if my idea doesn't match with our vision.

  1. Prototyping the new integration tests

    1. Implement the unit tests that exercise the querying API (2) and archiving (3). Code that initializes the storage drivers (5) should be covered automatically by these tests.


      My thought on how this will be implemented: we only need to pass the config to a setup function that starts the storage extension; within the setup we can retrieve the SpanWriter and SpanReader. I'm not sure, but we can probably reuse the StorageIntegration module.
      Also, I found that the archiving capability is only tested on the Elasticsearch storage.

    2. Extend the unit test into an e2e test: instead of starting only the storage extension, use a config file from cmd/jaeger to spawn the whole collector pipeline (4), then implement a SpanWriter and SpanReader that send span data through gRPC requests to the receiver and query it from jaeger_query (1).

    List of the storage backends that need to be tested:

    • memory
    • gRPC
    • badger
    • cassandra
    • elasticsearch
    • opensearch
  2. Refactoring and an example for external plugin providers.

    1. Refactor the unit test and e2e test to run in the same workflow so they use the same storage backend. We need to be extra careful with previously written data.
    2. Refactor bootstrapping tests to rely on docker-compose files.
    3. Add an example on how to test external plugin providers with our gRPC storage tests.
  3. Add the crossdock tests.

With this, we can prototype the unit-test and e2e-test modes in parallel. However, after the unit test is merged, we need to refactor the e2e test to have a similar structure. Once the unit test and e2e test for one of the storage backends are merged, we can continue with the other backends. After that, we can do the refactoring and the example from plan 2 in parallel. The last step is to think through how to test interoperability between SDKs and exercise the receipt of data in different formats, crossdock-style.

james-ryans commented 5 months ago

And I'll try to prototype the e2e test for the gRPC storage backend since @Pushkarm029 is working on the gRPC unit test.

yurishkuro commented 5 months ago

@james-ryans a couple thoughts

james-ryans commented 5 months ago

Ohh wow, nice.. I overlooked that this task exists. I'll take a look at it.

> my diagrams only show the extension of the existing /integration/ tests to work in e2e mode. Do you see the benefits of also using OTEL testbed in this setup?

Some components of it might be useful, but we could easily implement them on our own if we wanted to, probably modifying them for our specific use case. I'm thinking we should be able to use the OTEL testbed collector (testbed/testbed/in_process_collector.go) to start jaeger-v2.

We could probably also use the OTEL testbed sender component to write span data through RPC requests. However, I still need to examine it to get a concrete picture. One concern is that the sender lacks the functionality to close the RPC connection.

yurishkuro commented 5 months ago

One main difference to me is that our integration tests generate very specific traces and then query for them in very specific ways, to actually exercise the querying capabilities & permutations. But the OTEL testbed just generates a random flood of data and only checks that it all gets through (not even that, as I believe it only checks the IDs). That was really my question - what is the value of such a data source? It's not really fuzz testing, since the data is still hardcoded (just permuted for the load). I could see it potentially being useful for stress testing, but we don't do that today (it would need dedicated hardware, not GH runners).

james-ryans commented 5 months ago

The sender is just a wrapper for the OTLP exporter: we can call its ConsumeTraces func with our specific traces, and the sender handles the rest of the RPC requests. The OTEL testbed has data-provider and sender components; the data provider is the one that generates random traces and pushes them through the sender. With the sender alone, we should be able to utilize it for our integration tests.
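A minimal model of that sender idea, including the Close capability noted earlier as missing from the testbed sender. The types here are illustrative stand-ins, not the actual testbed code:

```go
package main

import "fmt"

// Sketch of the sender idea: a thin wrapper that accepts specific traces
// via ConsumeTraces and forwards them over an RPC client. Names loosely
// mirror the OTEL testbed; all types here are hypothetical.

type Traces struct{ SpanCount int }

// rpcClient stands in for an OTLP gRPC client.
type rpcClient struct {
	sent   int
	closed bool
}

func (c *rpcClient) Export(t Traces) error {
	if c.closed {
		return fmt.Errorf("connection closed")
	}
	c.sent += t.SpanCount
	return nil
}

// Sender wraps the client; unlike the testbed sender, it also exposes
// Close, the capability noted as missing above.
type Sender struct{ client *rpcClient }

func (s *Sender) ConsumeTraces(t Traces) error { return s.client.Export(t) }
func (s *Sender) Close()                       { s.client.closed = true }

func main() {
	s := &Sender{client: &rpcClient{}}
	s.ConsumeTraces(Traces{SpanCount: 3})
	s.Close()
	err := s.ConsumeTraces(Traces{SpanCount: 1}) // rejected after Close
	fmt.Println(s.client.sent, err != nil)       // prints "3 true"
}
```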

james-ryans commented 5 months ago

@yurishkuro with the new integration requirements, we no longer need to test the collector pipeline with the testbed as I proposed before, is that right? If so, we can just delete it.

yurishkuro commented 5 months ago

I think so, but that was really my question to you - if we used the testbed, what additional aspects or behavior would it be testing?

james-ryans commented 5 months ago

Okay. It doesn't provide any benefit at this point, since all the test cases are already covered by the existing StorageIntegration. But we can reuse some of its components to provide an easier setup for the new integration tests.

yurishkuro commented 5 months ago

Copying from https://github.com/jaegertracing/jaeger/pull/5355#discussion_r1566018600 - let's add this to the README.

```mermaid
flowchart LR
    Receiver --> Processor
    Processor --> Exporter
    JaegerStorageExtension -->|"(1) get storage"| Exporter
    Exporter -->|"(2) write trace"| Badger

    Badger_e2e_test -->|"(1) POST /purge"| HTTP_endpoint
    JaegerStorageExtension -->|"(2) getStorage()"| HTTP_endpoint
    HTTP_endpoint -.->|"(3) storage.(*Badger).Purge()"| Badger

    subgraph Jaeger Collector
        Receiver
        Processor
        Exporter

        Badger
        BadgerCleanerExtension
        HTTP_endpoint
        subgraph JaegerStorageExtension
            Badger
        end
        subgraph BadgerCleanerExtension
            HTTP_endpoint
        end
    end
```