kubeshop / tracetest

🔭 Tracetest - Build integration and end-to-end tests in minutes, instead of days, using OpenTelemetry and trace-based testing.
https://docs.tracetest.io/

tracetest + signoz integration in K8S cluster not working as expected. #2987

Open Connect2naga opened 1 year ago

Connect2naga commented 1 year ago

Hi,

Thanks for providing support for Signoz in https://github.com/kubeshop/tracetest/pull/2935. We verified it with the Docker Compose steps provided in the repo, and it works fine.

However, we are facing a few issues with the K8s setup:

1. The data source is not visible in the UI.

2. The test case executes successfully, but we are not able to see the trace.

Signoz Configurations:


 $:\> kubectl get svc -nobservability
NAME                                        TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                                                                                     AGE
signoz-zookeeper-headless                   ClusterIP      None            <none>          2181/TCP,2888/TCP,3888/TCP                                                                  29d
signoz-query-service                        ClusterIP      10.43.179.107   <none>          8080/TCP,8085/TCP                                                                           29d
signoz-k8s-infra-otel-deployment            ClusterIP      10.43.39.148    <none>          13133/TCP                                                                                   29d
signoz-alertmanager-headless                ClusterIP      None            <none>          9093/TCP                                                                                    29d
signoz-frontend                             ClusterIP      10.43.152.17    <none>          3301/TCP                                                                                    29d
signoz-zookeeper                            ClusterIP      10.43.54.174    <none>          2181/TCP,2888/TCP,3888/TCP                                                                  29d
signoz-alertmanager                         ClusterIP      10.43.174.239   <none>          9093/TCP                                                                                    29d
signoz-clickhouse-operator-metrics          ClusterIP      10.43.92.69     <none>          8888/TCP                                                                                    29d
signoz-k8s-infra-otel-agent                 ClusterIP      10.43.18.24     <none>          13133/TCP,8888/TCP,4317/TCP,4318/TCP                                                        29d
signoz-otel-collector                       ClusterIP      10.43.154.120   <none>          14250/TCP,14268/TCP,8888/TCP,4317/TCP,4318/TCP                                              29d
signoz-otel-collector-metrics               ClusterIP      10.43.224.5     <none>          13133/TCP                                                                                   29d
signoz-clickhouse                           ClusterIP      10.43.74.190    <none>          8123/TCP,9000/TCP                                                                           29d
chi-signoz-clickhouse-cluster-0-0           ClusterIP      None            <none>          8123/TCP,9000/TCP,9009/TCP                                                                  29d

TraceTest Configurations:


$:\> kubectl get  cm  -ntracetest tracetest -o yaml
apiVersion: v1
data:
  config.yaml: |-
    poolingConfig:
      maxWaitTimeForTrace: 30s
      retryDelay: 5s
    googleAnalytics:
      enabled: true
    postgres:
      host: tracetest-postgresql
      user: tracetest
      password: not-secure-database-password
      port: 5432
      params: sslmode=disable
    telemetry:
      exporters:
        collector:
          exporter:
            collector:
              endpoint: otelcollector.dev.optimizor.app:80
            type: collector
          sampling: 100
          serviceName: tracetest
    receivers:
      otlp:
        protocols:
          grpc:
          http:

    processors:
      batch:
        timeout: 100ms

      # Data sources: traces
      probabilistic_sampler:
        hash_seed: 22
        sampling_percentage: 100

    exporters:
      # OTLP for Tracetest
      otlp/tracetest:
        endpoint: tracetest:4317 # Send traces to Tracetest.
                                 # Read more in docs here: https://docs.tracetest.io/configuration/connecting-to-data-stores/opentelemetry-collector
        tls:
          insecure: true
      # OTLP for Signoz
      otlp/signoz:
        endpoint: signoz-otel-collector.observability.svc.cluster.local:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [probabilistic_sampler, batch]
          exporters: [otlp/signoz,otlp/tracetest]
  provisioning.yaml: |-
    |
      ---
      # Datastore is where your application stores its traces. You can define several different datastores with
      # different names, however, only one is used by Tracetest.
      # You can see all available datastore configurations at https://kubeshop.github.io/tracetest/supported-backends/
      type: DataStore
      spec:
        name: Signoz
        type: signoz
        signoz:
          endpoint: signoz-otel-collector.observability.svc.cluster.local:4317
        tls:
          insecure: true
      ---
      type: Config
      spec:
        analyticsEnabled: true
      ---
      type: PollingProfile
      spec:
        name: Custom Profile
        strategy: periodic
        default: true
        periodic:
          timeout: 30s
          retryDelay: 500ms
kind: ConfigMap
metadata:
  creationTimestamp: "2023-07-25T12:47:42Z"
  labels:
    app.kubernetes.io/instance: tracetest
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: tracetest
    app.kubernetes.io/version: v0.13.0
    helm.sh/chart: tracetest-0.2.69
  name: tracetest
  namespace: tracetest
  resourceVersion: "99424174"
  uid: 35cdcd73-bf2e-4aa6-88a8-69c7f5408432

Test execution:

$:> tracetest test run -d ./list-tests.yaml
Command "run" is deprecated, Please use tracetest run test command instead.
✔ List all tracetest tests (http://localhost:11633/test/e9c6cff9-974d-4263-8a23-22f1e9f975aa/run/3/test)

Logs

sending event "Get Version" (test)
event sent "Get Version" (test)
2023/07/26 13:40:40 GET /api/version GetVersion 2.374981ms
sending event "Environments.List" (test)
event sent "Environments.List" (test)
sending event "Tests.Upsert" (test)
event sent "Tests.Upsert" (test)
sending event "Run Test" (test)
event sent "Run Test" (test)
persistentRunner job. ID 3, testID e9c6cff9-974d-4263-8a23-22f1e9f975aa, TraceID 69b7012ce92ee34c905a44045d26f9b4, SpanID 016461704990e02f
2023/07/26 13:40:40 POST /api/tests/e9c6cff9-974d-4263-8a23-22f1e9f975aa/run RunTest 23.357878ms
sending event "Tests.List" (test)
event sent "Tests.List" (test)
2023/07/26 13:40:41 [TracePoller] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Poll
2023/07/26 13:40:41 [TracePoller] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Received job
tracePoller processJob 1
2023/07/26 13:40:41 [PollerExecutor] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: ExecuteRequest
2023/07/26 13:40:41 [PollerExecutor] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Done polling. (TraceDB is not retryable)
2023/07/26 13:40:41 [PollerExecutor] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Start Sorting
[PollerExecutor] Completed polling process for Test Run 3 after 2 iterations, number of spans collected: 1
2023/07/26 13:40:41 [PollerExecutor] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Sorting complete
2023/07/26 13:40:41 [PollerExecutor] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Start updating
2023/07/26 13:40:41 [TracePoller] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Done polling (reason: ). Completed polling after 2 iterations, number of spans collected 1
2023/07/26 13:40:41 [linterRunner] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Starting
2023/07/26 13:40:41 [linterRunner] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: update channel start
2023/07/26 13:40:41 [linterRunner] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: update channel complete
2023/07/26 13:40:41 [AssertionRunner] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Starting
sending event "test_run_finished" (successful)
event sent "test_run_finished" (successful)
2023/07/26 13:40:41 [AssertionRunner] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: Success. pass: 0, fail: 0
2023/07/26 13:40:41 [AssertionRunner] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: update channel start
2023/07/26 13:40:41 [AssertionRunner] Test e9c6cff9-974d-4263-8a23-22f1e9f975aa Run 3: update channel complete
sending event "Tests.Get" (test)
event sent "Tests.Get" (test)
2023/07/26 13:41:02 could not read message: websocket: close 1001 (going away)
sending event "DataStores.Get" (test)
event sent "DataStores.Get" (test)
sending event "Environments.List" (test)
event sent "Environments.List" (test)
sending event "TestRunners.Get" (test)
event sent "TestRunners.Get" (test)
sending event "Configs.Get" (test)
event sent "Configs.Get" (test)
sending event "Analyzers.Get" (test)
event sent "Analyzers.Get" (test)
sending event "PollingProfiles.Get" (test)
event sent "PollingProfiles.Get" (test)
sending event "Demos.List" (test)
event sent "Demos.List" (test)
danielbdias commented 1 year ago

Thanks for opening this issue @Connect2naga!

We'll investigate what happened and let you know as soon as we have an answer.

danielbdias commented 1 year ago

hi @Connect2naga! How are you?

Question: Have you installed Tracetest on your cluster directly using our CLI, or have you used our Helm chart directly to do the installation?

If you installed it using the Helm chart and set the provisioning field, it is possible that Tracetest is not provisioning it correctly, because there is a | that Tracetest can interpret as part of the YAML and then fail to parse it. (I added an issue to address that.)

The second thing is that the Signoz integration uses an OTLP configuration, which means we need to configure an OpenTelemetry Collector to send the data to both Signoz and Tracetest for it to work (like we do in our Docker example, defining a configuration for the collector here and defining the collector container here).
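For reference, a minimal sketch of such a collector configuration, assuming the Tracetest server service lives in the tracetest namespace and the Signoz collector in the observability namespace (adjust the names to your cluster), could look like this:

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  # Assumption: Tracetest server service in the "tracetest" namespace
  otlp/tracetest:
    endpoint: tracetest.tracetest.svc.cluster.local:4317
    tls:
      insecure: true
  # Assumption: Signoz collector service in the "observability" namespace
  otlp/signoz:
    endpoint: signoz-otel-collector.observability.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tracetest, otlp/signoz]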

To work around the provisioning issue for now, you can set up a YAML data store file like this datastore.yaml:

type: DataStore
spec:
  id: current
  name: Signoz
  type: signoz

And then run: tracetest apply datastore -f datastore.yaml
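If the CLI runs outside the cluster, one way to do this (a sketch, assuming the Tracetest service is named tracetest in the tracetest namespace and listens on port 11633) is to port-forward the server and point the CLI at it first:

# Port-forward the Tracetest server (assumption: service "tracetest", namespace "tracetest")
kubectl port-forward svc/tracetest 11633:11633 -n tracetest

# Point the CLI at the forwarded endpoint (flag name may differ between CLI versions)
tracetest configure --endpoint http://localhost:11633

# Apply the data store definition from the issue above
tracetest apply datastore -f datastore.yaml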

danielbdias commented 1 year ago

A critical detail of the Signoz integration is that, today, we need a rule on the OpenTelemetry Collector to redirect traces to both Tracetest and Signoz, using a structure similar to the one we use in the Docker example.

In the future, we plan to support direct integration with Signoz, allowing users to send telemetry directly to Signoz.

The following link has a snippet that creates a test Kubernetes cluster with k3d and sets up Signoz and Tracetest in the cluster: https://gist.github.com/danielbdias/daa7b92fb7a4701fd5d10690bc705d80

In this setup, if your app sends data to otel-collector.tracetest.svc.cluster.local:4317, Tracetest can capture its traces and evaluate them.
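For an app instrumented with the OpenTelemetry SDK, pointing it at that collector is usually just an environment variable. A sketch of a Deployment container fragment (the container name and image are illustrative):

containers:
  - name: my-app            # illustrative name
    image: my-app:latest    # illustrative image
    env:
      # Standard OTel SDK variable; the SDK exports OTLP data to this endpoint
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: "http://otel-collector.tracetest.svc.cluster.local:4317"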

If you have any questions, feel free to message us here or on the Discord channel, and we will be glad to help you.

Connect2naga commented 1 year ago

HI @danielbdias ,

Thanks for the information and example with K3s.

Question: Have you installed Tracetest on your cluster directly using our CLI, or have you used our Helm chart directly to do the installation?
Answer: We installed it using the Helm chart.

We have a few more questions:

  1. If both Signoz and Tracetest use the same collector, will the data be pushed to both systems, or will only one read it?
  2. If the data is read from OTLP, then why are we providing Signoz as the data source to Tracetest?
danielbdias commented 1 year ago

hi @Connect2naga !

Thanks for your answers! Here are my answers too:

  1. If both Signoz and Tracetest use the same collector, will the data be pushed to both systems, or will only one read it?

With this OTel Collector in front of Signoz, both systems will receive and read trace data. Tracetest will receive the data but process only the traces related to the current tests, and Signoz will store the traces for visualization.

  2. If the data is read from OTLP, then why are we providing Signoz as the data source to Tracetest?

We do that because some data stores don't have a direct way to fetch traces, so we use this OTel configuration to allow users to use Tracetest. In the case of Signoz, there is no documented method to fetch traces directly yet, but we are in touch with them to improve this integration in https://github.com/SigNoz/signoz/issues/3231 and on their Slack.

Connect2naga commented 1 year ago

Hi @danielbdias,

Thanks for the K3s setup. I was able to set it up and execute the test.

The test execution failed with the error below:

 connect2naga@connect2naga  ~/tracetest  tracetest run test --file samplelogin.yaml --output pretty
✘ samplelogin ()
        Reason: timed out waiting for traces after 1m

Test file, which was created from the UI:

 ✘ connect2naga@connect2naga  ~/tracetest  cat samplelogin.yaml
type: Test
spec:
  id: eQUp8u34R
  name: samplelogin
  trigger:
    type: http
    httpRequest:
      method: POST
      url: http://sample-http-server.sample.svc.cluster.local:8080/api/v1/login
      body: "{\n    \"username\":\"mailadmin@csvijay.in\",\n    \"password\":\"admin\"\n}"
      headers:
      - key: Content-Type
        value: application/json

 connect2naga@connect2naga  ~/tracetest 

I am also able to see the traces in Signoz.

Tracer configuration (same as suggested, a common collector used for both Signoz and Tracetest):

 connect2naga@connect2naga  ~/tracetest  kubectl get cm -n tracetestdemo collector-config -o yaml
apiVersion: v1
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:

    processors:
      batch:
        timeout: 100ms

      # Data sources: traces
      probabilistic_sampler:
        hash_seed: 22
        sampling_percentage: 100

    exporters:
      # Output logger, used to check OTel Collector sanity
      logging:
        loglevel: debug

      # OTLP for Tracetest
      otlp/tracetest:
        #endpoint: tracetest.tracetestdemo.svc.cluster.local:4317
        endpoint: signoz-otel-collector.observability.svc.cluster.local:4317
        tls:
          insecure: true
      # OTLP for Signoz
      otlp/signoz:
        endpoint: signoz-otel-collector.observability.svc.cluster.local:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [probabilistic_sampler, batch]
          exporters: [otlp/signoz, otlp/tracetest, logging]
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"collector.yaml":"receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\n\nprocessors:\n  batch:\n    timeout: 100ms\n\n  # Data sources: traces\n  probabilistic_sampler:\n    hash_seed: 22\n    sampling_percentage: 100\n\nexporters:\n  # Output logger, used to check OTel Collector sanity\n  logging:\n    loglevel: debug\n\n  # OTLP for Tracetest\n  otlp/tracetest:\n    endpoint: tracetest.tracetestdemo.svc.cluster.local:4317\n    tls:\n      insecure: true\n  # OTLP for Signoz\n  otlp/signoz:\n    endpoint: signoz-otel-collector.observability.svc.cluster.local:4317\n    tls:\n      insecure: true\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n      processors: [probabilistic_sampler, batch]\n      exporters: [otlp/signoz, otlp/tracetest, logging]\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"collector-config","namespace":"tracetestdemo"}}
  creationTimestamp: "2023-08-01T10:06:17Z"
  name: collector-config
  namespace: tracetestdemo
  resourceVersion: "102917356"
  uid: a6928ede-7fb1-48f5-ac62-bad9d20c1ac3
 connect2naga@connect2naga  ~/tracetest 

Please help here; I get the same problem in the UI as well as the CLI.

The application is using signoz-otel-collector.observability.svc.cluster.local:4317 as the collector.

I also tried the same configuration provided in the k3s example; same issue, no change.

danielbdias commented 1 year ago

hi @Connect2naga ! how are you?

Looking into your collector.yaml, I noticed that the otlp/tracetest exporter is sending data to Signoz instead of Tracetest. If you change it to point to Tracetest, it should work, like this:

exporters:
  #..

  otlp/tracetest:
    endpoint: {{your-tracetest-url}}:4317
    tls:
      insecure: true

After changing the collector config, you must restart its deployment on k8s.
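For example, assuming the collector runs as a Deployment named otel-collector in the tracetestdemo namespace (adjust the names to your setup), something like:

kubectl rollout restart deployment/otel-collector -n tracetestdemo
kubectl rollout status deployment/otel-collector -n tracetestdemo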

Also, if you want, we can schedule a meeting to help you with this setup and to get feedback on how we can improve this integration.

Connect2naga commented 1 year ago

Hi @danielbdias, I have tried the same before, with the same issue (that is the reason I commented out the collector config). In that case we can't see traces in Signoz, and we get the same timeout failure.

My understanding of the Signoz integration with Tracetest is that either both should use the same collector, or Tracetest should read the data from Signoz.

danielbdias commented 1 year ago

hi @Connect2naga,

Yes, in your case your app should use the OTel Collector that has both the otlp/tracetest and otlp/signoz exporters configured.

Is your app configured to send telemetry directly to signoz-otel-collector.observability.svc.cluster.local:4317? If so, pointing it to this OTel Collector instead should let you see the traces.

Another thing that can help: since you have the logging exporter configured here:

exporters: [otlp/signoz, otlp/tracetest, logging]

you should be able to see, in that collector pod's logs, any trace span that arrives at the OTel Collector. Do you know if that pod is logging any data?
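One way to check (again assuming the collector Deployment is named otel-collector in the tracetestdemo namespace) is to tail its logs and look for the spans printed by the logging exporter:

kubectl logs deployment/otel-collector -n tracetestdemo --tail=200 | grep -i span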

Connect2naga commented 1 year ago

HI,

[danielbdias]: Yes, in your case your app should use the OTel Collector that has both the otlp/tracetest and otlp/signoz exporters configured.

In my applications, do we need to configure both "signoz-otel-collector.observability.svc.cluster.local:4317" and "tracetest.tracetestdemo.svc.cluster.local:4317"?

[danielbdias]: Is your app configured to send telemetry directly to signoz-otel-collector.observability.svc.cluster.local:4317? If so, pointing it to this OTel Collector instead should let you see the traces.

Yes, I am using signoz-otel-collector.observability.svc.cluster.local:4317, so this collector is configured.


type ServiceConfigurations struct {
    LogLevel      string `envconfig:"LOG_LEVEL" default:"info"`
    Port          string `envconfig:"PORT" default:"8088"`
    OptlCollector string `envconfig:"TRACE_COLLECTOR" default:"signoz-otel-collector.observability.svc.cluster.local:4317"`

    HeaderReadTimeout int
}


I am also able to see the trace logs (attached in the thread above).

Below is the setup we are trying to achieve:

  1. Configure signoz-otel-collector.observability.svc.cluster.local:4317 as the collector in the applications.
  2. Execute a REST API test using Tracetest.
  3. The application executes the API request and pushes the data to the collector; no error messages from the applications while pushing.
  4. Signoz should read the traces and spans and show them: we are able to see the traces and spans.
  5. Tracetest should complete the test and pass: the execution completes with the API response, but the test fails.

Could you please help with what needs to be done for the above scenario?

danielbdias commented 1 year ago

hi @Connect2naga !

Thanks for this info. It helped me a lot to understand what was happening.

What is happening today is that you are connecting to the OTel Collector inside the Signoz infrastructure instead of the OTel Collector outside of it.

In the Gist that I published, there are two Collectors:

As the OTel Collector in Signoz has rules that are very specific to Signoz, in our demo we decided not to modify it, to avoid problems if the Signoz team decides to change something in future updates.

With that in mind, we defined another OTel Collector with specific exporters for Tracetest (or even for other things; you can see it here). So the architecture for the telemetry is something like this:

flowchart LR
    Applications["My Applications"]
    Tracetest
    OTelCollector["OTel Collector"]
    SignozOTelCollector["Signoz internal OTel Collector"]
    SignozClickHouse["ClickHouse"]
    SignozQueryService["Query Service"]
    SignozFrontend["Frontend"]
    SignozAlertManager["Alert Manager"]

    Applications --> OTelCollector
    OTelCollector --> Tracetest
    OTelCollector --> SignozOTelCollector

    subgraph Signoz
        SignozOTelCollector --> SignozClickHouse
        SignozClickHouse --> SignozQueryService
        SignozQueryService --> SignozAlertManager
        SignozAlertManager --> SignozQueryService
        SignozQueryService --> SignozFrontend
        SignozFrontend --> SignozQueryService
    end

In your case, you are almost there on the architecture: if you change your ServiceConfigurations to connect to tracetest.tracetestdemo.svc.cluster.local:4317 and set up the first OTel Collector with the exporters mentioned here, it should be sufficient.
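As a sketch, since the service reads TRACE_COLLECTOR (per the Go struct above), that change could be an environment variable on the app's Deployment; the value below is the endpoint suggested in this comment, so adjust it to whichever collector service you end up using:

env:
  # Read by the app's envconfig-based ServiceConfigurations struct
  - name: TRACE_COLLECTOR
    value: "tracetest.tracetestdemo.svc.cluster.local:4317"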

Feel free to ask any questions about this issue! If you want, we can schedule a call to help you. I'm also on the CNCF Slack (https://communityinviter.com/apps/cloud-native/cncf) as "Daniel Dias" if you need to talk.

Connect2naga commented 1 year ago

Hi @danielbdias,

Thanks for the reply. So if we use a generic OTel or Tracetest-specific collector, then Tracetest will be able to fetch the trace data.

We need to read this trace and push it to the Signoz-specific collector; do we have any reference points for that?

Thanks, Connect2naga.