knative-extensions / eventing-kafka-broker

Alternate Kafka Broker implementation.
Apache License 2.0
185 stars 117 forks source link

Trace ids are mixed up between unrelated events #3913

Open mauricecamphadrian opened 6 months ago

mauricecamphadrian commented 6 months ago

Describe the bug Trace ids appear to be mixed up by some knative component. We roughly have the following setup:

flowchart LR
AppA("App A")
AppB("App B")
Broker("KNative Kafka Broker")
subgraph Sequence
Channel("KNative Kafka Channel")
end

AppA --> Broker
Broker --> Trigger
Trigger --> Channel
Channel --> AppB

For each event sent by AppA a new root trace is started with a unique trace id. This is reflected in the traceparent headers of the events we see in the backing topic for the Kafka broker. However the trace ids we receive in AppB get mixed up according to our observability platform, many spans are being grouped together under a few trace ids. This is also reflected in the backing topic of the Kafka channel, which leads me to believe that some KNative component is causing the issue.

Expected behavior We expect each event to retain its original trace id as it propagates through Knative components.

To Reproduce I have managed to create a minimal example to reproduce this:

apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  annotations:
    eventing.knative.dev/broker.class: Kafka
  name: broker
  namespace: default
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config
    namespace: knative-eventing
---
apiVersion: flows.knative.dev/v1
kind: Sequence
metadata:
  name: sequence
  namespace: default
spec:
  channelTemplate:
    apiVersion: messaging.knative.dev/v1beta1
    kind: KafkaChannel
    spec:
      numPartitions: 10
      replicationFactor: 1
  steps:
    - ref:
        apiVersion: serving.knative.dev/v1
        kind: Service
        name: app
      uri: /rejectOnce
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: app
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: dev.local/printer:1.0.0 # This is a local test image I created that just prints out the event id and trace parent header as they come in.
          imagePullPolicy: Never
      containerConcurrency: 1 # Set container concurrency to 1 to rule out that the mixup is occuring within the service.
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: sequence-trigger
  namespace: default
spec:
  broker: broker
  subscriber:
    ref:
      apiVersion: flows.knative.dev/v1
      kind: Sequence
      name: sequence
  1. Create a local kind environment using the development guide and apply the above config (after also installing knative-serving).
  2. Send 20 events with random EventId and random traceparent headers to the knative-eventing broker ingres (http://kafka-broker-ingress.knative-eventing.svc.cluster.local/default/broker)
  3. Compare the event ids + traceparent header that were sent with the values received by the knative service:
event id 01f3241d-b852-45d5-b993-d504849b9291, sent traceparent 00-a76aa4ff977e7ea935275cd5d9e29676-8ad811a32b3fa738-01, received traceparent 00-a76aa4ff977e7ea935275cd5d9e29676-25e4be4a7b2fb3ea-01
event id 03b11df6-f503-441c-a33c-5f2f03f07d3d, sent traceparent 00-7f59a9b0c19f17cf45fe669b1ece4137-66e673d96fdf4ac3-01, received traceparent 00-7f59a9b0c19f17cf45fe669b1ece4137-bd3dcbfcf514ad12-01
event id 0db02aed-f9c2-41e6-a176-2afdf62d5c86, sent traceparent 00-ed09636d307e13c17a3b58afc0b4d175-25132c6425ef1ae2-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-77aff79ac84d4bcf-01
event id 2e6e21d5-0e68-4eec-9a3b-0451a5e7e92c, sent traceparent 00-7dc9619f8aa46e74ef6f75ef902b9a8d-b272a264c55d992e-01, received traceparent 00-7dc9619f8aa46e74ef6f75ef902b9a8d-b1adee7e54d59269-01
event id 38242069-115e-4060-a2a1-19621baeeb7f, sent traceparent 00-7f022f328595f47710d01382e225d1b1-016224cc898d4577-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-2e4e015e7ff57262-01
event id 39409e02-40e5-460f-a5a3-e6c07f6fd56c, sent traceparent 00-93cc8642d4e78bf143b1c5316c767ffd-4aaa4c2fca829bf4-01, received traceparent 00-bb8c75c08b257670cac3cb2b4b36f686-39724274340c1454-01
event id 44234477-d8f4-4716-a45e-8b9013bb861a, sent traceparent 00-d9206db7c69bc3192b90a7f9da070521-2d42ec9a043603af-01, received traceparent 00-d9206db7c69bc3192b90a7f9da070521-7e80a43776e2f1af-01
event id 4471e335-7703-4919-b8d3-470afda5988b, sent traceparent 00-821de555043da05edb4bc8c0db1bb300-085552f9cdd020a8-01, received traceparent 00-821de555043da05edb4bc8c0db1bb300-61af9c76a499633b-01
event id 725cd1e2-4ff5-47c3-a759-35d1cc9cd176, sent traceparent 00-269a4069c490fd03ab550dc10004947c-7dca840f73a26f2f-01, received traceparent 00-269a4069c490fd03ab550dc10004947c-31f7d43db3d63e97-01
event id 76b47330-10d3-410f-b7dc-cb2669cd2b84, sent traceparent 00-1204504a7b59e014605a9a9ee67cb63a-78567095817116e4-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-0f478741092b5efb-01
event id 971221f3-77c0-4934-9cbf-b66ba0daeb56, sent traceparent 00-2de4ece65d315d962067ece13d18e0ae-0940dfa1d118ca34-01, received traceparent 00-bb8c75c08b257670cac3cb2b4b36f686-5902ac017f7a2508-01
event id 9c701ed1-2924-4d53-8ed8-6975b48786b1, sent traceparent 00-21c076235d8645fce69c8d784caffd84-35e3222d0ed3a358-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-baf596162236d5f4-01
event id bfd26072-28aa-4380-b724-4022578331a8, sent traceparent 00-65cc555834b1addee9676e0bbca211b3-92a1aca18384b01a-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-c314f1fbf6164764-01
event id c0d2ad5a-ed08-4585-9a7c-9efb110e6d5f, sent traceparent 00-b48e0b56763f4f2ea39f1e83d5292cbc-ed8e6e04c72c52fd-01, received traceparent 00-b48e0b56763f4f2ea39f1e83d5292cbc-e19f57ed19ace1d0-01
event id c14378af-f06a-46f8-a427-cba94b0c1ab3, sent traceparent 00-8644e43ccfa6e28f7984c21ab4517531-cb690ea8acaaba62-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-9e6fa2a8daf646d3-01
event id c6214d3e-88ef-4734-82e7-941b7609aba2, sent traceparent 00-70274871b312a40dc4cee26127955c18-5f26f15ebe464ff9-01, received traceparent 00-70274871b312a40dc4cee26127955c18-70e09b8cb3914b2c-01
event id e37879a9-7a47-407e-9208-3968af6ad2b0, sent traceparent 00-622500832c6ee74a3aed65d7d731c6aa-1f27220a9b643bb9-01, received traceparent 00-622500832c6ee74a3aed65d7d731c6aa-42ddc08f45827ffb-01
event id e8a1870c-8173-46bc-b28f-223c461b9b7c, sent traceparent 00-48be4d81478e6ab1a5a749fdbf49e974-614779bea1d0ee8c-01, received traceparent 00-48be4d81478e6ab1a5a749fdbf49e974-9912d19b8b2e6c6a-01
event id fb3c262d-dc3c-44a8-b594-7eed168468f9, sent traceparent 00-d28ba0ec48164e35b0454b20c1d7676b-bd941a4aef610d08-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-295fae61f85330ea-01
event id fd734c35-c403-4b97-8b5b-4cf3f870c98d, sent traceparent 00-bb8c75c08b257670cac3cb2b4b36f686-94e242c3606ceb74-01, received traceparent 00-bb8c75c08b257670cac3cb2b4b36f686-f5a274f3909032e9-01

As you can see from my results there are several events that were received with trace id 1204504a7b59e014605a9a9ee67cb63a but only 1 was sent with this trace id.

Knative release version 1.14.1 Reproduction was done using commit 28b46159cf8fe6877756459198c0a8f49c2d6626

Cali0707 commented 6 months ago

cc @pierDipi

pierDipi commented 5 months ago

/assign

pierDipi commented 5 months ago

Send 20 events with random EventId and random traceparent headers to the knative-eventing broker ingres (http://kafka-broker-ingress.knative-eventing.svc.cluster.local/default/broker)

@mauricecamphadrian do you have a script I can run to do that or would you mind providing one please?

mauricecamphadrian commented 5 months ago

I used a small dotnet tool to generate the events since our producing service is also a dotnet service:

Program.cs:

using System.Collections.Concurrent;
using CloudNative.CloudEvents;
using CloudNative.CloudEvents.Http;
using CloudNative.CloudEvents.SystemTextJson;

const string RequestURI = "http://localhost:8080/default/broker";
const int WorkerParallism = 10;
const int RequestsPerWorker = 2;

var bag = new ConcurrentBag<KeyValuePair<Guid, string>>();

await Parallel.ForAsync(0, WorkerParallism, async (i, _) =>
{
    var client = new HttpClient();
    for (var x = 0; x < RequestsPerWorker; x++)
    {
        await MakeRequest(i, client);
    }
});

foreach (var (k, v) in bag)
{
    Console.WriteLine($"Sent event with id {k} and traceparent {v}");
}

async Task MakeRequest(int workerIdx, HttpClient client)
{
    var eventId = Guid.NewGuid();
    var cloudEvent = new CloudEvent
    {
        Data = new(),
        DataContentType = "application/json",
        Id = eventId.ToString(),
        Source = new Uri("tester", UriKind.Relative),
        Time = DateTimeOffset.UtcNow,
        Type = "test.event"
    };

    var traceparent = GenerateRandomTraceParentHeader(out var traceId, out var parentId);

    var content = cloudEvent.ToHttpContent(ContentMode.Binary, new JsonEventFormatter());
    content.Headers.Add("traceparent", traceparent);
    content.Headers.Add("ce-partitionkey", $"{workerIdx}");
    var response = await client.PostAsync(RequestURI, content);

    if (!response.IsSuccessStatusCode)
    {
        Console.WriteLine("[WRN] Unexpected status code received");
    }

    bag.Add(KeyValuePair.Create(eventId, traceparent));
}

static string GenerateRandomTraceParentHeader(out string traceId, out string parentId)
{
    var traceBytes = new byte[16];
    Random.Shared.NextBytes(traceBytes);
    traceId = Convert.ToHexString(traceBytes).ToLowerInvariant();

    var parentBytes = new byte[8];
    Random.Shared.NextBytes(parentBytes);
    parentId = Convert.ToHexString(parentBytes).ToLowerInvariant();

    return $"00-{traceId}-{parentId}-01";
}

Project file:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="CloudNative.CloudEvents" Version="2.7.1" />
    <PackageReference Include="CloudNative.CloudEvents.SystemTextJson" Version="2.7.1" />
    <PackageReference Include="Microsoft.AspNetCore.Mvc" Version="2.2.0" />
  </ItemGroup>

</Project>

I spent some more time trying to find where this seems to be going wrong and I think it happens in the event sending logic in WebClientCloudEventSender.send():

I tried to dive deeper but I'm too unfamiliar with java and Vert.x.

mauricecamphadrian commented 4 months ago

Any update on this?