The repro script doesn't work for me:
```
$ bash run.sh
Creating network "nil-ptr_default" with the default driver
Creating nil-ptr_es01_1 ... done
Creating nil-ptr_collector01_1 ... done
ERROR: for ingestor Container "40464ea0d426" is unhealthy.
ERROR: for query01 Container "40464ea0d426" is unhealthy.
ERROR: Encountered errors while bringing up the project.
```
(This is after I removed unnecessary copies of ES and Query.)
It's also missing Kafka.
What is producing the actual traces in your setup?
> What is producing the actual traces in your setup?
I have a program that produces spans and concatenates them with a trace ID based on our business logic; then I send the spans to the jaeger-collector over HTTP.
At first I used jaeger-all-in-one with in-memory storage, but the container restarted too often; obviously, it can't be used in production.
So I want to use an ES cluster to persist the data.
> It's also missing Kafka.

I use an AWS Kafka cluster, so it's not in the docker-compose file.
I just checked the code; can we do this?
```go
func (fd FromDomain) convertSpanEmbedProcess(span *model.Span) *Span {
	s := fd.convertSpanInternal(span)
	// Guard against spans that arrive without a Process, so convertProcess
	// is never called with nil.
	if span.Process == nil {
		return nil
	}
	s.Process = fd.convertProcess(span.Process)
	s.References = fd.convertReferences(span)
	return &s
}
```
I am willing to help look into the issue. How can I start jaeger-ingester in a local env so that I can debug?
We already have a fix being worked on in #3819.
I have a program to produce spans & concat them with a trace ID based on our business logic, then I send the spans to jaeger-collector by http.
Can you share this program's code? I am still not clear how the nil pointer gets into the data received by the ingester, because the same sanitizer being added in #3819 already exists in the collector. In other words, it should not be possible to have span data in Kafka with Process=nil.
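Conceptually, such a sanitizer backfills a placeholder Process instead of dropping the span, so downstream converters never dereference nil. A rough sketch of the idea (hypothetical; not the actual collector or #3819 code, and the placeholder service name is made up):

```go
// Hypothetical sketch of a Process sanitizer; not the actual Jaeger code.
func sanitizeProcess(span *model.Span) *model.Span {
	if span.Process == nil {
		// Backfill a placeholder so downstream converters (e.g. the ES
		// writer) never hit a nil Process. The name is an arbitrary
		// choice for this sketch.
		span.Process = &model.Process{ServiceName: "unknown-service"}
	}
	return span
}
```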
There are two programs. The first one produces JSON like this and sends it to Kafka:
```json
{
  "header": {
    "k0": "v0",
    "k1": "v1"
  },
  "id": "512506985abb01b6",
  "parentId": "2a5b6961c59e731f",
  "name": "spanName_foo",
  "start": 1658247482737891800,
  "end": 1658247482737892900,
  "metadata": {
    "k0": "v0",
    "k1": "v1"
  }
}
```
The second one converts them to OTel-format spans and concatenates them based on business rules.
Here's the sample JSON output; all of our production spans look like this: https://gist.github.com/huahuayu/cd3ad1ddf076b2892a6c3c1c68a9ca34
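For illustration, the first program's payload maps onto a Go struct along these lines (a sketch only; the field types are inferred from the sample above, not taken from the actual converter code):

```go
// Hypothetical struct inferred from the sample JSON; not the real converter's type.
type customSpan struct {
	Header   map[string]string `json:"header"`
	ID       string            `json:"id"`       // span ID, hex-encoded
	ParentID string            `json:"parentId"` // parent span ID, hex-encoded
	Name     string            `json:"name"`
	Start    int64             `json:"start"` // Unix nanoseconds
	End      int64             `json:"end"`   // Unix nanoseconds
	Metadata map[string]string `json:"metadata"`
}
```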
> The second one converts them to OTel-format spans and concatenates them based on business rules.
But how does this get into Jaeger? Do you then take the OTel spans and submit them to the Jaeger Collector's OTLP endpoint?
Yes, you are right; I send it to the otel-collector by HTTP.
Please check my sample link again. I provided more data, in exactly the same format as production (I send spans in batches), and you can do this:
```sh
curl --location --request POST 'http://127.0.0.1:4318/v1/traces' \
  --header 'Content-Type: application/json' \
  --data-raw '<sample.json>'
```
My data should be fine; I have confidence in that. If the data constraint breaks, maybe the problem is introduced in another Jaeger component.
This is what I am getting from your description:
```mermaid
flowchart LR
    A[App] --> |Custom\nJSON| Kafka
    Kafka --> C[Custom\nConverter]
    C --> |OTLP JSON| JC[Jaeger\nCollector]
    JC --> |??| Kafka2
    Kafka2 --> JI[Jaeger\nIngester]
```
Btw, when I run your curl command against jaeger-all-in-one with OTLP enabled, it fails:
```sh
$ curl --location --request POST 'http://127.0.0.1:4318/v1/traces' \
    --header 'Content-Type: application/json' \
    --data-raw ~/Downloads/otel-trace.json
{"code":3,"message":"invalid character '/' looking for beginning of value"}
```
Hey @yurishkuro, thanks for the update on the PR and the review!
We have the same setup as @huahuayu, and it only happened after we upgraded Jaeger, Elasticsearch, and Kafka.
I believe it went wrong at the Jaeger Collector to Kafka step, or during the upgrade.
If someone is looking for a workaround: purge the messages from the topic on Kafka (note this drops all buffered spans, not just the corrupted ones):
```sh
kubectl exec -it -n <jaeger-namespace> jaeger-kafka-0 -- sh
cd /opt/bitnami/kafka/bin
./kafka-configs.sh --bootstrap-server jaeger-kafka:9092 --topic jaeger-spans --alter --add-config retention.ms=1000
./kafka-configs.sh --bootstrap-server jaeger-kafka:9092 --topic jaeger-spans --alter --delete-config retention.ms
```
So all the fixes we have so far are defensive. We still don't know how messages exiting collector->Kafka or entering Kafka->ingester might end up with a nil Process. If it's due to data corruption in Kafka, it seems very peculiar that it's always the nil Process that people experience (although it's very possible that, because we never checked for nil there, it's just the most obvious symptom). I just merged the fix #3578 that should prevent panics; maybe then someone can investigate what the spans look like when stored, i.e. maybe some other parts of the span are corrupted.
> I believe it went wrong at the Jaeger Collector to Kafka step, or during the upgrade.
@locmai did you notice any unusual logs in the collector during the Kafka upgrade? Other than the Kafka upgrade corrupting data already stored (possible, I guess), Collector->Kafka is indeed a likely place, possibly due to the type of driver we're using.
I don't think we've seen anything unusual in the Kafka logs during the upgrade. Let me double-check on it tomorrow.
We did the upgrade for 10 environments and 3/10 had the issue, so yes, very likely due to the Kafka upgrade.
@yurishkuro the process flow is right. The data has no problem; your curl should use plain JSON text (e.g. `--data-raw '{"key":"v1"}'`), not the JSON file path, and then there will be no error.
@locmai I didn't upgrade any components; I've been using Jaeger 1.35 the whole time.
This works fine with all-in-one:
```sh
$ curl --location --request POST 'http://127.0.0.1:4318/v1/traces' \
    --header 'Content-Type: application/json' \
    --data-binary @~/Downloads/otel-trace.json
```
@huahuayu are you saying that sending the same data payload through the collector->kafka->ingester pipeline reliably reproduces the panic for you? We need an easier way to set up this test; currently it's only automated via a GH action.
@huahuayu also, which encoding of Jaeger spans are you using in Kafka?
The sample JSON I gave cannot reproduce the panic; what I want to say is that the original span data has no problem.
> which encoding of Jaeger spans are you using in Kafka?
I don't know. It's not JSON, but it doesn't look like protobuf to me either; I think it was created automatically by Jaeger. The topic name is jaeger-spans, and a message looks like this:
```
����7N:s��ߑ�&��H[�Y^3calc"
����7N:s��ߑ�&��Y�ۂ�%�2�䄗ך�:��B
tag Bar
type Foo
chainBSCBI
keyB0x7002b9f32644e4133858cf858e028bf63eef60b2deeb2d6fa6cb4d957f7c2aceB
node192.168.1.36B
sourceAppB
span.kindclientB
internal.span.formatprotoR%
Foo
service.namespaceApp
```
Jaeger only supports JSON and Protobuf, so it has to be the latter.
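For anyone who wants to inspect what is actually sitting in the topic, here is a minimal sketch that decodes one dumped message value as a protobuf model.Span and checks its Process (assuming Jaeger's protobuf encoding and the gogo-generated model; the msg.bin path is just a placeholder for a raw message dumped from Kafka):

```go
// Sketch: decode a raw Kafka message value (dumped to a file) as a
// Jaeger model.Span and check whether Process is nil.
package main

import (
	"fmt"
	"os"

	"github.com/gogo/protobuf/proto"
	"github.com/jaegertracing/jaeger/model"
)

func main() {
	payload, err := os.ReadFile("msg.bin") // placeholder path to a dumped message
	if err != nil {
		panic(err)
	}
	span := &model.Span{}
	if err := proto.Unmarshal(payload, span); err != nil {
		fmt.Println("not a valid protobuf model.Span:", err)
		return
	}
	fmt.Printf("traceID=%v spanID=%v operation=%q\n", span.TraceID, span.SpanID, span.OperationName)
	if span.Process == nil {
		// This is the condition that crashed the ingester before the fix.
		fmt.Println("Process is nil!")
	}
}
```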
I just pushed a docker-compose config for running the collector->Kafka->ingester pipeline (7006e9fe50c8467ad6b84f2072a3cf136bfbe4ec) and tested it with the trace from https://gist.github.com/huahuayu/cd3ad1ddf076b2892a6c3c1c68a9ca34:
```sh
$ curl -v --location --request POST 'http://127.0.0.1:4318/v1/traces' \
    --header 'Content-Type: application/json' \
    --data-binary @otel-trace.json
```
As I expected, the trace was stored without issues.
So the issue is somewhere in the interaction between Jaeger and Kafka. Since the messages appear to be corrupted when stored in Kafka, my first suspicion is that it happens in the producer (the Jaeger collector) or in the driver during Kafka broker maintenance. Unfortunately, I don't have much to go on without the ability to reproduce the issue.
This could be fixed in v2, but we cannot reproduce it, so I'm closing this, but also adding the tag.
What happened?
Start the Jaeger containers with ES as the data storage; the ingester container keeps going down with a nil pointer error.
Versions 1.35 & 1.36 have the same issue.
Steps to reproduce
Step 0: Run Jaeger with ES via containers.
Step 1: At first it works fine and I can see data in jaeger-ui, but then the ingester container keeps going down. Error log:
Expected behavior
Everything works.
Relevant log output
No response
Screenshot
No response
Additional context
No response
Jaeger backend version
No response
SDK
No response
Pipeline
No response
Storage backend
No response
Operating system
No response
Deployment model
No response
Deployment configs
No response