superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug] OpenTelemetry Trace hierarchy broken #3117

Open genofire opened 2 months ago

genofire commented 2 months ago

Describe the bug with a clear and concise description of what the bug is.

GoToSocial uses the "X-Request-Id" header for the trace ID.

But OpenTelemetry and W3C use the "Traceparent" header. For documentation see here:

the content is: <version>-<trace-id>-<parent-span-id>-<trace-flag>

where the trace ID has to be extracted from the header value.
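
To make the format above concrete, here is a minimal sketch in Go of splitting a version 00 header value into its four fields. The type and function names (`traceparent`, `parseTraceparent`, `isLowerHex`) are made up for illustration and are not GoToSocial or opentelemetry-go APIs; the validation rules follow my reading of the W3C Trace Context format (lowercase hex, fixed field lengths, all-zero IDs rejected).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// traceparent holds the four dash-separated fields of a header value.
// Hypothetical type, for illustration only.
type traceparent struct {
	Version  string // 2 hex characters
	TraceID  string // 32 hex characters
	ParentID string // 16 hex characters (the parent span ID)
	Flags    string // 2 hex characters
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')) {
			return false
		}
	}
	return true
}

// parseTraceparent splits and validates a version 00 style value:
// <version>-<trace-id>-<parent-span-id>-<trace-flag>.
func parseTraceparent(header string) (traceparent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 {
		return traceparent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	tp := traceparent{Version: parts[0], TraceID: parts[1], ParentID: parts[2], Flags: parts[3]}
	switch {
	case !isLowerHex(tp.Version, 2) || tp.Version == "ff":
		return traceparent{}, errors.New("traceparent: invalid version")
	case !isLowerHex(tp.TraceID, 32) || strings.Trim(tp.TraceID, "0") == "":
		return traceparent{}, errors.New("traceparent: invalid trace ID")
	case !isLowerHex(tp.ParentID, 16) || strings.Trim(tp.ParentID, "0") == "":
		return traceparent{}, errors.New("traceparent: invalid parent span ID")
	case !isLowerHex(tp.Flags, 2):
		return traceparent{}, errors.New("traceparent: invalid trace flags")
	}
	return tp, nil
}

func main() {
	// Example value in the documented format.
	tp, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.TraceID, tp.ParentID, err)
}
```

The point of the sketch is only that the trace ID is the second field, so a consumer that treats the whole header value as an opaque ID would not match the ID used by the load balancer.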

What's your GoToSocial Version?

0.16.0

GoToSocial Arch

amd64 container

What happened?

Tracing is not mapped with the load balancer.

What you expected to happen?

Tracing is mapped with the load balancer.

How to reproduce it?

No response

Anything else we need to know?

No response

Tsuribori commented 1 month ago

There's a request-id-header config option that controls what incoming header the request ID is extracted from https://docs.gotosocial.org/en/latest/configuration/observability/#settings. Would setting request-id-header: "Traceparent" work?

genofire commented 1 month ago

You are correct, it works:

[screenshot]

So we could close this issue.

But why is the SQL query span not associated with the parent span (the request)? (Do they run in parallel or sequentially?)

Tsuribori commented 1 month ago

I checked my traces, and it seems that all the SQL query spans have a parentSpanId value that does not exist. There are also traces that contain only a single SQL span with a non-existent parent span, which should probably be associated with a request. So there is definitely something wrong.
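
For anyone who wants to check their own exported traces for the same symptom, here is a rough sketch in Go of the check described above: collect all span IDs per trace, then list spans whose parent ID is not among them. The `span` struct and `orphanSpans` function are hypothetical and only mirror the fields mentioned here (trace ID, span ID, parentSpanId); they are not tied to any particular export format.

```go
package main

import "fmt"

// span is a hypothetical, minimal view of an exported span.
type span struct {
	TraceID      string
	SpanID       string
	ParentSpanID string // empty for a root span
	Name         string
}

// orphanSpans returns the spans whose ParentSpanID does not match any
// SpanID seen in the same trace.
func orphanSpans(spans []span) []span {
	// Index every known span by trace ID plus span ID.
	known := make(map[string]bool, len(spans))
	for _, s := range spans {
		known[s.TraceID+"/"+s.SpanID] = true
	}

	var orphans []span
	for _, s := range spans {
		if s.ParentSpanID == "" {
			continue // root span, no parent expected
		}
		if !known[s.TraceID+"/"+s.ParentSpanID] {
			orphans = append(orphans, s)
		}
	}
	return orphans
}

func main() {
	// Made-up sample data: one healthy trace and one lone SQL span.
	spans := []span{
		{TraceID: "t1", SpanID: "a", Name: "request"},
		{TraceID: "t1", SpanID: "b", ParentSpanID: "a", Name: "sql query"},
		{TraceID: "t2", SpanID: "c", ParentSpanID: "missing", Name: "sql query"},
	}
	for _, s := range orphanSpans(spans) {
		fmt.Printf("orphan: trace=%s span=%s parent=%s name=%s\n",
			s.TraceID, s.SpanID, s.ParentSpanID, s.Name)
	}
}
```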

Tsuribori commented 1 month ago

I tested with the testrig used for development (DEBUG=1 GTS_PORT=8080 ./gotosocial testrig start), and tracing works correctly there: the SQL spans are child spans of the request.

Tsuribori commented 1 month ago

I tried running on my "production" setup with the env vars:

OTEL_TRACES_SAMPLER: "traceidratio"
OTEL_TRACES_SAMPLER_ARG: "0.01"
OTEL_BSP_MAX_QUEUE_SIZE: "10000"

to see if the problem was caused by the queue filling up, but spans were still missing as before, so I guess it's something else. Maybe the testrig and a production server differ in some way in how they function or are set up?
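
As a side note on why ratio sampling alone is an unlikely cause: a trace ID ratio sampler decides from the trace ID only. The sketch below is modeled on my understanding of how such a sampler works and is an assumption, not the SDK's actual code; `sampleByTraceID` is a made-up name.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sampleByTraceID reports whether a trace would be kept at the given
// fraction. Hypothetical sketch: the decision uses only the lower
// 8 bytes of the trace ID, compared against a bound derived from the
// fraction.
func sampleByTraceID(traceID [16]byte, fraction float64) bool {
	if fraction >= 1 {
		return true
	}
	if fraction <= 0 {
		return false
	}
	upperBound := uint64(fraction * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

func main() {
	// Made-up trace ID; every span in this trace carries the same ID,
	// so every span gets the same answer.
	id := [16]byte{8: 0x01, 15: 0x2a}
	fmt.Println(sampleByTraceID(id, 0.01), sampleByTraceID(id, 0.01))
}
```

If the SDK behaves like this, all spans sharing a trace ID get the same sampling decision, so a 1% ratio would drop whole traces rather than only the parent span.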

genofire commented 1 month ago

I just run it with the following (with a plain Tempo):

GTS_TRACING_ENABLED: "true"
GTS_TRACING_ENDPOINT: tempo.monitoring.svc:4317
GTS_TRACING_INSECURE_TRANSPORT: "true"
GTS_TRACING_TRANSPORT: grpc

Do you run the collector (OTEL_*)?

Tsuribori commented 1 month ago

GoToSocial uses opentelemetry-go for traces, which has some settings that can be set through OTEL_* environment variables, documented at https://pkg.go.dev/go.opentelemetry.io/otel/sdk/trace

I also use Tempo with the following configuration:

GTS_TRACING_ENABLED: "true"
GTS_TRACING_TRANSPORT: "http"
GTS_TRACING_ENDPOINT: "<redacted>:4318"
GTS_TRACING_INSECURE_TRANSPORT: "true"