kubeshop / tracetest

🔭 Tracetest - Build integration and end-to-end tests in minutes, instead of days, using OpenTelemetry and trace-based testing.
https://docs.tracetest.io/
Other
1.06k stars 69 forks source link

Trace not found during polling #4048

Open cb645j opened 1 week ago

cb645j commented 1 week ago

There seems to be an intermittent issue related to polling traces. The test polling is unable to find the trace even though the trace exist. The issue seems to only happen when the trace does not exist yet at the time of the first poll, however even though the trace is created before the polling timeout, the polling fails to find the trace. See below logs and you can clearly see the trace is created however the polling fails to find it.

sending event "Get Test Suite Run" (test) event sent "Get Test Suite Run" (test) 2024/10/10 19:08:17 GET /api/testsuites/smokeTestsSuite/run/13 GetTestSuiteRun 9.350796ms 2024/10/10 19:08:17 queue: received job for run 13 2024/10/10 19:08:17 [TracePoller] Test jettyDmeNukeServer_nuke Run 13: GetTraceByID (traceID 4f2dadf95cd977ccebdd72658cf6d773) error: trace not found 2024/10/10 19:08:17 queue: received job for run 13 2024/10/10 19:08:17 [TracePoller] Requeuing Test Run 13. Current iteration: 1 2024/10/10 19:08:17 queue: received job for run 13 2024/10/10 19:08:17 [TracePoller] Test jettyDmeNukeServer_nuke Run 13: GetTraceByID (traceID 4f2dadf95cd977ccebdd72658cf6d773) error: trace not found 2024/10/10 19:08:17 queue: received job for run 13 2024/10/10 19:08:17 [TracePoller] Requeuing Test Run 13. Current iteration: 2 sending event "Get Test Suite Run" (test) event sent "Get Test Suite Run" (test) 2024/10/10 19:08:18 GET /api/testsuites/smokeTestsSuite/run/13 GetTestSuiteRun 7.228252ms 2024/10/10 19:08:18 queue: received job for run 13 2024/10/10 19:08:18 [TracePoller] Test jettyDmeNukeServer_nuke Run 13: GetTraceByID (traceID 4f2dadf95cd977ccebdd72658cf6d773) error: trace not found 2024/10/10 19:08:18 queue: received job for run 13 2024/10/10 19:08:18 [TracePoller] Requeuing Test Run 13. Current iteration: 3 2024/10/10 19:08:19 queue: received job for run 13 2024/10/10 19:08:19 [OTLP] adding span 85757fbf69b4ce78 to trace 4f2dadf95cd977ccebdd72658cf6d773 2024/10/10 19:08:19 [OTLP] adding span 473d739e3aa1e296 to trace 873d8f6967d98b7e7dedd810d21488cd 2024/10/10 19:08:19 [OTLP] adding span 40324d77cdd38b99 to trace 4f2dadf95cd977ccebdd72658cf6d773 2024/10/10 19:08:19 [OTLP] sorted 2 traces 2024/10/10 19:08:19 [OTLP] creating trace 873d8f6967d98b7e7dedd810d21488cd with 1 spans 2024/10/10 19:08:19 [OTLP] creating trace 4f2dadf95cd977ccebdd72658cf6d773 with 2 spans 2024/10/10 19:08:19 [OTLP] received 2 traces 2024/10/10 19:08:19 [OTLP] processing trace 1/2 traceID 873d8f6967d98b7e7dedd810d21488cd 2024/10/10 19:08:19 [TracePoller] Test jettyDmeNukeServer_nuke Run 13: GetTraceByID (traceID 4f2dadf95cd977ccebdd72658cf6d773) error: trace not found 2024/10/10 19:08:19 queue: received job for run 13 2024/10/10 19:08:19 [OTLP] [873d8f6967d98b7e7dedd810d21488cd] failed to get test run: test run is not awaiting trace nor executing. Actual state: FINISHED 2024/10/10 19:08:19 [TracePoller] Requeuing Test Run 13. Current iteration: 4 sending event "Get Test Suite Run" (test) event sent "Get Test Suite Run" (test) 2024/10/10 19:08:19 GET /api/testsuites/smokeTestsSuite/run/13 GetTestSuiteRun 6.784542ms 2024/10/10 19:08:19 queue: received job for run 13 2024/10/10 19:08:19 [TracePoller] Test jettyDmeNukeServer_nuke Run 13: GetTraceByID (traceID 4f2dadf95cd977ccebdd72658cf6d773) error: trace not found 2024/10/10 19:08:19 queue: received job for run 13 2024/10/10 19:08:19 [TracePoller] Requeuing Test Run 13. Current iteration: 5 2024/10/10 19:08:20 queue: received job for run 13 2024/10/10 19:08:20 [TracePoller] Test jettyDmeNukeServer_nuke Run 13: GetTraceByID (traceID 4f2dadf95cd977ccebdd72658cf6d773) error: trace not found 2024/10/10 19:08:20 queue: received job for run 13 2024/10/10 19:08:20 [TracePoller] Requeuing Test Run 13. Current iteration: 6 sending event "Get Test Suite Run" (test) event sent "Get Test Suite Run" (test)

mathnogueira commented 1 week ago

What tracing backend are you using? Jaeger, Tempo, or sending traces directly to Tracetest?

cb645j commented 1 week ago

Im using the tracetest backend, sending traces directly to tracetest.. As I workaround I can increase the initial retry delay to 5 seconds to give time for the traces and all spans to be sent to tracetest before starting the initial polling. However, in theory, I dont think we should have to do that because it defeats the whole purpose of having polling.

mathnogueira commented 1 week ago

I have seen that happen before. The problem was that the triggered system was ignoring the trace context propagation headers and this was making the app generate another Trace-ID for the trace.

A way of debugging is problem is sending your traces to jaeger and see if the trace you get has the same ID in your test run. If they don't match, probably your application is ignoring context propagation. Could you try that?

cb645j commented 1 week ago

ya, i can give that a try