spring-cloud / spring-cloud-sleuth

Distributed tracing for spring cloud
https://spring.io/projects/spring-cloud-sleuth
Apache License 2.0

Merged Traces #1937

Closed · memaskal closed this 2 years ago

memaskal commented 3 years ago

I have a Spring Boot 2.4.5 project with Spring Integration and Spring Cloud 2020.0.2. The application consumes from a JMS queue, validates the request, and additionally uses two queues for further processing. I use the message-driven channel adapter shown below to read messages from the incoming requests queue and send them to my channel "received-message-channel". After validation, an internal queue (int-jms:channel) is used for additional processing before I produce the final output to another JMS queue, directly via the JmsTemplate from a service activator.

This application is part of a pipeline, so I use Sleuth to trace the requests. I use the Brave Tracer to acquire the current span (created by Sleuth's channel interceptor instrumentation for Spring Integration) and add custom tags to the spans, identifying the request type and so on.

<int-jms:message-driven-channel-adapter
    connection-factory="connectionFactory"
    destination-name="input-queue"
    channel="received-message-channel"
    concurrent-consumers="1"
    max-concurrent-consumers="5"
    acknowledge="transacted"
    error-channel="errorChannel"
/>

<int-jms:channel
      id="internalQueueChannel"
      queue-name="internal-queue"
      connection-factory="connectionFactory"
      acknowledge="transacted"
      concurrency="1-5"
/>

<bean id="connectionFactory" class="org.springframework.jms.connection.CachingConnectionFactory">
  <property name="targetConnectionFactory">
      <bean class="org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory">
          <property name="user" value="${spring.artemis.user}" />
          <property name="password" value="${spring.artemis.password}" />
          <property name="brokerURL" value="${spring.artemis.brokerUrl}" />
      </bean>
  </property>
</bean>
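
A rough sketch of the service activator mentioned above, which tags the current span via the Brave Tracer and publishes the final output with the JmsTemplate (annotation-style for brevity; the class name, output queue name, tag key, and header name are illustrative, not the actual project code):

// Simplified sketch only; names such as OutputService, "output-queue",
// "request.type" and "requestType" are placeholders, not from the project.
import brave.Span;
import brave.Tracer;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.messaging.Message;
import org.springframework.stereotype.Service;

@Service
public class OutputService {

    private final Tracer tracer;          // Brave tracer auto-configured by Sleuth
    private final JmsTemplate jmsTemplate;

    public OutputService(Tracer tracer, JmsTemplate jmsTemplate) {
        this.tracer = tracer;
        this.jmsTemplate = jmsTemplate;
    }

    @ServiceActivator(inputChannel = "internalQueueChannel")
    public void produceOutput(Message<String> message) {
        // The current span is the one created by Sleuth's channel interceptor.
        Span span = tracer.currentSpan();
        if (span != null) {
            span.tag("request.type", resolveType(message));
        }
        // Final output goes straight to the outgoing JMS queue via the JmsTemplate.
        jmsTemplate.convertAndSend("output-queue", message.getPayload());
    }

    private String resolveType(Message<String> message) {
        Object type = message.getHeaders().get("requestType");
        return type != null ? type.toString() : "unknown";
    }
}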

Moreover, I use sampling, and I report only some of the channels (to avoid reporting every span) using patterns:

  spring.sleuth.sampler.percentage=0.1
  spring.sleuth.integration.patterns=received-message-channel, anotherChannel, finalChannel

The problem I see in the Zipkin/Jaeger UI is a single trace containing almost all of my spans, or in other words, multiple traces merged into one. As shown in the screenshot below, 7K+ spans are reported under the same trace id. These spans are nested in arbitrary ways that do not even preserve their natural ordering. Some of these spans carry warnings such as "invalid parent span IDs=e0c87fcfe5a4e980; skipping clock skew adjustment".

[screenshot: multiple spans under the same trace]

These were the trace's final stats once all spans were reported: Trace Start: May 6 2021, 14:56:17.303 · Duration: 32m 27s · Services: 3 · Depth: 16 · Total Spans: 9104

marcingrzejszczak commented 3 years ago

Hi! Are you sure that they are reported in an invalid way? What I mean is: if we're producing the traces / spans properly in Sleuth but Jaeger/Zipkin fails to render them accordingly, then it's not something we can fix on our side. Can you double-check that the issue is on our side? Also, could you please create a small project that replicates the issue (preferably in Java and Maven)?

memaskal commented 3 years ago

Hi, I will try to provide a way to replicate the issue. In the meantime, do you suggest a way to check that the spans are reported correctly? That is why I tried both Zipkin and Jaeger, to rule out UI rendering problems, but I will check further based on your suggestion.

marcingrzejszczak commented 3 years ago

You could go to Zipkin and download the whole JSON it received. Then you'd have to analyze whether the parent-child relationship is maintained between the reported spans and whether any spans are missing (e.g. there is a span with parent id X but no span with id X).
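
A throwaway check along these lines could automate that. This assumes the downloaded JSON is a flat array of spans with "id" and optional "parentId" fields, Jackson is on the classpath, and "trace.json" is just a placeholder file name:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class TraceIntegrityCheck {

    public static void main(String[] args) throws Exception {
        // Load the exported trace (assumed to be a JSON array of spans).
        JsonNode spans = new ObjectMapper().readTree(new File("trace.json"));

        // Collect all reported span ids.
        Set<String> ids = new HashSet<>();
        for (JsonNode span : spans) {
            ids.add(span.get("id").asText());
        }

        // Flag spans whose parent was never reported.
        for (JsonNode span : spans) {
            JsonNode parentId = span.get("parentId");
            if (parentId != null && !ids.contains(parentId.asText())) {
                System.out.println("Missing parent " + parentId.asText()
                        + " for span " + span.get("id").asText());
            }
        }
    }
}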

memaskal commented 3 years ago

I think I traced the issue to Sleuth's JMS instrumentation. When a JMS message is produced by an uninstrumented application (e.g. JMeter) the issue occurs, whereas it does not happen when the JMS messages are produced by instrumented applications.

I created an example app here. You can use the JMeter script to recreate the issue and see that thousands of spans appear under the same trace, or run the integration test I created, which works as expected: each trace has the correct spans under it.

marcingrzejszczak commented 3 years ago

If you start with an uninstrumented application then Sleuth will create a new trace for each such message which makes perfect sense. What would be the expected behaviour in your opinion?
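
For context, "uninstrumented" here means a plain JMS producer, like the JMeter script, that sends no trace context at all, so the consumer side has nothing to continue and starts a fresh trace per message. A rough sketch of how such a producer could hand over a context is below; it assumes the consumer's Brave JMS instrumentation reads B3 single format from a message property named "b3" (verify against your Brave version), and the broker URL, credentials, and trace/span id values are placeholders, not taken from the project:

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class PlainJmsProducer {

    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616", "artemis", "artemis");
        try (Connection connection = factory.createConnection()) {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("input-queue"));

            TextMessage message = session.createTextMessage("{\"request\":\"example\"}");
            // Without a property like this, the consumer sees no incoming context
            // and starts a new trace. B3 single format: {traceId}-{spanId}-{sampled}.
            message.setStringProperty("b3", "463ac35c9f6413ad48485a3953bb6124-a2fb4a1d1a96d312-1");
            producer.send(message);
        }
    }
}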

memaskal commented 3 years ago

Yes, this is the expected behavior in my opinion too, but it does not work!

Maybe I was not clear in my previous message. Run the sample app with JMeter as the JMS producer to reproduce this issue.

memaskal commented 3 years ago

So would it be considered a bug?

marcingrzejszczak commented 3 years ago

I don't know, I haven't had time to look into it yet.

marcingrzejszczak commented 2 years ago

I tried running your example. I had to start the ActiveMQ Artemis instance; then I tried running the JMeter file, but nothing happens. When I go to the Artemis UI and manually send out messages, I get separate trace ids, as shown below (example for 2 messages).

[screenshot: two messages reported as two separate trace ids]

I can't replicate this problem. Can you ensure that the JMeter file is OK? Can you write a small README on what I should do to run this example to replicate the problem?

In terms of the number of integration spans - we will be improving that for Spring Framework 6 and Boot 3.

spring-cloud-issues commented 2 years ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

spring-cloud-issues commented 2 years ago

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.