grafana / x-ray-datasource

AWS X-Ray data source
Apache License 2.0

Different begin segment display between AWS xray and plugin #237

Open bendanye opened 3 months ago

bendanye commented 3 months ago

What happened: The segment display differs between the AWS X-Ray console and the plugin.

What you expected to happen: The display should be consistent whether the trace is viewed in the X-Ray console or in the plugin.

Screenshots

[Screenshots: xray, plugin]

Environment:

idastambuk commented 3 months ago

Hi @bendanye, so far I'm unfortunately unable to reproduce this. Can you tell us if the Node Graph visualization also shows a different segment than nginx? Also, is the company segment shown anywhere in the graph or segments timeline in the X-Ray console, or is it just there instead of the nginx segment?

bendanye commented 3 months ago

Hi @idastambuk,
The company segment is shown somewhere. From the screenshots you can see that company is shown after nginx in the plugin, while xray shows nginx first.

[Screenshots: plugin segments, xray segments]

The Node Graph visualization shows the same thing in both xray and the plugin (I have attached the screenshots).

[Screenshots: plugin graph, xray graph]

idastambuk commented 3 months ago

Hi @bendanye, thanks for the screenshots! Looking at them, it seems like there are multiple top-level segments (including nginx and company) and the plugin is just displaying them in a different order (alphabetical?) than the AWS console. Is this indeed the issue?

bendanye commented 3 months ago

Hi @idastambuk, I think this is very misleading: if you do not expand the node graph, it might seem that the first request goes to company instead of nginx. Also, we associate the segment at the top with the first request.

njvrzm commented 3 months ago

We're still trying to reproduce this - meanwhile, could you provide a bit more detailed information? Specifically, in the trace list where you're seeing the spans out of the expected order, could you check the "Start Time" shown in the span details and let us know if the Company span that's above has a later start time than the nginx span that's below? From your screenshots it looks like its start time is earlier, which could explain why it's being sorted the way it is.

It would also be helpful to know if the parent span is set correctly - you can open the query inspector, go to the Data column, and find the spans that look out of order. Is the Nginx span an ancestor of the Company span?
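The ancestor check asked about here can be sketched in a few lines of Python. This is only an illustration, assuming the rows from the query inspector have been reduced to a mapping from each span id to its parent span id; the ids used below are hypothetical and are not taken from the reporter's trace.

```python
from typing import Dict, Optional


def is_ancestor(parents: Dict[str, Optional[str]], ancestor_id: str, span_id: str) -> bool:
    """Return True if ancestor_id appears on the parent chain of span_id.

    `parents` maps each span id to its parent span id (None for a root span).
    """
    seen = set()  # guards against malformed data that contains a cycle
    current = parents.get(span_id)
    while current is not None and current not in seen:
        if current == ancestor_id:
            return True
        seen.add(current)
        current = parents.get(current)
    return False


# Hypothetical ids for illustration only.
parents = {"nginx": None, "company": None, "lookup": "nginx"}
print(is_ancestor(parents, "nginx", "lookup"))   # True
print(is_ancestor(parents, "nginx", "company"))  # False
```

If the check returns False for two spans that both have no parent, they are sibling top-level spans, and their relative order depends only on how the viewer sorts root spans.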

bendanye commented 3 months ago

> could you check the "Start Time" shown in the span details and let us know if the Company span that's above has a later start time than the nginx span that's below

The start time for company is 3.9, while the start time for nginx is 3.8.

> It would also be helpful to know if the parent span is set correctly - you can open the query inspector, go to the Data column, and find the spans that look out of order. Is the Nginx span an ancestor of the Company span?

This is what I'm seeing:

[Screenshot: query inspector]

njvrzm commented 2 months ago

Thanks very much for the details, @bendanye, they're very helpful. I think we have an idea for a fix - I'll discuss with the team later and we'll see about getting this into our backlog.

idastambuk commented 1 month ago

FYI the suggested solution here is to sort top level segments (no parent span) by their startTime, ascending
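As a rough illustration of that suggestion (not the plugin's actual code), the sketch below filters out spans that have a parent and sorts the rest by start time. The dictionary layout and the `name` field are assumptions made for the example; the 3.9 and 3.8 values are the start times reported earlier in this thread.

```python
def order_root_spans(spans):
    """Return the spans with no parent, sorted by startTime ascending."""
    # A span counts as top-level when parentSpanId is missing, None, or empty.
    roots = [s for s in spans if not s.get("parentSpanId")]
    return sorted(roots, key=lambda s: s["startTime"])


# Assumed record shape; only the two root start times come from the thread.
spans = [
    {"name": "company", "parentSpanId": None, "startTime": 3.9},
    {"name": "nginx", "parentSpanId": None, "startTime": 3.8},
    {"name": "child", "parentSpanId": "abc", "startTime": 3.7},
]
print([s["name"] for s in order_root_spans(spans)])  # ['nginx', 'company']
```

Python's `sorted` is stable, so root spans with identical start times keep the order in which they arrived.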

idastambuk commented 1 month ago

Hi @bendanye, I'm taking another look at this and I'm having trouble reproducing the wrong sorting. Our Trace View visualization sorts the top-level spans by their startTime, if the start time is in milliseconds, regardless of the order in which they arrive from X-Ray.

Additionally, this screenshot seems to show that the spans are sorted correctly according to their start time, since the span timeline (the colored lines) goes from left to right. I'm wondering whether this is instead a problem with the response containing different data, or with the X-Ray console using some other parameters to sort.

[Image]

While looking at the screenshots, it seems like there is data on two separate traces that doesn't match - would it be possible to get this info on one single trace that is still giving you issues:

  1. A screenshot of the top-level segments timeline in Grafana's trace view, compared to the segments timeline in the X-Ray console in AWS.
  2. The data (table) view in Grafana, but only spans without a parentSpanId. You can accomplish this by sorting the table by parentSpanId, which will put the empty cells first.

We really appreciate your help with reproducing this!

bendanye commented 1 month ago

Hi @idastambuk

Sure, here is the recent result:

[Screenshots: query_inspector service operation]

idastambuk commented 1 month ago

Hi @bendanye, thanks for the screenshots. Looking at the data, it does seem like our plugin is correctly displaying the data we're getting from X-Ray: the start_time for the lookup root span is before the start_time for the nginx root span:

Image

It IS strange that both root spans start AFTER their child spans. Additionally, the nginx nested spans clearly start before the lookup spans. I'm not sure if there's anything we could do here, since the data from X-Ray seems to be inconsistent. This could be a problem with instrumentation where the root spans are recorded incorrectly - are you able to double check this?
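One way to double check for this kind of inconsistency is to flag every span whose start time is later than that of its earliest child. The sketch below is a minimal illustration, assuming the table rows have been exported as dictionaries with `spanId`, `parentSpanId`, and `startTime` keys; those key names and the sample values are assumptions, not data from this trace.

```python
def spans_starting_after_children(spans):
    """Return ids of spans whose startTime is later than their earliest child's."""
    earliest_child = {}  # parent span id -> earliest child startTime
    for s in spans:
        parent = s.get("parentSpanId")
        if parent:
            start = s["startTime"]
            if parent not in earliest_child or start < earliest_child[parent]:
                earliest_child[parent] = start
    return [
        s["spanId"]
        for s in spans
        if s["spanId"] in earliest_child
        and s["startTime"] > earliest_child[s["spanId"]]
    ]


# Hypothetical rows for illustration only.
spans = [
    {"spanId": "root-a", "parentSpanId": None, "startTime": 10.0},
    {"spanId": "a-1", "parentSpanId": "root-a", "startTime": 9.5},
    {"spanId": "root-b", "parentSpanId": None, "startTime": 11.0},
    {"spanId": "b-1", "parentSpanId": "root-b", "startTime": 11.2},
]
print(spans_starting_after_children(spans))  # ['root-a']
```

Any id returned here points at a span that was recorded as starting after work it is supposed to contain, which would suggest looking at the instrumentation rather than at the viewer.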

bendanye commented 1 month ago

Hi @idastambuk, I have checked again: if I expand both nginx and lookup, the start times show that nginx is earlier than lookup.

Image Image

idastambuk commented 1 month ago

Hi @bendanye, the start times you show are for child spans. As mentioned, their timing is definitely unusual, but the root span sorting comes from the startTimes of the root spans, and those seem to be data coming into the plugin, not something calculated IN the plugin.