openzipkin / zipkin-aws

Reporters and collectors for use in Amazon's cloud
Apache License 2.0
69 stars 34 forks source link

Use local service name as X-Ray subsegment name #88

Closed robyf closed 6 years ago

robyf commented 6 years ago

With reference to https://github.com/openzipkin/zipkin-aws/issues/87 this PR adds the possibility to use local service names as segmente name for span of type different than SERVER and CONSUMER. By default "unknown" is still sent if the remote service name is null, only if the AWS_XRAY_USE_LOCAL_SPAN_FOR_MISSING_REMOTES environment variable is set to true the local service name is used.

As an example (using spring-boot 2 based Java applications) by default these spans:

[{
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "f3d698d7b353651f",
  "id": "164d63874559fbdf",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527232127973431,
  "duration": 22109,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}, {
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "02e913691d305aa0",
  "id": "f3d698d7b353651f",
  "name": "hystrix",
  "timestamp": 1527232127972476,
  "duration": 23595,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  }
}, {
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "parentId": "70cbdfe07b779e32",
  "id": "02e913691d305aa0",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527232127972240,
  "duration": 23838,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}, {
  "traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
  "id": "70cbdfe07b779e32",
  "kind": "SERVER",
  "name": "get",
  "timestamp": 1527232127970111,
  "duration": 29131,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "remoteEndpoint": {
    "ipv4": "127.0.0.1",
    "port": 33060
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}]

result in these X-Ray segments:

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "f3d698d7b353651f",
  "id": "164d63874559fbdf",
  "type": "subsegment",
  "namespace": "remote",
  "name": "unknown",
  "start_time": 1.527232127973431E9,
  "end_time": 1.52723212799554E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "02e913691d305aa0",
  "id": "f3d698d7b353651f",
  "type": "subsegment",
  "name": "unknown",
  "start_time": 1.527232127972476E9,
  "end_time": 1.527232127996071E9
}

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "parent_id": "70cbdfe07b779e32",
  "id": "02e913691d305aa0",
  "type": "subsegment",
  "namespace": "remote",
  "name": "unknown",
  "start_time": 1.52723212797224E9,
  "end_time": 1.527232127996078E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

{
  "trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
  "id": "70cbdfe07b779e32",
  "name": "security-gateway-aws",
  "start_time": 1.527232127970111E9,
  "end_time": 1.527232127999242E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

while with the environment variable set these spans:

[{
  "traceId": "5b07c4dd8167bc728c9369499184df04",
  "parentId": "2561d4b051f72922",
  "id": "0d32b6ca15c75375",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527235805895055,
  "duration": 6568,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}, {
  "traceId": "5b07c4dd8167bc728c9369499184df04",
  "parentId": "8c9369499184df04",
  "id": "7b25c19865b5ee2c",
  "kind": "CLIENT",
  "name": "get",
  "timestamp": 1527235805888945,
  "duration": 19229,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}, {
  "traceId": "5b07c4dd8167bc728c9369499184df04",
  "parentId": "7b25c19865b5ee2c",
  "id": "2561d4b051f72922",
  "name": "hystrix",
  "timestamp": 1527235805890880,
  "duration": 21261,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  }
}, {
  "traceId": "5b07c4dd8167bc728c9369499184df04",
  "id": "8c9369499184df04",
  "kind": "SERVER",
  "name": "get",
  "timestamp": 1527235805883133,
  "duration": 29543,
  "localEndpoint": {
    "serviceName": "security-gateway-aws",
    "ipv4": "192.168.0.171"
  },
  "remoteEndpoint": {
    "ipv4": "127.0.0.1",
    "port": 42386
  },
  "tags": {
    "http.method": "GET",
    "http.path": "/test-app/v1/public"
  }
}]

result in these X-Ray segments:

{
  "trace_id": "1-5b07c4dd-8167bc728c9369499184df04",
  "parent_id": "2561d4b051f72922",
  "id": "0d32b6ca15c75375",
  "type": "subsegment",
  "namespace": "remote",
  "name": "security-gateway-aws",
  "start_time": 1.527235805895055E9,
  "end_time": 1.527235805901623E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

{
  "trace_id": "1-5b07c4dd-8167bc728c9369499184df04",
  "parent_id": "8c9369499184df04",
  "id": "7b25c19865b5ee2c",
  "type": "subsegment",
  "namespace": "remote",
  "name": "security-gateway-aws",
  "start_time": 1.527235805888945E9,
  "end_time": 1.527235805908174E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}

{
  "trace_id": "1-5b07c4dd-8167bc728c9369499184df04",
  "parent_id": "7b25c19865b5ee2c",
  "id": "2561d4b051f72922",
  "type": "subsegment",
  "name": "security-gateway-aws",
  "start_time": 1.52723580589088E9,
  "end_time": 1.527235805912141E9
}

{
  "trace_id": "1-5b07c4dd-8167bc728c9369499184df04",
  "id": "8c9369499184df04",
  "name": "security-gateway-aws",
  "start_time": 1.527235805883133E9,
  "end_time": 1.527235805912676E9,
  "http": {
    "request": {
      "method": "GET"
    }
  },
  "annotations": {
    "operation": "get",
    "http_path": "/test-app/v1/public"
  }
}
codefromthecrypt commented 6 years ago

thanks. maybe a unit test to make sure no one breaks this later?

robyf commented 6 years ago

I added the test now. One thing related to that: I had to add a test dependency to be able to mock system libraries as described here: https://stackoverflow.com/a/35393623. I hope that's ok!

robyf commented 6 years ago

I have to admit I don't understand how that test failed :( I haven't touched anything related to SQS publishing. Just to be on the safe side I cloned the repo in another machine and ran the build there, all tests passed (as they were passing before I pushed).

devinsba commented 6 years ago

That SQS test has been flaking for a month or so. Neither I nor @llinder have had time to fix it it seems

devinsba commented 6 years ago

I haven't seen this on the graph yet myself actually, we haven't traced anything that we didn't have end to end. (other than AWS SDK, but that sets the remote endpoint correctly)

I honestly think I'd prefer to see unknown in the graph, over using the local service name, as that would indicate to me that I have some more code to instrument. I understand not everyone is doing greenfield development so I get huge luxury of knowing all of my applications external touchpoints

robyf commented 6 years ago

@devinsba That's why I left it optional, as it all depends by the context one is in. For greenfield development or when you're in full control of the production of traces and spans it makes sense to immediately see if you have missed something, but when delegating the instrumentation and report to Zipkin to libraries like Sleuth you just have to accept the way they decided to report (or try to send PRs there).

codefromthecrypt commented 6 years ago

sleuth we have control over. all the folks replying here contribute to it. let's focus on fixing the problem! by raising issues clearly about the misses in sleuth.

robyf commented 6 years ago

I will go on testing and report here my findings. So far two things:

  1. X-Ray mandates segment ids inside traces to be unique while Sleuth/Zipkin expects the CLIENT span from one services and the SERVER span from the called service to share the same span id. I don't know how to easily solve this.

  2. When using zuul (we use it in our security gateway) every incoming request forwarded to services produces 2 CLIENT spans in the trace (as in the example I posted before). I believe one is the ribbon command and the second is the actual call to the service.

codefromthecrypt commented 6 years ago

1.

X-Ray mandates segment ids inside traces to be unique while Sleuth/Zipkin expects the CLIENT span from one services and the SERVER span from the called service to share the same span id. I don't know how to easily solve this.

add this to your application properties

don't use the same span ID on the client and server side of an RPC

spring.sleuth.supports-join=false

1.

When using zuul (we use it in our security gateway) every incoming request forwarded to services produces 2 CLIENT spans in the trace (as in the example I posted before). I believe one is the ribbon command and the second is the actual call to the service.

open an issue in sleuth to see if we can work around the instrumentation miss? A client span shouldn't be a child of a client span even if zipkin's server code permits this, we can't necessarily make others (like amazon or google) do the same.

robyf commented 6 years ago

Adding the supports-join=false property has improved the situation a lot. This is what I'm getting now (using my local fork of zipkin-aws but with the "use local service name as remote" disabled): selection_009

Using span name for custom spans seems to do the trick (see the hystrix span in the screenshot).

codefromthecrypt commented 6 years ago

I think marcin adjusted the ribbon stuff so to not create duplicate remote spans. Can you verify this? If so I'd like to keep your changes except the special-cased flag..

robyf commented 6 years ago

Yes that has been fixed and verified. There's still one thing I'd like to try for those remote calls without remote service in the span: use the span name as segment name and not mark that as remote. As the X-Ray documentation talks about using the name of the function called that would maybe make more sense than using the local service name and mark the segment as remote, but I don't know how does it play out with client responses and such, so I've to try.

The best of course would be that the spans would contain the remote service section, I don't know how many places in sleuth would have to be changed to add that to the instrumentation.

codefromthecrypt commented 6 years ago

I don't think it is reasonable to expect the "remote service names" for all client spans in any zipkin instrumentation. For example, it is valid to not know the name of the downstream service. We don't want to encourage putting "dummy values" here as that makes the service functionality worse. Also worse would be losing the signal that it was an RPC call (kind.client)

Of course, if something was incorrectly marked as client, that's a bug... just we don't want to scrub the client just to make the service diagram believe there is no missing data.

I'd suggest to use sleuth's span adjuster for the things unsure about (like scrubbing the span.kind) to basically decouple pressure here http://cloud.spring.io/spring-cloud-static/spring-cloud-sleuth/2.0.0.RC2/single/spring-cloud-sleuth.html#_customization_of_reported_spans

It seems about time to get some clarifications from amazon (beyond documentation), too. Could you possibly raise a support ticket on their side about the mapping when you don't know the end service name?

robyf commented 6 years ago

I have contacted now the AWS Business Support explaining the issue, won't make any code change until I get a reply from them. Usually they reply in 24 hours.

codefromthecrypt commented 6 years ago

thx!

robyf commented 6 years ago

It took a while but I got this reply from AWS:

X-Ray does not have zipkin's concept of client vs server spans. It does have the concept of remote segments and inferred segments. In the github link for the issue #88 ( https://github.com/openzipkin/zipkin-aws/pull/88 ), there is discussion about how to name the server spans when the caller does not know the downstream dependency and your library is naming the span as "unknown". Please refer to the following document for details about how the segments are named: https://github.com/awsdocs/aws-xray-developer-guide/blob/master/doc-source/xray-sdk-java-httpclients.md

When making a downstream call to a service the segment may include a child subsegment with any name and an ID to show the Client side experience (errors, latency). Typically the SDK populates the CNAME of the downstream service in the subsegment name field. The downstream service generates a new segment with the parent_id set to the id of the previous subsegment. The name of this segment is used as the name of the service graph node.

If the downstream service does not send any segment with the parent id set to the id of the caller subsegment then an inferred segment is created for the remote subsegment with the name of the segment set to the name of the subsegment. These inferred segments may result into new nodes in the service graph with the name of the subsegment. Although since these nodes are inferred, the statistics of the node will be inferred from the edges.

In this case, please let us know why are you sending subsegments with name set to "unknown" and namespace as "remote". The namespace is correct as long as that indicates a RPC. It is not clear why the name of the subsegment is not to the correct name of the downstream service and is being referring to as "unknown".

Looking in the link: https://github.com/openzipkin/zipkin-aws/pull/59, that has an update by @adriancole which reads: "Usually span name isn't null, but it could be sent as null when updating a span post factum", please note that segments or subsegments in X-Ray cannot be updated once they have been sent. Are you trying to update the segment with more information and hence sending an "unknown" name for the same id? If so we would like to understand why it is not possible to send the correct name?

codefromthecrypt commented 6 years ago

it is probably mutually difficult to follow these conversations ( I mean from AWS and non AWS point of view ).

The various statements about remote segments and inferred segments elude to an absence of intermediate or in-process spans, and absence of spans that originate without an incoming call. This will happen a lot.. there will definitely be intermediate spans in routine tracing.

Ex A calls B, but along the way there are several sub-operations in-between. or A calls B, but it wasn't due to an incoming call, rather a schedule or a circumstance of bootstrap.

So there's one thing to clarify, which is the presence of intermediate spans, representing local activity, sometimes that local activity is the root span.

On the other hand there's a strange circumstance of an assumption that you always know the name of a remote downstream. Such an assumption is so unrealistic I feel it is more likely we just aren't on the same page at all. I'm cc'ing abhishek maybe we can find a more direct way to collaborate as cadence like this isn't likely to resolve things quickly.

@abhiksingh, I will be in Boston prior to a meeting 9th July. Any chance we can do a video call or even better something in-person to run through various interop things between Zipkin and AWS X-Ray? cc also @devinsba who is timezone close for such a thing.

robyf commented 6 years ago

AWS is asking to have a phone call with me, but I start feeling I'll be just the man-in-the-middle here adding little if not no value at all. Plus the timing will be inconvenient because I'm located in Finland and the AWS X-Ray team somewhere in the PST time zone. Could I suggest AWS to have some kind of video calls with you guys instead?

codefromthecrypt commented 6 years ago

yeah this was my suggestion. i am in Malaysia but on july 9 i am in boston much closer. can certainly do something meanwhile too like next week end of their day. you can have someone reach out directly if july is too late.

appreciated!

robyf commented 6 years ago

I just replied them asking to have the call done between you and them directly. Hopefully they'll be ok with it!

robyf commented 6 years ago

The AWS X-Ray service team is ok to have the call with you, could you pass me your contact details? I'll forward those to the support contact.

codefromthecrypt commented 6 years ago

Abhishek made a meeting already. I created a google doc to put some key questions/discussion points in

https://docs.google.com/document/d/1ZFg2SX2DUJFRE8MpodV9DL9uYgNHe-90PMQIb1DrY8k/edit

On Wed, Jun 27, 2018 at 12:38 PM, robyf notifications@github.com wrote:

The AWS X-Ray service team is ok to have the call with you, could you pass me your contact details? I'll forward those to the support contact.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/zipkin-aws/pull/88#issuecomment-400540424, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD61whAy1MHtbRDWb9cWI-cfQTVhqTKks5uAwxggaJpZM4UNlNX .

codefromthecrypt commented 6 years ago

plan resulting from feedback from meeting with AWS

For the remote subsegment name, use a fallback model:

When users report a problem, tell them to configure a valid remote service name (possibly by decorating a reporter if there's no other way).

Also tell users to fix or ask for fixes when there is double-instrumentation (client child of client) as this is a problem and affects statistics. Meanwhile, Amazon will look into skipping bad data when rendering the service graph on their end.