openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0
16.93k stars 3.08k forks source link

Should we advise recording http.host by default? If so, client server or both? #2167

Open codefromthecrypt opened 6 years ago

codefromthecrypt commented 6 years ago

Right now, http.method, http.path and http_status.code (when not success) are typically added by default in zipkin tracers. This limits the size (cost) of network bandwidth, storage, indexing and lookup especially as spans are recorded at high rates. Should we advise tracers add http.host also? If so should it include the port? Should we do it both client and server side?

Details

http.host (as we define it), is an alternative to IP lookup, just the domain portion. Some libraries (I don't know all of them, but for example armeria census add http.host by default.

We've had @golonzovsky request for this to be added by default so that he doesn't have to change instrumentation policy. This can make sense, but we have to be careful as for example how host is defined will impact our ability to search and also add indexing cost to everyone.

For example, we already record port as endpoint.port. By not also including port in http.host, we avoid some cardinality concerns and some search concerns as well (ex http.host=foo won't match http.host=foo:80). While people expect DNS names to be stable there's certainly a chance hostnames can be mostly random instance names. This would add indexing load to everyone, in a way that is different than http.method which is fixed except weird protocols.

I think it would be good advice to at least parse http.host as we define it, ex not including redundant port info, but data policy should be widely considered. I'll cherry pick some folks for feedback.. I don't want to scare people about data cardinality.. ex we already record http.path, and that has unlimited cardinality in a way worse than http.host. If we decide to do this, I think we should also be parsimonious as possible.. For example, is this something we could only add to either the client or server side and not both by default?

For those who want http.host, or are recording it already, please mention your use case!

Thanks much cc @openzipkin/core @drolando @wu-sheng @marcingrzejszczak @jkschneider @jettify @cppKin @ellispritchard @chris-vale-ck @mosesn @jplock @ivantopo @nicmunroe @james-callahan @chuan-yun @shimamoto @hyleung @bogdandrutu @trustin @anuraaga

trustin commented 6 years ago

+1 for avoiding redundant information, i.e. no port in http.host. If a user insists to have a port, we could define an optional property like http.authority which is clearer.

james-callahan commented 6 years ago

The Host header may contain a port that does not match the port of the request; e.g. if it has gone through a load balancer. If you add a http.host field, I think it should contain the full contents of the host (or :authority) header as sent/received by/from the client.

jcchavezs commented 6 years ago

Mixed feelings here. In server side you do not always know the host whereas in client side host can be a domain, an IP or just a name (if in a docker compose network). What I would prefer as a user is to have an boolean option which is true when I want to record the host and false when I don't both in server and client side instrumentation.

In any case, being explicit about host not including the port is important.

Does that make sense?

codefromthecrypt commented 6 years ago

hi JC so this is http specific, so in addition to the endpoint details. When talking about a tag usually we dont have interdependencies such as this flag means interpret a certain way.

my advice would be to stick with http host or authority as they are defined in http semantics.. then decide whether or not it should be standard or not (ex recorded by default). this would simplify the issue I think.

james-callahan commented 6 years ago

hi JC so this is http specific

I would expect so from the http. prefix. The Host header is also HTTP specific. I don't see the issue/question?

jcchavezs commented 6 years ago

@adriancole I should have been more clear. I meant that having a flag in the options of the middleware that says recordHost: true|false so user could decide whether it is meaningful or not to record such information.

In all setup I have been dealing with (20 services) http.host is not so relevant because in client side it is usually serviceName what you check. In server side it can be anything (IP, AWS instance id or localhost). I see this tag in both cases as optional.

codefromthecrypt commented 6 years ago

@adriancole https://github.com/adriancole I should have been more clear. I meant that having a flag in the options of the middleware that says recordHost: true|false so user could decide whether it is meaningful or not to record such information

ah ok. well I see your point now. hmmm I guess if we decided to not make it standard then people will add it anyway like they do today or we can do something like you mention to make it easier. otoh if we were going the route of not recording by default I might be more inclined to list the "standard tags" desired vs a flag for each. this could be programmatic or declarative like http.client.tags=http.method,http.path,http.host I think some people maybe python does this already.

at any rate suspect we should get a sense of who wants the default to include host or more importantly who doesn't. I think we can sort a nice way our regardless of the decision.

jcchavezs commented 6 years ago

Yeah that sounds reasonable. For example, in Go http middleware the http.response_size is not recorded by default (because it the response body is a buffer and if you want to measure it you gotta read it and that drains the buffer) so I see this per-tag option comming. recordedTags:... sounds reasonable to me.

Den lør. 11. aug. 2018, 11:53 skrev Adrian Cole notifications@github.com:

@adriancole https://github.com/adriancole I should have been more clear. I meant that having a flag in the options of the middleware that says recordHost: true|false so user could decide whether it is meaningful or not to record such information

ah ok. well I see your point now. hmmm I guess if we decided to not make it standard then people will add it anyway like they do today or we can do something like you mention to make it easier. otoh if we were going the route of not recording by default I might be more inclined to list the "standard tags" desired vs a flag for each. this could be programmatic or declarative like http.client.tags=http.method,http.path,http.host I think some people maybe python does this already.

at any rate suspect we should get a sense of who wants the default to include host or more importantly who doesn't. I think we can sort a nice way our regardless of the decision.

— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/zipkin/issues/2167#issuecomment-412264607, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7sAjbbaChpxE_HOY3VBce3zk4SX4Phks5uPqmsgaJpZM4V49h7 .

--

José Carlos

drolando commented 6 years ago

EDIT: Ignore this.

What's the advantage over endpoint.ip? Just that you get a human readable string?

I'm not sure how kubernetes works, but in our paas the hostname is a combination of host ip and docker container id. Which means it's very high cardinality and so indexing would probably be very expensive. And I'd assume this problem to get more common as more companies move to containerized services.

That's the main reason why we don't add the hostname as tag right now.

Sent from my iPhone

On Aug 11, 2018, at 12:06 PM, José Carlos Chávez notifications@github.com wrote:

Yeah that sounds reasonable. For example, in Go http middleware the http.response_size is not recorded by default (because it the response body is a buffer and if you want to measure it you gotta read it and that drains the buffer) so I see this per-tag option comming. recordedTags:... sounds reasonable to me.

Den lør. 11. aug. 2018, 11:53 skrev Adrian Cole notifications@github.com:

@adriancole https://github.com/adriancole I should have been more clear. I meant that having a flag in the options of the middleware that says recordHost: true|false so user could decide whether it is meaningful or not to record such information

ah ok. well I see your point now. hmmm I guess if we decided to not make it standard then people will add it anyway like they do today or we can do something like you mention to make it easier. otoh if we were going the route of not recording by default I might be more inclined to list the "standard tags" desired vs a flag for each. this could be programmatic or declarative like http.client.tags=http.method,http.path,http.host I think some people maybe python does this already.

at any rate suspect we should get a sense of who wants the default to include host or more importantly who doesn't. I think we can sort a nice way our regardless of the decision.

— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/zipkin/issues/2167#issuecomment-412264607, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7sAjbbaChpxE_HOY3VBce3zk4SX4Phks5uPqmsgaJpZM4V49h7 .

--

José Carlos — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

anuraaga commented 6 years ago

I believe this is about adding HTTP host (e.g. zipkin.io) which is generally just one string, not machine hostname which will be one of many load balanced machines and thus high cardinality.

For reference, in Kubernetes, you usually refer to other services as something like my-cool-service.svc.less-cool-namespace.cluster.local. Now that I think about it, if I could convert this into the service name (my-cool-service), I wouldn't need to set localServiceName to Tracing all the time which would be nice.

drolando commented 6 years ago

@anuraaga you're right, I misunderstood the description.

wu-sheng commented 6 years ago

IMO, I prefer the http.host on the client side, which has more meanings. The client-side host is real and makes sense for developer or ops people.

On the other hand, I am not sure the meaning of server-side host.

anuraaga commented 6 years ago

I think there are many systems where the host is the only way, or the only practical way, for a request to reach a server. In such a case I guess the host has meaning in a server-side context too.

Really though, it means that host doesn't provide additional value over the service name where it's generally semantically equivalent - I don't know if I'd recommend tagging http.host but would possibly recommend instrumentation provide an easy way to convert host to service name.

ellispritchard commented 6 years ago

FYI

When annotating incoming requests tapper_plug records the Plug.Conn.host field, which is purely the domain name part, as the http.host annotation. This is decoded by the underlying adaptor, typically cowboy in the Elixir world, from the Host header.

What gets annotated on outgoing HTTP client requests is entirely up to the user of the HTTP client, since the library provides no integrations for any particular clients, nor any helpers (e.g. for translating a URL into annotations).

NB the Host header seen by pods in Kubernetes is under the control of the ingress controller, and therefore will vary, depending on how the ingress controller is configured, i.e. you might see the external service host, but you might not, and indeed the header could even be dropped entirely (though this is not valid HTTP 1.1). Equally, the ingress controller may be configured to send a useful X-Forwarded-For-Host header, or it may not. Hopefully something useful ends up in the request, that can be used for the http.host annotation, but it is unlikely to be consistent across tracer clients, especially across system boundaries.

golonzovsky commented 6 years ago

I would vote for client side http.host, as it would less likely to explode and makes client span complete and easier to reason about. This is especially relevant for calls to external APIs or uninstrumented endpoints.

Need to check kubernetes ingress controller to see if it forwards requests to services (which would be stable as well)..

beckje01 commented 6 years ago

At SmartThings we could use the host for client side. I think it would be best as a default but regardless I will end up adding it soon as something we track a few reasons I feel we need it:

codefromthecrypt commented 4 years ago

We've recently had this concern again in Brave, where client-side host/authority tag could be handy https://github.com/openzipkin/brave/issues/950

I'd kind of like to just land this soon as it seems most agree a client-side tag referring to this data is already in use. So, this is more a formalization effort.

We could consider a tag "server.host/name", which could be more general (for RPC and MySQL) too...

This wouldn't prevent an additional/alternative tag for "http.authority/host", just be more general purpose for CLIENT/PRODUCER spans (for RPC and MySQL) too.

Using the word "server" allows people to see this is the client point of view, and could help avoid data problems where otherwise someone might tag a SERVER span incorrectly.

While "server.host" or "server.name" isn't perfect (ex it should be "broker.host" in messaging), I think that partial nuance can be cleared up in docs.

For value validation, we can say that this is not an IP address, rather to use remoteEndpoint for that. However, we'd probably end up with port info here, as that's a natural component of URL authority.

Thoughts?

codefromthecrypt commented 4 years ago

To punt a specific plan of action. 👍 this comment if you want the following. reply back if you want something different instead!

define "server.host" tag with only the host (not the port part) as defined on the client side. This should be tagged on CLIENT spans where possible. IP address isn't valid, and should be avoided, mentioning to use remoteEndpoint.ip for that instead.

We choose to not explicitly define "http.host", though people can surely use that as they do now. By not defining "http.host" formally, we encourage more re-use of "server.host"

One tag, "server.host", for the client side will help RPC, HTTP and broker type communications, and be less complex to search and cheaper to index vs two similar ones.