Open codefromthecrypt opened 6 years ago
+1 for avoiding redundant information, i.e. no port in http.host
. If a user insists to have a port, we could define an optional property like http.authority
which is clearer.
The Host
header may contain a port that does not match the port of the request; e.g. if it has gone through a load balancer.
If you add a http.host
field, I think it should contain the full contents of the host (or :authority
) header as sent/received by/from the client.
Mixed feelings here. In server side you do not always know the host whereas in client side host can be a domain, an IP or just a name (if in a docker compose network). What I would prefer as a user is to have an boolean option which is true when I want to record the host and false when I don't both in server and client side instrumentation.
In any case, being explicit about host not including the port is important.
Does that make sense?
hi JC so this is http specific, so in addition to the endpoint details. When talking about a tag usually we dont have interdependencies such as this flag means interpret a certain way.
my advice would be to stick with http host or authority as they are defined in http semantics.. then decide whether or not it should be standard or not (ex recorded by default). this would simplify the issue I think.
hi JC so this is http specific
I would expect so from the http.
prefix. The Host
header is also HTTP specific. I don't see the issue/question?
@adriancole I should have been more clear. I meant that having a flag in the options of the middleware that says recordHost: true|false
so user could decide whether it is meaningful or not to record such information.
In all setup I have been dealing with (20 services) http.host
is not so relevant because in client side it is usually serviceName
what you check. In server side it can be anything (IP, AWS instance id or localhost). I see this tag in both cases as optional.
@adriancole https://github.com/adriancole I should have been more clear. I meant that having a flag in the options of the middleware that says recordHost: true|false so user could decide whether it is meaningful or not to record such information
ah ok. well I see your point now. hmmm I guess if we decided to not make it standard then people will add it anyway like they do today or we can do something like you mention to make it easier. otoh if we were going the route of not recording by default I might be more inclined to list the "standard tags" desired vs a flag for each. this could be programmatic or declarative like http.client.tags=http.method,http.path,http.host I think some people maybe python does this already.
at any rate suspect we should get a sense of who wants the default to include host or more importantly who doesn't. I think we can sort a nice way our regardless of the decision.
Yeah that sounds reasonable. For example, in Go http middleware the
http.response_size
is not recorded by default (because it the response
body is a buffer and if you want to measure it you gotta read it and that
drains the buffer) so I see this per-tag option comming. recordedTags:...
sounds reasonable to me.
Den lør. 11. aug. 2018, 11:53 skrev Adrian Cole notifications@github.com:
@adriancole https://github.com/adriancole I should have been more clear. I meant that having a flag in the options of the middleware that says recordHost: true|false so user could decide whether it is meaningful or not to record such information
ah ok. well I see your point now. hmmm I guess if we decided to not make it standard then people will add it anyway like they do today or we can do something like you mention to make it easier. otoh if we were going the route of not recording by default I might be more inclined to list the "standard tags" desired vs a flag for each. this could be programmatic or declarative like http.client.tags=http.method,http.path,http.host I think some people maybe python does this already.
at any rate suspect we should get a sense of who wants the default to include host or more importantly who doesn't. I think we can sort a nice way our regardless of the decision.
— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/zipkin/issues/2167#issuecomment-412264607, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7sAjbbaChpxE_HOY3VBce3zk4SX4Phks5uPqmsgaJpZM4V49h7 .
--
José Carlos
EDIT: Ignore this.
What's the advantage over endpoint.ip? Just that you get a human readable string?
I'm not sure how kubernetes works, but in our paas the hostname is a combination of host ip and docker container id. Which means it's very high cardinality and so indexing would probably be very expensive. And I'd assume this problem to get more common as more companies move to containerized services.
That's the main reason why we don't add the hostname as tag right now.
Sent from my iPhone
On Aug 11, 2018, at 12:06 PM, José Carlos Chávez notifications@github.com wrote:
Yeah that sounds reasonable. For example, in Go http middleware the
http.response_size
is not recorded by default (because it the response body is a buffer and if you want to measure it you gotta read it and that drains the buffer) so I see this per-tag option comming.recordedTags:...
sounds reasonable to me.Den lør. 11. aug. 2018, 11:53 skrev Adrian Cole notifications@github.com:
@adriancole https://github.com/adriancole I should have been more clear. I meant that having a flag in the options of the middleware that says recordHost: true|false so user could decide whether it is meaningful or not to record such information
ah ok. well I see your point now. hmmm I guess if we decided to not make it standard then people will add it anyway like they do today or we can do something like you mention to make it easier. otoh if we were going the route of not recording by default I might be more inclined to list the "standard tags" desired vs a flag for each. this could be programmatic or declarative like http.client.tags=http.method,http.path,http.host I think some people maybe python does this already.
at any rate suspect we should get a sense of who wants the default to include host or more importantly who doesn't. I think we can sort a nice way our regardless of the decision.
— You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/zipkin/issues/2167#issuecomment-412264607, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7sAjbbaChpxE_HOY3VBce3zk4SX4Phks5uPqmsgaJpZM4V49h7 .
--
José Carlos — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
I believe this is about adding HTTP host (e.g. zipkin.io) which is generally just one string, not machine hostname which will be one of many load balanced machines and thus high cardinality.
For reference, in Kubernetes, you usually refer to other services as something like my-cool-service.svc.less-cool-namespace.cluster.local
. Now that I think about it, if I could convert this into the service name (my-cool-service
), I wouldn't need to set localServiceName
to Tracing
all the time which would be nice.
@anuraaga you're right, I misunderstood the description.
IMO, I prefer the http.host
on the client side, which has more meanings. The client-side host is real and makes sense for developer or ops people.
On the other hand, I am not sure the meaning of server-side host.
I think there are many systems where the host is the only way, or the only practical way, for a request to reach a server. In such a case I guess the host has meaning in a server-side context too.
Really though, it means that host doesn't provide additional value over the service name where it's generally semantically equivalent - I don't know if I'd recommend tagging http.host
but would possibly recommend instrumentation provide an easy way to convert host to service name.
FYI
When annotating incoming requests tapper_plug records the Plug.Conn.host
field, which is purely the domain name part, as the http.host
annotation. This is decoded by the underlying adaptor, typically cowboy
in the Elixir world, from the Host
header.
What gets annotated on outgoing HTTP client requests is entirely up to the user of the HTTP client, since the library provides no integrations for any particular clients, nor any helpers (e.g. for translating a URL into annotations).
NB the Host
header seen by pods in Kubernetes is under the control of the ingress controller, and therefore will vary, depending on how the ingress controller is configured, i.e. you might see the external service host, but you might not, and indeed the header could even be dropped entirely (though this is not valid HTTP 1.1). Equally, the ingress controller may be configured to send a useful X-Forwarded-For-Host
header, or it may not. Hopefully something useful ends up in the request, that can be used for the http.host
annotation, but it is unlikely to be consistent across tracer clients, especially across system boundaries.
I would vote for client side http.host
, as it would less likely to explode and makes client span complete and easier to reason about. This is especially relevant for calls to external APIs or uninstrumented endpoints.
Need to check kubernetes ingress controller to see if it forwards requests to services (which would be stable as well)..
At SmartThings we could use the host for client side. I think it would be best as a default but regardless I will end up adding it soon as something we track a few reasons I feel we need it:
We've recently had this concern again in Brave, where client-side host/authority tag could be handy https://github.com/openzipkin/brave/issues/950
I'd kind of like to just land this soon as it seems most agree a client-side tag referring to this data is already in use. So, this is more a formalization effort.
We could consider a tag "server.host/name", which could be more general (for RPC and MySQL) too...
This wouldn't prevent an additional/alternative tag for "http.authority/host", just be more general purpose for CLIENT/PRODUCER spans (for RPC and MySQL) too.
Using the word "server" allows people to see this is the client point of view, and could help avoid data problems where otherwise someone might tag a SERVER span incorrectly.
While "server.host" or "server.name" isn't perfect (ex it should be "broker.host" in messaging), I think that partial nuance can be cleared up in docs.
For value validation, we can say that this is not an IP address, rather to use remoteEndpoint for that. However, we'd probably end up with port info here, as that's a natural component of URL authority.
Thoughts?
To punt a specific plan of action. 👍 this comment if you want the following. reply back if you want something different instead!
define "server.host" tag with only the host (not the port part) as defined on the client side. This should be tagged on CLIENT spans where possible. IP address isn't valid, and should be avoided, mentioning to use remoteEndpoint.ip for that instead.
We choose to not explicitly define "http.host", though people can surely use that as they do now. By not defining "http.host" formally, we encourage more re-use of "server.host"
One tag, "server.host", for the client side will help RPC, HTTP and broker type communications, and be less complex to search and cheaper to index vs two similar ones.
Right now,
http.method
,http.path
andhttp_status.code
(when not success) are typically added by default in zipkin tracers. This limits the size (cost) of network bandwidth, storage, indexing and lookup especially as spans are recorded at high rates. Should we advise tracers addhttp.host
also? If so should it include the port? Should we do it both client and server side?Details
http.host
(as we define it), is an alternative to IP lookup, just the domain portion. Some libraries (I don't know all of them, but for example armeria census addhttp.host
by default.We've had @golonzovsky request for this to be added by default so that he doesn't have to change instrumentation policy. This can make sense, but we have to be careful as for example how host is defined will impact our ability to search and also add indexing cost to everyone.
For example, we already record port as endpoint.port. By not also including port in
http.host
, we avoid some cardinality concerns and some search concerns as well (ex http.host=foo won't match http.host=foo:80). While people expect DNS names to be stable there's certainly a chance hostnames can be mostly random instance names. This would add indexing load to everyone, in a way that is different thanhttp.method
which is fixed except weird protocols.I think it would be good advice to at least parse
http.host
as we define it, ex not including redundant port info, but data policy should be widely considered. I'll cherry pick some folks for feedback.. I don't want to scare people about data cardinality.. ex we already recordhttp.path
, and that has unlimited cardinality in a way worse thanhttp.host
. If we decide to do this, I think we should also be parsimonious as possible.. For example, is this something we could only add to either the client or server side and not both by default?For those who want
http.host
, or are recording it already, please mention your use case!Thanks much cc @openzipkin/core @drolando @wu-sheng @marcingrzejszczak @jkschneider @jettify @cppKin @ellispritchard @chris-vale-ck @mosesn @jplock @ivantopo @nicmunroe @james-callahan @chuan-yun @shimamoto @hyleung @bogdandrutu @trustin @anuraaga