grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0
4.04k stars 524 forks source link

Use fully qualifed domain name for storage backends when running on Kubernetes #1726

Open electron0zero opened 2 years ago

electron0zero commented 2 years ago

Problem:

Kubernetes configures ndots=5 under /etc/resolv.conf options, that means any hostname that has less then 5 dots in it is looked up locally first, and then finally it is resolved.

As per our tests, we have seen 11 local lookups before DNS resolution is done (see https://github.com/grafana/jsonnet-libs/pull/762 for more details). These local lookups add latency, take up space in the conntrack table and add extra load to the kube-dns service.

Tempo makes a tons of API calls to storage backend, and these extra DNS in storage backend calls adds to read and write latency for tempo.

We started looking into it because we noticed DNS Lookup Errors (see https://github.com/grafana/tempo/issues/1462), which we improved with https://github.com/grafana/tempo/pull/1632, and lowered ndots=3 to deal with it.

There are two ways to solve this:

  1. Use FQDN (hostname tld with dot) in storage backend config, and that will hint kube-dns to skip local search and do a direct lookup
  2. lower ndots but it can break kubernetes service discovery [we can't do that]

solution 2 is not ideal so we are only left with solution 1.

This is NOT an issue if you are running tempo on VMs, In most Linux distros ndots is not configured, and default value of 1 is used. if cat /etc/resolv.conf | grep ndots shows nothing, that means it's set to default.

See more:

Solution:

We start using FQDN in storage config, and suggest our users to use FQDN when running tempo on kubernetes.

Azure Blob Storage: If we just add dot at the end of TLD, we get 400 (see more https://github.com/grafana/tempo/issues/1462#issuecomment-1224082681), that happens because Azure doesn't allow trailing dot in Host header. We need to trim dot from Host header in Azure Client

Google Cloud Storage: with trailing dot in Host header, we get 301 with location: set without trailing dot in url.

AWS S3: trailing dot is not allowed in Host header, we get 404 The specified bucket does not exist error back from S3

Minio: not tested yet

AlexDCraig commented 2 years ago

Is the idea behind this to set all other configs aside from Azure to have a dot at the end of their base storage URLs? In other words, Azure will remain with a lot of DNS hits? Since looking at the Azure golang library one cannot set arbitrary request headers, and as such it doesn't seem possible using that library to get around the problem that they will not permit dots at the end of the container URL

electron0zero commented 2 years ago

I tested other providors, and looks like we will need to do what curl does with trailing dot. curl trims traling dot from Host header but keeps it in URL, see https://github.com/curl/curl/commit/3a6142865fec3c54cd7081ada86c93c135e4b32f for details.

We can put trailing dot in config url, but trim it from Host header by using a custom transport in the client.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

Aki0x137 commented 1 month ago

@electron0zero I'd like to look into this issue/enhancement. Could you clarify the specific changes required? Looking at the thread, I'm assuming expectation is to:

  1. Add assertion to ensure that storage backend URIs use fully qualified domain names (FQDNs).
  2. Append a trailing dot (.) at the end of the domain name( if not present) to ensure non-local DNS lookup.
  3. When making API calls to the storage backend, strip the trailing dot from host header.
electron0zero commented 4 weeks ago

@Aki0x137 yup, we want to warn users on startup about it, if the backend URI is not FQDN. we want to have the same behaviour that curl has for handling this.