[Feature] Per-Store TLS configuration in Thanos Query

daviddyball commented 5 years ago

Request

Feature request to add support for per-store tls configuration.

Use Case

I have store running alongside query on my main "monitoring" cluster and we don't use TLS for comms between pods on the same cluster. I also want to pull metrics from my remote Prometheus instances which exist on different clusters, but these endpoints are TLS encrypted because they are cross-cluster

Suggestion

Change the --store argument to use a more definitive style for remote endpoints e.g.:

grpc://<ip-address>:<port>
grpc+tls://<ip-address>:<port>

This would allow specifying per-store TLS settings and allow a combination of secure and insecure comms for store endpoints.

Related PR from @adrien-f : https://github.com/improbable-eng/thanos/pull/899

daviddyball commented 5 years ago

Definitely going to try and throw something together for this.

Ideas so far are:

a) use JDBC-like URLs to configure individual stores (e.g. grpc+tls://<addr>:<port>/?Option1=y&Option2=n)

or

b) use file-based configuration like the object store.

@bwplotka If I go for option "b" should I re-use the object-store file config stuff, or are you happy for me to implement a query-store specific config file?

bwplotka commented 5 years ago

So I think we talked about it here: https://github.com/improbable-eng/thanos/pull/899

I would say for this specifc use case, let's start with option B and essentially limit, by saying --store are explicitly out of this feature.

I re-use the object-store file config stuff

What? We should never mix object store stuff with discovery stuff. In fact you have ready FileSD logic which is compatible with Prometheus FileSD. It allows arbitrary labels, so we can implement easily this:

One solution would be to do that work in the FileSD config and have labels such as tls_server_name to override TLS config.

I would say this is the way to go. What do you think? (:

Thanks for proposing this and sorry for massive delay!

daviddyball commented 5 years ago

@bwplotka Apologies, I meant I can look at writing some similar code to what is in place for FileSD to implement FileStoreConfig or something :-P Definitely don't want to mix the two 😆

bwplotka commented 5 years ago

Yea, I would say no specifc config just literally special labels we document properly in our docs

povilasv commented 5 years ago

IMO people just coud just run envoy/nginx/etc proxy and handle TLS outside of Thanos. As then we don't have to work on cert reloading all the different options, etc.

@bwplotka wdyt?

mleklund commented 5 years ago

I attempted to use nginx, but it appears that the grpc call is not sending a host header. Can anyone confirm this is the case? (Or at least anyone with better knowledge of the code and go than myself)

povilasv commented 5 years ago

Why does it need that?

I think you should be able to proxy all the traffic to GPRC backend.

Something like:

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent"';

    server {
        listen 80 http2;

        access_log logs/access.log main;

        location / {
            # Replace localhost:50051 with the address and port of your gRPC server
            # The 'grpc://' prefix is optional; unencrypted gRPC is the default
            grpc_pass grpc://localhost:50051;
        }
    }
}

mleklund commented 5 years ago

In my case I would like to park the grpc on a virtual host, instead of running a special nginx instance just for grpc.

povilasv commented 5 years ago

I don't think we pass host headers nor we have an option to configure them. So right now I don't think it will work

daviddyball commented 5 years ago

@mleklund Try getambassador.io for a nice GRPC proxy. It's what we ended up using to get around the problem of not being able to mix TLS and non-TLS stores.

@povilasv I'm not sure I understand your objection to adding this feature. Do you think it serves no purpose?

povilasv commented 5 years ago

I'ts not an objection, I just wanted to point out that it won't work right now.

Whether we want to add host header or no is a seperate discussion and it will take a bit of time.

BTW a lot of people right now run isitio / linkerd, the sidecar per service approach, so I think it's not a wide problem.

mleklund commented 5 years ago

Up to this point nginx ingress has worked for us and I would rather not complicate things any more then I need to. I have been looking at envoy and ambassador, and it may be a good fit down the road. For the moment I am using port based tcp for nginx ingress since most of our use cases are behind vpn or firewall.

I was curious if I was missing something or if there were plans to pass header info since grpc is built on top of http(2).

benjaminhuo commented 5 years ago

@daviddyball , I'm not sure if I understand your use case correctly, but does this pr https://github.com/improbable-eng/thanos/pull/508 solve your problem?

There are flags like --grpc-server-tls-cert/--grpc-client-tls-cert added by this pr in query doc and other components' docs

daviddyball commented 5 years ago

@benjaminhuo I don't think it does solve the problem here. The issue is that I'd like query to talk to different stores in different ways (e.g. TLS vs non-TLS) and there is no way to configure per-store endpoint TLS configuration

povilasv commented 5 years ago

For now we won't be implementining Per Store TLS configuration. Our general suggestion is run sidecar proxy envoy / nginx / etc.

Hopefully we will update docs soon. We will close the ticket once the docs are there.

caarlos0 commented 5 years ago

I would also like something like this.

My idea, which may be wrong, is:

a "central" cluster which gathers metrics from all other clusters
query instances on the central cluster talk to query instances in other clusters over TLS
query instances on the central cluster also talks to local cluster's store and sidecars
- right now this needs to be over TLS as well, and needs to have one endpoint for each sidecar, an extra hop et al

ideally, I would like to define something like:

    - query.cluster1.foobar.com:443
    - query.cluster2.foobar.com:443
    - query.cluster3.foobar.com:443
    - query.cluster4.foobar.com:443
    - dnssrv+_grpc._tcp.thanos-sidecar-grpc.thanos.svc.cluster.local
    - dnssrv+_grpc._tcp.thanos-query-grpc.thanos.svc.cluster.local

so, it seems that without setting --grpc-client-tls-secure the secure endpoints won't work, and if I enable it the dnssrv records won't work :(

as far as I understand there is no way of doing this right now.

benjaminhuo commented 5 years ago

Try getambassador.io for a nice GRPC proxy. It's what we ended up using to get around the problem of not being able to mix TLS and non-TLS stores.

@daviddyball I'm considering using ambassador as an edge proxy in each cluster too. Ambassador has tls termination ability, and we can use it to accept tls connection from central monitoring cluster.

But for the central monitoring cluster, is it possible for ambassador to originate tls connection to the remote cluster(egress grpc tls traffic)?

Would you share your config? thanks.

@bwplotka @caarlos0 , I found a blog post https://improbable.io/blog/thanos-architecture-at-improbable saying

In terms of requests to a remote cluster, Envoy has been used securely to proxy our request between many clusters; meaning that a request will go via an Envoy sidecar, an edge Envoy egress proxy, and over the public internet to an edge Envoy ingress proxy (all over a secure connection). This latter will then forward the request onto the Thanos Sidecar to retrieve the data for the time period and labels specified. All of this is done using server streaming gRPC.

Just wondering where I can find envoy config examples like this for ingress and especially egress tls traffic?

Thanks very much Ben

daviddyball commented 5 years ago

@benjaminhuo I had thanos-query talking directly to ambassador directly, so didn't use Ambassador for outbound. You'd probably want to use Envoy directly for that part, no?

benjaminhuo commented 5 years ago

@benjaminhuo I had thanos-query talking directly to ambassador directly, so didn't use Ambassador for outbound. You'd probably want to use Envoy directly for that part, no?

Yeah, I'm considering using envoy to manage outbound tls.

Then you've the same tls client config for all the store specified in thanos query including the one in the same cluster as thanos query?

daviddyball commented 5 years ago

We're not validating TLS at the moment @benjaminhuo , so there's no TLS config on query, just a list of store endpoints :+1:

benjaminhuo commented 5 years ago

We're not validating TLS at the moment @benjaminhuo , so there's no TLS config on query, just a list of store endpoints 👍

I see :)

bwplotka commented 5 years ago

It looks like there are mix of things in this feature requests.

So is it about host header or per store TLS config?

daviddyball commented 5 years ago

Per-store TLS config. The ability to specify TLS for some stores and disable it for others. We've since worked around the issue, so if it doesn't fit with the intended design of Thanos, then it can be closed as Won't Fix :+1:

bwplotka commented 5 years ago

How you worked around it? If you will quickly describe this it would be awesome for users (:

I guess some forward proxy?

daviddyball commented 5 years ago

We ended up switching off TLS for the time being and using Ambassador as a proxy.

The eventual goal is to use cross-cluster services via Istio, so we can just go direct without proxies.

jqueuniet commented 5 years ago

We ran into this issue with a Thanos in Kubernetes. We have stores in the cluster we wanted to communicate with in plaintext, and others in remote (bare metal) datacenters with which we wanted to use TLS.

We worked around the problem with one Envoy pod per remote datacenter, terminating the TLS connection in the cluster hosting the Thanos query. Each is tagged like a local store to be automatically picked up by Thanos' dnssrv discovery mechanism.

Still, it would probably be simpler to do away with the proxies and let Thanos speak directly (and securely) with the remote endpoints.

vbeausoleil commented 4 years ago

We ran into this issue with a Thanos in Kubernetes. We have stores in the cluster we wanted to communicate with in plaintext, and others in remote (bare metal) datacenters with which we wanted to use TLS.

We worked around the problem with one Envoy pod per remote datacenter, terminating the TLS connection in the cluster hosting the Thanos query. Each is tagged like a local store to be automatically picked up by Thanos' dnssrv discovery mechanism.

Still, it would probably be simpler to do away with the proxies and let Thanos speak directly (and securely) with the remote endpoints.

@jqueuniet I'm running into the exact same issue. I've been struggling to get a working Envoy configuration in place. Would you be open to sharing your solution here or discussing it privately? Thanks!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

omerlh commented 4 years ago

Any update? Why this issue become stale? I have the exact same issue...

bwplotka commented 4 years ago

It became stale as no one said anything for long time (: I marked this thread as pinned as it's not solved and it's fair feature request.

We recently added way for different TLS for Ruler Query API and alertmanagers, so we might want to follow same logic and allow that for querier. Help wanted (:

stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

bwplotka commented 4 years ago

We want this (:

streetmapp commented 4 years ago

Curious how far along this is? I need to add the endpoints that are in my other clusters and the only way to do this was to enable the grpc-tls flag on the Query deployment. This rendered the local instances unreachable. Sure I can turn these into ingresses themselves, but would be great to just use this functionality.

caarlos0 commented 4 years ago

@bwplotka is there anyone working on this?

if no, I would like to give it a try 🙏

bwplotka commented 4 years ago

I don't think anyone is on it at the moment. It's not easy though (: We probably need something like this: https://github.com/bwplotka/thanos/blob/3b71c2d59a013198f5eff83143bc7eddc1b7c4f0/cmd/thanos/rule.go#L104 but for gRPC on Querier.

eswdd commented 4 years ago

I've got the base of this done: https://github.com/G-Research/thanos/pull/1

I still need to work on e2e tests - happy to redirect the PR here if you're interested.

eswdd commented 4 years ago

Have popped some questions on Slack around guidance on where people would like to go with this..

stale[bot] commented 4 years ago

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

tomasfse commented 4 years ago

I attempted to use nginx, but it appears that the grpc call is not sending a host header. Can anyone confirm this is the case? (Or at least anyone with better knowledge of the code and go than myself)

Same issue is happening here, since we are using a nginx with "virtual hosts" to handle traffic in the K8S clusters, sending no host header is a big problem

Any advance or idea about how this could be solved?

Thanks

mleklund commented 4 years ago

that is what I saw, the grpc implementation they are using does not send SNI information as near as I can tell.

On Wed, Apr 22, 2020 at 10:35 AM tomasfse notifications@github.com wrote:

I attempted to use nginx, but it appears that the grpc call is not sending a host header. Can anyone confirm this is the case? (Or at least anyone with better knowledge of the code and go than myself)

Same issue is happening here, since we are using a nginx with "virtual hosts" to handle traffic in the K8S clusters, sending no host header is a big problem

Any advance or idea about how this could be solved?

Thanks

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/thanos-io/thanos/issues/977#issuecomment-617853775, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWMPIVCZGQMXR5MAMZDE6TRN4FCRANCNFSM4HBOSYOQ .

-- Mike Eklund | DealNews

stale[bot] commented 4 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

cuotos commented 4 years ago

This would still be incredibly useful as we have had to add additional infrastructure and config in order to force all our endpoints to use SSL even if the API was in the same cluster and would normally use local endpoints.

It would be ideal to be able to set the grpc client config, per store, possibly via conf file, else via multiple flags somehow.

stale[bot] commented 4 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

tomasfse commented 4 years ago

This is required to allow sidecars to share load balancers or reverse proxies or ingress controllers with other services

It would be great to fix it :D

stale[bot] commented 4 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

tomasfse commented 4 years ago

I would like to emphasize the importance of this issue, because fewer balancers means less money in infrastructure

dansimone commented 4 years ago

Pinging this issue again... This would be a highly desirable feature.

bwplotka commented 4 years ago

What happened to the PR that was for it? (: Looks like it got abandoned? :thinking:

This means help wanted still, definitely high on the priority list, thanks for pinging!

dansimone commented 4 years ago

@eswdd indicated on March 6th that he was trying to get feedback on PR: https://github.com/G-Research/thanos/pull/1 Had there been any feedback from the Thanos maintainers about whether that approach had their blessing? I'm asking on behalf of anyone who would be interested in picking up where that PR left off...

bwplotka commented 4 years ago

We definitely missed that. I looked now and it looks like a good direction, but it might need a formal proposal or just more discussion on potential alternatives.

We need to remember that Querier does not only targets StoreAPI targets, but actually also RulesAPI targets and in future much more e.g TargetsAPI etc...

So far we have also multiple options to configure Querier targets: --store --store-strict --store.sd-files --rule (hidden)

By introducing yet another we just go into a more ambiguous configuration style. Instead, in order to achieve per target TLS options, we might need a total overhaul of our flags and most likely introduce just a single config file.

Initial Proposal

Introduce --targets.config (and --targets.config-file) which will expect YAML config very similar to Prometheus ScrapeConfig but for Querier targets and gRPC not HTTP. There you will be able to configure things like:
- tls_config
- proxy_url
- basic_auth
- gRPC metadata
- file_sd_configs
- static_configs
- dns_sd_configs
- something to choose what type of target it is: StoreAPI, RulesAPI, auto etc.
- if the targets are strict or not.
Remove totally capabilities of specifying client gRPC TLS from flags.

Open question: Do we want to leave older flags like --store, --store-strict, --store.sd-files --rule (hidden) I think I would remove --store-strict, --store.sd-files --rule so more complex ones, but keep --store for very basic scenarios and compatibility.

I will take it to our Thanos Community Sync (you are welcome to join! https://thanos.io/v0.13/contributing/community.md/#thanos-community-meeting) and maybe transfer this to formal proposal so everyone is aware of this change.

Initial thoughts? Would that work for everyone? (:

thanos-io / thanos