m-lab / prometheus-support

Prometheus configuration for M-Lab running on GKE
Apache License 2.0

OAuth proxy for SSO to eliminate basic auth #613

Open stephen-soltesz opened 4 years ago

stephen-soltesz commented 4 years ago

Discussed in the 2020-01-15 Monitoring eng meeting: basic auth for prometheus and related services has reached its limit. It should be possible to deploy an oauth proxy that supports SSO.

Some pointers (not guaranteed to be helpful, but evidence that folks are doing this kind of thing):

robertodauria commented 4 years ago

Thank you @stephen-soltesz for researching this!

I'm adding another one to the list. This solution is basically the same as your first link (oauth2_proxy) but it lets the NGINX Ingress Controller (which we already use for TLS) create the nginx configuration automatically by adding a couple of annotations:

nkinkade commented 3 years ago

@robertodauria: I have lost state on the general work you were doing for this issue. My general understanding is that you hit an impasse which, at the time, seemed insurmountable. Is this correct? I notice that Prometheus 2.23.0 makes the React UI the default one. There is likely a way to revert to the "Classic" UI, but I haven't tried to update Prometheus to find out if this will be a blocker for updating Prometheus.

Would you update this issue with some details on where this stands and the block you hit? You have told me via VC and/or Slack what the block was, but apparently it didn't stick. I want to see if we can find a way to make this work to clear the way for upgrading Prometheus (and to just generally have a better auth flow).

Issue #595 is related/blocked by this issue.

nkinkade commented 3 years ago

Is this the complete set of changes you arrived at before hitting a wall?

https://github.com/m-lab/prometheus-support/compare/sandbox-roberto-oauth-proxy

robertodauria commented 3 years ago

@nkinkade Yes. I have proposed a VC to discuss the status and next steps, please let me know if that works or you prefer rescheduling.

stephen-soltesz commented 3 years ago

I'm excited to see this progressing. :+1:

nkinkade commented 3 years ago

Notes from a meeting with @robertodauria this morning regarding blockers for this issue:

Problem 1

Problem 2

Problem 3

Problem 4

Services that access prometheus that would need to support oauth:

nkinkade commented 3 years ago

It appears that Prometheus now (as of the latest version, 2.24.0) natively supports TLS and HTTP basic auth:

https://prometheus.io/docs/prometheus/latest/configuration/https/
https://inuits.eu/blog/prometheus-server-tls/

I will experiment with this next week. It's unclear why basic auth would fail with the new, default React UI behind the nginx ingress, yet work with the new native support. But I may be missing something.

This would not take us to the next level for authentication or authorization, but could possibly be a workaround to start using the latest UI and newer versions of Prometheus.

nkinkade commented 3 years ago

@robertodauria, @stephen-soltesz: now that issue #595 is resolved and closed, no longer blocking us from safely updating Prometheus to the latest versions, I propose that we either close or backlog this issue due to the difficulty of making OAuth work for all possible consumers of Prometheus data in both clusters (platform and prometheus-federation). Do either of you have an opinion on this?

stephen-soltesz commented 3 years ago

An SSO-like solution would greatly simplify the team's access to these services without sacrificing operational security. The basic auth solution was only a little better than nothing, and it adds friction every time I need to open prometheus directly. The experience is more halting the less familiar one is with these systems. So, I fear it will discourage new team members from using and contributing to the monitoring system as a well curated whole. This could be one of the "broken windows" that make it easier to rationalize the next partial or "hacky" approach. See: https://en.wikipedia.org/wiki/Broken_windows_theory

I would like a better picture of what we would have to change in order to use the oauth proxy. Or, phrased differently, if we were starting from scratch, how would we organize the pieces of the system to work the way we want?

For example, once we're able to retire mlab-ns's usage (admittedly an indefinite period in the future), then rebot and grafana should be able to access prometheus directly over the private GKE network (right?), and then the question is whether a GKE network could communicate privately with the platform cluster or not.

nkinkade commented 3 years ago

In my mind, at the moment, the question isn't whether using OAuth would be a benefit or not; it clearly would be. The problem lies in migrating two clusters to that authentication mechanism, along with any services that need access to Prometheus metrics in either cluster, and sometimes both at the same time. And this from a system (Prometheus) that does not support OAuth logins, requiring additional proxying services in the cluster or at the edge, adding possibly a non-trivial amount of complex technical overlay to the overall system.

My recollection isn't that we implemented HTTP basic authentication as a solution that was "only a little better than nothing". Indeed, my recollection is that early on we had no authentication at all, and didn't consider it any sort of major shortcoming, other than the possibility of bots or malicious people swamping our Prometheus instances with expensive queries or, in a less likely scenario, someone leveraging near real-time telemetry to attempt to compromise the overall health of the system more effectively. I don't believe we felt that "security", as such, was the major consideration; rather, we just wanted to put up some basic barrier to prevent flagrant abuse, either unintentional or intentional, and HTTP basic auth provided that pretty well without the need to do much else (other than use the nginx proxy, which today isn't even necessary any longer).

That aside, probably the biggest blocker right now is our "federated" scraping. We scrape the platform cluster from the prometheus-federation cluster using basic auth, which is one of just a couple of auth mechanisms Prometheus even supports for scraping, the other being a bearer token. Possibly we could obviate the need for scraping the platform cluster at all if we were to migrate platform cluster alerting to the platform cluster?
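
To make the two scrape-side options concrete, here is a minimal Python sketch of the `Authorization` header each one puts on a request. It only illustrates the HTTP mechanics; the user name, password, and token are placeholders, not values from any M-Lab configuration.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the header for HTTP basic auth: base64 of "user:password"."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def bearer_auth_header(token: str) -> dict:
    """Build the header for bearer-token auth: the token is sent as-is."""
    return {"Authorization": "Bearer " + token}


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(basic_auth_header("scraper", "not-a-real-password"))
    print(bearer_auth_header("not-a-real-token"))
```

Both are static credentials attached to every request, which is why they suit a scraper. An interactive OAuth login redirect has no equivalent that a scraper could follow on its own.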

Then we have, as you mentioned, mlab-ns. I'm sure that there is some python module that would allow us to use OAuth there, but how much effort do we want to put into any engineering work on the mlab-ns code base?

As it stands, all of these components already natively support HTTP basic authentication:

I'm curious where you currently find the major "friction" in using Prometheus with HTTP basic authentication? For me, there used to be some friction in constantly needing to open some "AAA Prometheus Links" dashboard in Grafana, but a year or two ago I simply added every link to my startup pages to "prime" my browser session to already be authenticated to all clusters and in all projects. And as of today, I have even further eliminated all friction by installing the Chrome extension "Multipass", which lets me use a regex to match sites with some stored basic auth credentials. Granted, this only works in my local browser, but then again I can't think of a time I really needed to access Prometheus in some way other than through my browser.

robertodauria commented 3 years ago

> For example, once we're able to retire mlab-ns's usage (admittedly an indefinite period in the future), then rebot and grafana should be able to access prometheus directly over the private GKE network (right?), and then the question is whether a GKE network could communicate privately with the platform cluster or not.

There was a proposal to make Grafana contact Prometheus over private networks only, but that would mean creating inter-project networks and I thought you said it's something we want to avoid. Sandbox Grafana today can access staging/production Prometheus with HTTP basic auth, even if they are in separate GCP projects, which I think is a desirable behavior and something we want to keep.

The last time I spent a few days trying to make this work, I could log into Prometheus with my @measurementlab.net account with oauth-proxy, but then the proxy didn't like the way Grafana passed the OAuth token. Specifically, the oauth-proxy logs mentioned it could not find a valid token in the request, even after enabling "Forward OAuth Identity" and/or "With credentials" in the Data Source configuration, and I could not find a way to work around that.

A possible next step to figure out how the components interact (and if the problem is Grafana's handling of the OAuth token) could be writing a small PoC which we deploy behind nginx/oauth-proxy with an endpoint that sends an authenticated request on behalf of the currently logged in user to another service with OAuth authentication enabled. If that works, it points towards an issue with how Grafana sends the token along with the request rather than something wrong in oauth-proxy's configuration.
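
A rough Python sketch of the core of such a PoC is below: given the headers of a request that has already passed through the proxy, find the user's token and build the headers for the onward request. The header names are assumptions; which of them the proxy actually sets depends on how it is configured, so treat the list as something to adjust, not as a description of the sandbox setup.

```python
from typing import Mapping, Optional

# Assumed header names (hypothetical for this deployment); adjust to match
# whatever the proxy is actually configured to pass upstream.
TOKEN_HEADERS = ("X-Forwarded-Access-Token", "X-Auth-Request-Access-Token")


def extract_token(incoming: Mapping[str, str]) -> Optional[str]:
    """Return the OAuth token found in the incoming headers, or None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in incoming.items()}
    auth = lowered.get("authorization", "")
    if auth.lower().startswith("bearer "):
        return auth[len("bearer "):].strip() or None
    for name in TOKEN_HEADERS:
        value = lowered.get(name.lower())
        if value:
            return value
    return None


def outbound_headers(incoming: Mapping[str, str]) -> dict:
    """Build headers for a request made on behalf of the logged-in user."""
    token = extract_token(incoming)
    if token is None:
        raise ValueError("no OAuth token found in the incoming request")
    return {"Authorization": f"Bearer {token}"}
```

If a request built this way is accepted by the second service while Grafana's is rejected, comparing the two sets of headers should show what Grafana sends differently.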

I agree having an SSO for the components of our infrastructure to seamlessly work with a single user authentication would be nice, even if perhaps I don't see it as fundamental - HTTP basic auth is a bit inconvenient and doesn't allow managing users but it works. I'm not convinced there is something wrong with how the pieces of our infrastructure are currently organized, but rather an issue with Grafana or oauth-proxy (or how I had configured them at the time) that we need to figure out before we can make progress on this.

nkinkade commented 3 years ago

As far as I can tell, HTTP basic auth is actually working quite well, is natively supported by all our tools, and suits our needs, minus a single use case: operators accessing the Prometheus Expression Browser in their local browsers. Is this correct? If so, I've found that a simple browser extension eliminates issues for this use case. The more I think about it, OAuth feels more like a protocol designed to be interacted with directly by a person in their browser (fitting this use case I mention), but not so much as an authentication mechanism for non-humans (mlab-ns, rebot, Grafana accessing the backend datastore, etc.).

robertodauria commented 3 years ago

Regardless of whether we are going to implement this or not, last night I managed to get sandbox Grafana configured and correctly passing the authentication to a Prometheus data source in "Server" mode, meaning that it's the backend that connects to Prometheus and not the user's browser -- all of our data sources use server mode already.

Changes I made:

This, however, doesn't solve the problem of cross-project authentication (e.g. sandbox grafana connecting to staging/prod prometheus). We should either have one instance of oauth2_proxy that's shared across the projects, or multiple instances of oauth2_proxy with common session storage (I see Redis is an option).

This is likely the only place on the Internet where all of this is documented: https://stackoverflow.com/questions/62559654/grafana-oauth-proxy-still-displaying-native-login-form

nkinkade commented 3 years ago

@robertodauria: This is great! Thank you for doing this. Would you be able to finish setting this up such that interactive users (OAuth) access URLs like:

https://prometheus.mlab-.measurementlab.net
https://prometheus-platform-cluster.mlab-.measurementlab.net

... and automation (basic auth... mlab-ns, rebot, etc.) use URLs like the following (path doesn't matter, just something logical):

https://prometheus.mlab-.measurementlab.net/basicauth
https://prometheus-platform-cluster.mlab-.measurementlab.net/basicauth

This will require us to manually modify automated services to use the new URL, but will keep things simpler for humans.
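
As an illustration of what that client-side change might look like, here is a small Python sketch that derives the URL a caller should use from a base URL and an API path. The host name is hypothetical, and the sketch assumes the `/basicauth` prefix simply sits in front of the normal Prometheus API paths; whether the prefix is stripped before the request reaches Prometheus is not settled here.

```python
from urllib.parse import urlsplit, urlunsplit

# Path prefix for automation, as proposed above ("just something logical").
BASIC_AUTH_PREFIX = "/basicauth"


def endpoint_for(base_url: str, api_path: str, automated: bool) -> str:
    """Return the URL a client should call.

    Interactive (OAuth) users keep the plain URL; automated clients
    (basic auth) get the same API path under the /basicauth prefix.
    """
    parts = urlsplit(base_url)
    prefix = BASIC_AUTH_PREFIX if automated else ""
    path = prefix + "/" + api_path.lstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    base = "https://prometheus.example.net"  # hypothetical host
    print(endpoint_for(base, "/api/v1/query", automated=False))
    print(endpoint_for(base, "/api/v1/query", automated=True))
```

Keeping the prefix in one constant means each automated service only needs its base URL handling changed, not every call site.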

nkinkade commented 3 years ago

@robertodauria: Do you consider this issue resolved now? If so, would you close this issue?

robertodauria commented 3 years ago

Not yet. We've made some progress, but most of the issues you outlined at https://github.com/m-lab/prometheus-support/issues/613#issuecomment-759744705 are still outstanding - for example, all the clients have to be updated to use the -basicauth URL, and the platform-cluster Prometheus isn't using OAuth yet.

stephen-soltesz commented 3 years ago

https://github.com/m-lab/prometheus-support/pull/795 completed support for oauth or basicauth on prometheus federation and platform-cluster prometheus (including datasources in Grafana).

What do you all think about making this the standard configuration for prometheus in the data-processing cluster as well as for alertmanager? Is there some configuration that would make it easier to make this the default?