Support for Envoy Filters (Particularly Lua)

projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.

https://projectcontour.io

Apache License 2.0

3.73k stars 680 forks source link

Support for Envoy Filters (Particularly Lua) #3006

Closed ccravens closed 7 months ago

ccravens commented 4 years ago

I'm requesting the support of Envoy Lua filters with Contour. Lua filters provide developers the ability to apply sometimes complex and necessary modifications to headers for security or functional purposes that may not be supported via Contour YAML-based declarative manifests.

Use Cases

I have 2 use cases in which Lua filters would solve my problem that is currently not possible with the current Contour manifests:

Location Header Re-Write

While Contour offers the ability to set or remove headers, there is currently no functionality to modify headers. In my case, I need to guarantee that any Location header redirects to https rather than http. This problem is well described and documented in this issue: https://github.com/projectcontour/contour/issues/2776

The following inline Lua code applied as a filter would solve this problem:

function envoy_on_response(response_handle)
   local locationHeader = response_handle:headers():get("location")
   if locationHeader ~= nil then
      modified = string.gsub(locationHeader, "http://", "https://")
      response_handle:headers():replace("location", modified)
   end
end

SameSite Cookie Attribute

Even more complex than header modification, updated security constraints in the Chrome browser do not allow setting iframe cookies without the proper SameSite attribute. While a lot of applications are working to apply this, it would be best to enforce this requirement at the proxy level to guarantee that this works across all applications (https://github.com/envoyproxy/envoy/issues/12970).

The following inline Lua code applied as a filter would solve this problem:

         function envoy_on_response(response_handle)
            function set_cookie_header(rh)
              s = rh:headers():get("Set-Cookie")
              if string.find(s, "mycookie=" ) and not string.find(s, "SameSite") then
                s = s.."; SameSite=None; Secure"
              end
              return s
            end
            if response_handle:headers():get("Set-Cookie") then
              secure_cookie_header = set_cookie_header(response_handle)
              response_handle:headers():replace("Set-Cookie",secure_cookie_header)
            end
          end

Workaround

Initially, I was working to inject a separate XDS configuration service that would apply Lua filters to the configuration objects that Contour creates in envoy. However, due to Contour regenerating the DAG everytime a configuration update is made and re-applying the entire DAG to envoy, it would over-write any manual configurations that were applied. Therefore, the only way that we can truly get Lua filters properly supported is if they are managed by Contour.

Sample Solutions

Istio supports an EnvoyFilter CRD which supports Lua inline code as well (https://istio.io/latest/docs/reference/config/networking/envoy-filter/). The following is a sample of an Istio CRD with inline Lua code:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: secure-session-cookie-filter
  namespace: istio-system
spec:
  workloadLabels:
    app: myingressgateway
  filters:
    - listenerMatch:
        listenerType: GATEWAY
      filterType: HTTP
      filterName: envoy.lua
      filterConfig:
        inlineCode: |
          function envoy_on_response(response_handle)
            function set_cookie_header(rh)
              s = rh:headers():get("Set-Cookie")
              if string.find(s, "mycookie=" ) and not string.find(s, "SameSite") then
                s = s.."; SameSite=None; Secure"
              end
              return s
            end
            if response_handle:headers():get("Set-Cookie") then
              secure_cookie_header = set_cookie_header(response_handle)
              response_handle:headers():replace("Set-Cookie",secure_cookie_header)
            end
          end

jpeach commented 4 years ago

xref #2475

jpeach commented 4 years ago

xref https://github.com/projectcontour/contour/issues/1015

youngnick commented 4 years ago

Thanks for the detailed request @ccravens!

I'm not opposed in principle to adding Lua filters, I can see that it's a great way for advanced users to make sure that Contour can meet their use case quickly.

However, I've got a few worries about it:

We have an outstanding issue with the xDS protocols (#1176) that means that if we pass an invalid config to Envoy, Envoy will send a NACK, which we will drop on the floor. Because of the way the messages work, this means that, even if you fix the issue, Envoy will stop accepting updates until you restart it. This interacts here with the possibility that a user of this feature will make a typo, and that will stop your Envoys accepting updates until you restart them. I'm not sure what to do here, I guess we would need to validate the Lua somehow before accepting it? Which implies that we would have to parse the Lua, which means we'd need to import a Lua parser? That seems like a pretty big change.
I'm also worried about support. Personally, I hardly know Lua at all, and I'm not sure of how we'd go supporting any Lua scripts people write. "You're using Lua so we can't help you" does not seem like a solution that makes anyone happy in this case either.
Lastly, I think that adding Lua support to what is, in most cases, a shared proxy, seems like a good way to allow straightforwarding denial-of-service attacks on the infrastructure, as long as you can submit a proxy with Lua to Kubernetes. For HTTPProxy, we've worked pretty hard to ensure that denial-of-services are as difficult as possible.

Again, I totally understand that, sometimes, Lua is the quickest and easiest way to resolve problems. We use it inside Contour for a workaround at the moment.

But once we start accepting user-provided extensions like this, security and validation become orders of magnitude harder. I'd like to find ways to ensure that we're not giving ourselves a future supportability problem before going ahead, and at the very least be making a conscious decision about risk if and when we do.

I guess what I'm looking for here is some thoughts from you, @ccravens, or other interested parties on how we could address these concerns. Maybe I'm wrong and invalid Lua scripts don't share the same properties that other config does, and the invalid config concern is not valid here. And maybe there are other things we can learn from Istio and other Envoy-based projects to help with this?

stevesloka commented 4 years ago

I agree with @youngnick that users can get into a whole bunch of trouble with this, but at the same time we need to strike a balance with features. If support is an issue, one of the troubleshooting steps can be "disable your lua filters" and then test from there.

Maybe we could add some validation commands to Contour (or a kubectl plugin) to help out and implement the config load check tool (https://www.envoyproxy.io/docs/envoy/latest/start/install/tools/config_load_check_tool.html).

I think we could also make the lua bits an op-in for users so there's a flag that needs set before allowing them to be processed. The other concern is who is allowed to add filters to what bits, so maybe a new CRD needs thought up to restrict who can create the scripts and subsequently, where they can be applied to HTTPProxies (via Conditions in a ref?).

ccravens commented 4 years ago

Hello @youngnick thanks for the response and feedback! I was hoping to get feedback to really think this through with the community and go back and forth on ideas.

I first want to address the last point about the denial of service. I can imagine if you have several engineers that are working in your k8s environment, you wouldn't want someone applying a config that could bring down the entire cluster. My thought on this is we could maybe create a separate CRD for the HTTP filters (much like Istio) and allow RBAC policies to take care of that. So if your other engineers are working with different service account (SA) credentials, you can add the RBAC policies for their SA to create a Contour HTTPFilter custom resource only if you have validated the config that they are publishing. Let me know thoughts there.

In regards to the first point of invalid configs, I was talking to @stevesloka about this and I think we have 2 issues: 1) Compile-time issue (envoy crashes due to incorrect config) and 2) Run-time issue where envoy doesn't crash because it is a valid config but then the filter doesn't do what the developer intended because of bad code. So this is tough, the question I think is "how much rope should we give our users to hang themselves with". So my honest thoughts and opinions on that matter is I would rather have the option to do what I need to do than not, and as a kubernetes engineer it's my responsibility to ensure I'm doing things correctly in my cluster at the end of the day. I feel, for example, Microsoft tried to do this with their Visual Studio platform for decades, trying to hide the details of building and deploying code. Then after 20 years developers organically developed and migrated to Visual Studio Code because it is light-weight and gives the developers the capabilities they need, even though developers may have issues they have to deal with because they are responsible for their own build processes now. If we abstract too much as a project and prevent developers from doing things out of fear they may apply incorrect configs, then we're forcing them to find hacks around the problem which is why I created the issue in the first place. For me it wasn't an option of "Lua filters seem nice, so I guess I'll try them", it was really a 2-3 month process of looking through envoy docs, envoy code, contour docs, contour code, attending meetings and identifying a solution to my problem as quickly as possible. I initially started working on hacks like injecting a bootstrap service and updating envoy configs alongside contour (which is way worse lol). So what we've had to do instead is fork Contour and implement those Lua filters in the contour code and build and deploy our own contour image for now because these feature are necessary really to the success of what we're doing even as a company (if we can't get these features, we can't survive as a company). So given that, we will definitely work to find anyway we can to accomplish this. So I think having these features available as a standard managed feature, even if it can break envoy, is better than forcing the community to find hacks and/or forking and managing their own forks of the project.

So I think maybe the conversation is a bit more philisophical about the design philosophies of the Contour project rather than this specific issue. My concern is if we barr developers too much from doing what they need to do and don't at least offer these avenues, people will eventually either create (worse) hacks to solve their problems or fork (like we unfortunately have had to do for now). I don't know if this particular github issue is a good venue for this conversation but would love to hear from the community on this as a philosphy of the contour project. Thanks!

ccravens commented 4 years ago

@stevesloka

I think we could also make the lua bits an op-in for users

I think that's a great idea as well! Create a gauge of "how much risk are you willing to accept for your particular contour deployment" might be a great way to strike that balance!

farconada commented 4 years ago

+1 for opt-in

youngnick commented 4 years ago

@ccravens Thanks for the detailed response!

You're right that this comes to the heart of the philosophy for Contour. How much configurability do we allow - or more bluntly - how much do we trust our users to not make mistakes? The thing I've always tried to keep in mind is that everyone makes mistakes, and we need to do our best to ensure that the inevitable mistakes will produce the least worst outcome possible. How flexible I've been willing to be on this has been something that I've needed to change as I came to understand more about the ways in which Contour was actually being used. I used to be much more conservative about this than I am now, but have come to understand that because the proxy is used as a fixup point for many other behaviors, we have to allow more that we used to.

I still think there's value in not allowing things that will almost always be bad - but I think that there's definitely value in having this greater flexibility available, and it may well be worth the cost, especially as an escape valve you can open if you really need it.

The converse is that with greater flexibility comes greater risk. There are three risks I am worried about here:

the risk that Lua scripts passed to filters are broken and break your Contour install, with no way to know that except for seeing things fail. I tend to be of the opinion that the thing that accepts the config should be the thing that tells you if it's broken. That's particularly hard in this case, as Contour needs to pass the script to Envoy before we really find out if it works. I would really prefer Contour not to have to parse Lua and validate it, but I don't see any way around that if we go ahead with this. The idea of taking user-submitted code and running it with no validation in a critical piece of network infrastructure just sounds every risk alert alarm in my brain.
the risk that Lua scripts can be misused by a malicious actor - How much should we protect against this? I don't think that RBAC is a panacea here. We need to be specific about what the risks are, and what we are doing about them.
lastly, and most importantly for the future of Contour, once we have this, there will be much less incentive for users to log feature requests. Both of the use cases you mentioned in your opening comment could be (and in my opinion, really should be) features added to the main HTTPProxy document. I'd like to have a way to say to people - "Hey, you're using Lua to do something cool, have you thought about asking for that in HTTPProxy?"

I don't want you to have to maintain a fork to meet your usecases, but I also want you to be able to help us build something that solves the problems you have for everyone. I'm worried that once Lua is available, people will go off and build their own Lua solutions, and not bring feature requests here.

It seems like if we were to have an opt-in, it needs to use this escape valve idea. It's intended for you to open it to keep you going while you come back and get your usecase added into Contour proper, at which time you retire your Lua and move back to vanilla. The main way to encourage this is in the documentation and probably a log message to remind people to bring the usecase back. WDYT?

I'm interested in wider thoughts on this and other topics from other operators here.

Would you come and make feature requests here if you could just solve problems with Lua?
What are your feelings if we were to say "we don't have the skills to help you with your Lua filters"?
What information would you want about if the Lua filters made it Envoy? How would you expect to be able to troubleshoot your filters?

xaleeks commented 3 years ago

Resurrecting this to understand it a bit more

the risk that Lua scripts passed to filters are broken and break your Contour install, with no way to know that except for seeing things fail.

That’s a risk with anything that plugs in and runs inline. If we support this at some point, I also don’t think it’s contour’s job to do code validation on lua scripts. I think our position would need to be you run it at your own risk and we’ll do our best to surface the errors to the users through logs. Not sure if there are static code analysis tools of lua scripts but probably overkill as part of contour. That said, I do think being able to manipulate headers in incoming requests very useful. How about we table this for just a bit now

lastly, and most importantly for the future of Contour, once we have this, there will be much less incentive for users to log feature requests. Both of the use cases you mentioned in your opening comment could be (and in my opinion, really should be) features added to the main HTTPProxy document. I'd like to have a way to say to people - "Hey, you're using Lua to do something cool, have you thought about asking for that in HTTPProxy?"

I don’t grasp the point here about less incentive to log feature requests if we deliver LUA support. Can you expand?

youngnick commented 3 years ago

That’s a risk with anything that plugs in and runs inline. If we support this at some point, I also don’t think it’s contour’s job to do code validation on lua scripts. I think our position would need to be you run it at your own risk and we’ll do our best to surface the errors to the users through logs. Not sure if there are static code analysis tools of lua scripts but probably overkill as part of contour. That said, I do think being able to manipulate headers in incoming requests very useful. How about we table this for just a bit now

Agreed, but until we solve #1176, if we send a bad Lua script to Envoy, it could NACK the update, and then there will be no more updates. Once we have NACK support, then I'd be willing to proceed in the way you're talking about.

lastly, and most importantly for the future of Contour, once we have this, there will be much less incentive for users to log feature requests. Both of the use cases you mentioned in your opening comment could be (and in my opinion, really should be) features added to the main HTTPProxy document. I'd like to have a way to say to people - "Hey, you're using Lua to do something cool, have you thought about asking for that in HTTPProxy?"

I don’t grasp the point here about less incentive to log feature requests if we deliver LUA support. Can you expand?

What I'm worried about is that people will solve their own problems with Lua, and never tell us they were problems, and we will miss out on implementing features in Contour that would be better suited in the mainline project, rather than someone's custom Lua.

andrewzah commented 3 years ago

@youngnick

What I'm worried about is that people will solve their own problems with Lua, and never tell us they were problems, and we will miss out on implementing features in Contour that would be better suited in the mainline project, rather than someone's custom Lua.

Then it would be best to remove all custom Lua code support. If you allow users to run custom code, then they will run custom code.

Agreed, but until we solve #1176, if we send a bad Lua script to Envoy, it could NACK the update, and then there will be no more updates. Once we have NACK support, then I'd be willing to proceed in the way you're talking about.

In my personal opinion, this should be documented, but permitted. It can even be gated behind a feature flag / environment variable like CUSTOM_LUA_MAY_BREAK_ENVOY.

Personally, Contour's behavior here comes across as nanny-like to me. Why can I not break my own envoy? Given the choice of forking Contour with our own changes (which we had to do) or using a good alternative, I would prefer to not use Contour at all due to this sort of attitude towards downstream users.

edit1: I would also like to mention that foreseeing all possible use cases is impossible, and shouldn't really be the scope of contour. This is precisely what things like custom Lua code are for.

youngnick commented 3 years ago

Thanks for the comment, @andrewzah. There's quite a bit to respond to, so I'll quote-and-reply.

Then it would be best to remove all custom Lua code support. If you allow users to run custom code, then they will run custom code.

Look, you're not wrong, and I guess I'm being too precious here. We should be encouraging people to tell us what they're fixing, not making it so they can't do their fixes. But I would really like a way to encourage people to bring us the things they're doing with Lua once we implement it. I'll think more about this, but I withdraw this objection.

In my personal opinion, this should be documented, but permitted. It can even be gated behind a feature flag / environment variable like CUSTOM_LUA_MAY_BREAK_ENVOY.

The problem with this break of Envoy is that it's silent. You'll have no way of knowing if you've triggered it without checking to see that your Envoy is taking changes. I'm firmly of the opinion that #1176 is one of Contour's most serious issues, but fixing it is non-trivial (which is why it's not fixed already).

As the lead maintainer, my main concern is about support. What would be your expectation about support in the event you're using the feature flag? I'm certainly not Lua-fluent, and I don't think we have heaps of Lua fluency in the maintainer team, so we wouldn't be able to offer good support to help people troubleshoot their Lua scripts.

It doesn't seem like it would be good to have to tell people "Oh, you're using Lua? Sorry I can't help you" either.

Personally, Contour's behavior here comes across as nanny-like to me. Why can I not break my own envoy? Given the choice of forking Contour with our own changes (which we had to do) or using a good alternative, I would prefer to not use Contour at all due to this sort of attitude towards downstream users.

I've been working pretty hard to make Contour less nanny-like, and more configurable, but we do have to support whatever we add into the product. I understand it's frustrating to not get the feature you need, but as maintainers we have to keep up with the future of the project, not just any one feature. Looks like here I haven't done a good job though, and for that I apologise.

Next, you and I may be comfortable with breaking our Envoy, but how do we ensure that we're protecting people that aren't? I don't know if a feature flag is enough, maybe we need a full explanation of what behavior you may see if something goes wrong, and link to that from the docs about the feature flag.

I'm also really sorry that you found it necessary to fork Contour, I'd like to hear more from you about the reasons why you did, and if what we could do to bring you back inline with upstream. Please feel free to contact me one-on-one either on the email listed in my profile, or on Kubernetes slack @youngnick.

The tl;dr here is not "We're never going to do this", but "I really want to understand what people's expectations are before we do it". Like always in open source, no is temporary, but yes is forever.

stevesloka commented 3 years ago

I think this is super important to figure out, however, I don't feel like we're moving this issue in any direction after looking through the responses.

It seems like there are a few options to solve this problem:

Allow users to specify their own lua in Contour. Get over the risks and just allow it with a feature flag or something if the thought is that it's needed. Call out in the support policy how Contour will handle it.
Take the problem presented and add it as a feature (via lua or something) in Contour by expanding the HTTPProxy API. Contour then needs to maintain all these features going forward. Also need to see where they fit in other APIs (e.g. GatewayAPI?). We can add unit/feature/e2e tests and have it run as part of normal CI to make sure something doesn't break in the future versions of Envoy.
Help expand Envoy to support the use-cases needed. If Envoy has an API, then that allows a way to configure the feature easily.

Steps to move forward:

I think we should prioritize the xDS work to take that off the table. I'm happy to jump in here any help out if the resolution of that changes how we approach the solutions listed previously.
Reach out upstream to Envoy to see if the in-flight PRs can be helped out any to get them moving again.
Provide support/feedback to users who may be maintaining a fork in the meantime (e.g. the last community call with @andrewzah).

youngnick commented 3 years ago

I think that we can take the presented problems and add them as features, yes. @andrewzah has logged #3657 to cover the location header rewrite, and we already have #2965 covering adding Samesite to cookie attributes.

That doesn't solve the larger "Can we have custom Lua" problem, but does avoid all of my objections. :) It feels like the pattern here could be "We need this feature, seems like Lua would do it, here's what we recommend" as the way to add features with Lua. If Lua is an implementation detail, not the whole feature, then the concerns about supportability for the wider community don't arise.

In terms of the xDS work, #1176 covers the wider work here. I did some investigations a little while ago, and the problem is now not as bad as it was - if Envoy rejects config updates, it will recover once you fix the config that's being rejected. We still have quite poor support for actually telling what config failed however, which @wrowe will be working on soon. I'm hoping that we can get a better feedback loop for xDS errors, which will make me much more comfortable about being more risky with sending config that might be invalid to Envoy.

xaleeks commented 3 years ago

I think one way to achieve the balance is that we allow LUA but it would be unsupported by Contour team. These are treated as 3rd party plugins. In the future, they can be submitted for official support but this mechanism is non-existent right now. But there's still a lot of risk with this and we are evaluating other approaches as well. WASM is another alternative.

lou-lan commented 3 years ago

I'll add another example here, this is not strictly speaking lua code, a collection of multiple conditions. Hope there is some help.

(AND (IN HOST www.google.com google.com) (EQ URL /search) (EQ SRC_IP 114.114.114.114) (OR (RANGE HEADER uid 100 999) (RANGE HEADER uid 10000 11000)))

youngnick commented 3 years ago

We can have Lua support, and have it be not supported by the Contour team, but we would need to validate that sending a broken Lua to Envoy doesn't produce an xDS NACK, because that has bad consequences.

Remember, those consequences look like this from a user perspective:

Multiple users are using Contour, everything is great.
Someone adds a Lua extension that isn't accepted by Envoy for some reason.
Envoy NACKs the change. This NACK is dropped on the floor by Contour aside from a single log line.
Because Contour and Enovy's xDS conversation is now out of sync, NO updates at all from Contour get to Envoy until the Lua is fixed. This includes endpoints and clusters being updated, secrets expiring, absolutely all config that Contour sends to Envoy will not be accepted fully as long as there is a NACKed response.

I'll be honest, it could only be some updates not getting to Envoy (probably updates to the HCM in question), but that's almost worse, as it will be even harder to find.

Until there is some way for users to know that the Lua they've provided isn't working, I think that providing this is just too risky.

This is not just a matter of me not wanting to have to support Lua, it's that I don't want advanced users who turn on this functionality to have the ability to silently break their Contour installation by having an invalid Lua config.

I can definitely see the utility in having an extensible way for knowledgeable users to add onto Contour, but until we can ensure that we're not producing an unsolvable problem for them in the event they make a mistake, I'm pretty worried about it.

pkit commented 2 years ago

@youngnick I'm not sure about the reasoning here. If the users are experienced enough to enable this one, they should be able to debug it. Right now, AFAIU, there is no way to supply a custom Envoy config at all? So, essentially envoy features like WASM filters (let's be frank, Lua just doesn't cut it) are barred? ExtensionService could be a good alternative, but because it cannot alter responses or even react to them (unless you go the "copy the whole body into memory" route, which is a no go) its usability is pretty limited...

Seems like forking Contour is the only way to proceed here (ex. #399 ), pretty sad.

youngnick commented 2 years ago

@pkit I'll be honest, I'm much more hopeful for WASM filters in this regard, as they are a bit less open-ended than Lua.

You are correct that there's no way to supply custom Envoy config, because of the problem listed above.

We (Contour) still have tidying up to do before we can do WASM either, because of the problem I listed above. Forking Contour to add the ability to configure Lua won't help you troubleshoot a broken Lua, because Contour isn't capable of telling you that it's broken at the moment. Yes, this is a terrible state to be in, and we are trying to fix it, but I can't say "sure we'll just give you this" right now, because then your users can silently break your entire Contour install through no fault of their own. This isn't just "the Lua you pass won't work", it's no changes in the cluster will go to Contour's Envoys.

If you do decide to fork, I understand. But please make sure you're careful about the Lua that's supplied to Contour.

tsaarni commented 10 months ago

There was a couple of unfortunate vulnerabilities in NGINX couple of months ago, that could be interesting for us too, for learning purposes.

CVE-2023-5043 and CVE-2023-5044 were related to user injecting malicious Lua and config snippets https://www.armosec.io/blog/cve-2023-5043-nginx-ingress/. The last one CVE-2022-4886 was related to path sanitization that accidentally allowed injecting config snippets, leading into remote code execution https://hackerone.com/reports/1620702.

github-actions[bot] commented 8 months ago

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

After 60d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

Mark this Issue as fresh by commenting
Close this Issue
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

github-actions[bot] commented 7 months ago

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

After 60d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

Mark this Issue as fresh by commenting
Close this Issue
Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack