lucaslorentz / caddy-docker-proxy

Caddy as a reverse proxy for Docker
MIT License

[2.0] An invalid service configuration takes down the entire cluster #153

Closed. piaste closed this issue 4 years ago.

piaste commented 4 years ago

Hi! Congratulations on shipping 2.0. We've been giving it a try in our staging environment and it has worked quite well; Caddy v2 in particular seems to resolve some sporadic TLS problems we occasionally encountered with Apple devices.

We have, however, also run into a serious issue that didn't affect version 1.x of the proxy. If even ONE service in the Swarm cluster includes an invalid label beginning with caddy, it does not get skipped - instead, the generated Caddyfile gets rejected as a whole, and all services become unavailable (!).

As you can imagine, this is extremely dangerous. Of course, we can use the staging environment to check that the deployed docker-compose files include valid labels. But suppose you need to change some labels to update the configuration of one microservice: one syntax error, and you take down the whole cluster.

As I mentioned, we're pretty sure that version 1.x of the proxy didn't have this problem - we screwed up labels plenty of times, and other unrelated services stayed up.

Caddy v2 exposes a caddy validate CLI command to check that a Caddyfile is valid, so IMO the Caddyfile update loop should proceed something like this pseudocode:

```
var caddyfile = ""
for service in services do
    var new_caddyfile = merge_caddyfiles(caddyfile, generate_caddyfile(service))
    if caddy.validate(new_caddyfile) then
        caddyfile = new_caddyfile
    else
        print("Service {service.name} contains invalid caddy labels and will be ignored")

caddy.load(caddyfile)
```
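
For concreteness, here is a minimal Go sketch of the proposed loop. It is not the plugin's actual code: it assumes a caddy binary on PATH, and the service names and Caddyfile fragments are purely illustrative stand-ins for real Docker services and labels.

```go
// Minimal sketch of the proposed per-service validate-and-merge loop.
// Not the plugin's actual code; it assumes a `caddy` binary on PATH, and the
// service names and Caddyfile fragments below are purely illustrative.
package main

import (
	"log"
	"os"
	"os/exec"
)

// validateCaddyfile writes the candidate Caddyfile to a temp file and shells
// out to `caddy validate` with the caddyfile adapter.
func validateCaddyfile(content []byte) error {
	tmp, err := os.CreateTemp("", "Caddyfile-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(content); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return exec.Command("caddy", "validate", "--adapter", "caddyfile", "--config", tmp.Name()).Run()
}

func main() {
	// Caddyfile fragments as they would be generated from each service's labels.
	fragments := map[string]string{
		"service-a": "http://a.example.com {\n\treverse_proxy container-a:8080\n}\n",
		"service-b": "http://b.example.com {\n\tfoo bar\n}\n", // invalid directive
	}

	var caddyfile []byte
	for name, fragment := range fragments {
		candidate := append(append([]byte{}, caddyfile...), fragment...)
		if err := validateCaddyfile(candidate); err != nil {
			log.Printf("Service %s contains invalid caddy labels and will be ignored: %v", name, err)
			continue
		}
		caddyfile = candidate
	}

	// Only fragments that validated end up in the final Caddyfile.
	os.Stdout.Write(caddyfile)
}
```

Validating the merged candidate (rather than each fragment in isolation) also catches conflicts between services, at the cost of one caddy validate run per service.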

lucaslorentz commented 4 years ago

Thanks for trying v2 and reporting issues.

That behavior of validating the Caddyfile before applying it is still present in v2. But, just like in v1, it is not enabled by default.

Can you please double-check whether process-caddyfile is enabled in your container? Right after the container starts, it logs the values of all its settings to stdout; it prints something like [INFO] ProcessCaddyfile: <VALUE>.

If ProcessCaddyfile is false, please try enabling it using the env variable CADDY_DOCKER_PROCESS_CADDYFILE=true or the CLI argument -process-caddyfile.

piaste commented 4 years ago

Thank you so much for the quick response!

I must apologize for missing that configuration option in the README. It works as expected, so I'm going to close this issue. I can confirm for a fact that we never enabled the similar option in 0.3.6, so either our memories of trying to get the rewrite rules to work are confused (which we should take as the null hypothesis, in such cases), or something else was going on.

I do have to wonder, though: is there any reason why this option is not enabled by default? I can't come up with a scenario where having a single service fail could be worse than having the entire network fail.

francislavoie commented 4 years ago

FWIW, I think having it validate by default seems ideal, with an opt-out if needed (though I don't really know why you'd need one).

lucaslorentz commented 4 years ago

Please see below a description of the two behaviors.

Let's consider a scenario where I have site-A and site-B being served by Caddy.

process-caddyfile off (default behavior): If anything is wrong in the Caddyfile, the whole new Caddyfile will fail to be applied, and Caddy will continue running its last configuration.

If I break site-A: both site-A and site-B keep running with their last valid configuration, and none of the new changes are applied until the broken labels are fixed.

process-caddyfile on: Splits a Caddyfile into several different server blocks, validates each server block independently, and removes invalid server blocks from the final Caddyfile.

If I break site-A: site-A's server block is dropped and site-A stops being served, while site-B receives the new configuration normally.
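
To make the splitting step concrete, here is a rough Go sketch (again, not the plugin's actual implementation, and the site names are illustrative): it splits a generated Caddyfile into top-level server blocks by tracking brace depth, so that each block can be validated and dropped independently.

```go
// Rough illustration of splitting a Caddyfile into top-level server blocks.
// Not the plugin's actual implementation; it assumes braces never appear
// inside quoted strings, which holds for simple label-generated Caddyfiles.
package main

import (
	"fmt"
	"strings"
)

// splitServerBlocks returns the top-level server blocks of a Caddyfile.
func splitServerBlocks(caddyfile string) []string {
	var blocks []string
	var current strings.Builder
	depth := 0
	for _, line := range strings.Split(caddyfile, "\n") {
		if depth == 0 && strings.TrimSpace(line) == "" {
			continue // skip blank lines between blocks
		}
		current.WriteString(line)
		current.WriteString("\n")
		depth += strings.Count(line, "{") - strings.Count(line, "}")
		if depth == 0 {
			blocks = append(blocks, current.String())
			current.Reset()
		}
	}
	return blocks
}

func main() {
	// site-A has a broken directive, site-B is fine (illustrative content).
	caddyfile := "http://site-a.example.com {\n\tfoo bar\n}\n\nhttp://site-b.example.com {\n\treverse_proxy container-b:80\n}\n"

	for i, block := range splitServerBlocks(caddyfile) {
		// Each block would then be passed to a validation step (like the
		// caddy validate sketch earlier); blocks that fail are left out of
		// the final Caddyfile.
		fmt.Printf("server block %d:\n%s\n", i, block)
	}
}
```

Combined with the validation helper sketched earlier, this reproduces the behavior described: the invalid site-A block is dropped while site-B keeps being served with its new configuration.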

I hope everything is still working as I described above. I see value in both approaches. I'm not sure which one should be the default.

Maybe we should have another option as well.

There is some information here: https://github.com/lucaslorentz/caddy-docker-proxy/issues/120

piaste commented 4 years ago

If anything is wrong in the Caddyfile, the whole new Caddyfile will fail to be applied, and Caddy will continue running its last configuration.

Ah, I now understand the intended behaviour, and I agree it's a valid choice of default.

After reading through the linked thread, I still believe that enabling it by default seems like the better option. The restarting issue is the main concern here - it's very unexpected, in a Docker environment, to restart a perfectly working container and get back a broken one.


Separately, I will also reopen the issue, because that behaviour doesn't seem to work in our staging environment. Here's what I tried (using the most recent build):

  1. Set the env var CADDY_DOCKER_PROCESS_CADDYFILE to false
  2. Checked that service A is online
  3. Added the invalid label caddy.foo=bar to service B

Expected behaviour: service A is online
Actual behaviour: service A is offline

The container is running with 3x replicas and the Docker socket proxy described here, and the following environment:

CADDY_CONTROLLER_NETWORK=caddy_16
CADDY_DOCKER_VALIDATE_NETWORK=false
DOCKER_API_VERSION=1.37
DOCKER_HOST=tcp://socket:2375
CADDY_DOCKER_PROCESS_CADDYFILE=true

Here are the startup logs after each step. In case it's relevant: the network caddy_16 does exist, so I'm not sure why it throws an error, but it's the only network the service is connected to, so that environment variable should be redundant anyway.

  1. Set the env var CADDY_DOCKER_PROCESS_CADDYFILE to false

    2020/06/02 22:14:31 [ERROR] Failed to parse CADDY_CONTROLLER_NETWORK caddy_16: invalid CIDR address: caddy_16
    2020/06/02 22:14:31 [INFO] Running caddy proxy server
    {"level":"info","ts":1591136071.615152,"logger":"admin","msg":"admin endpoint started","address":"tcp/localhost:2019","enforce_origin":false,"origins":["localhost:2019","[::1]:2019","127.0.0.1:2019"]}
    {"level":"info","ts":1591136071.61538,"msg":"autosaved config","file":"/config/caddy/autosave.json"}
    2020/06/02 22:14:31 [INFO] Running caddy proxy controller
    2020/06/02 22:14:31 [INFO] CaddyfilePath:
    2020/06/02 22:14:31 [INFO] LabelPrefix: caddy
    2020/06/02 22:14:31 [INFO] PollingInterval: 30s
    2020/06/02 22:14:31 [INFO] ProcessCaddyfile: false
    2020/06/02 22:14:31 [INFO] ProxyServiceTasks: false
    2020/06/02 22:14:31 [INFO] ValidateNetwork: false
    2020/06/02 22:14:31 [INFO] Swarm is available: true
    2020/06/02 22:14:31 [INFO] Skipping default Caddyfile because no path is set
    [ERROR] template: :1:2: executing "" at <nil>: nil is not a command
    2020/06/02 22:14:31 [INFO] New Caddyfile:
    http://service.mysite.com {
       reverse_proxy http://container:8080
    }
    https://service2.mysite.com {
       reverse_proxy http://container2:80
    }
  2. Checked that service A is online

  3. Added the invalid label caddy.foo=bar to service B

2020/06/02 22:19:06 [INFO] Skipping default Caddyfile because no path is set
[ERROR] template: :1:2: executing "" at <nil>: nil is not a command
2020/06/02 22:19:06 [INFO] New Caddyfile:
http://service.mysite.com {
   reverse_proxy http://container:8080
}
https://service2.mysite.com {
   reverse_proxy http://container2:80
}

2020/06/02 22:19:06 [ERROR] Failed to convert caddyfile into json config: Caddyfile:40: unrecognized directive: foo
2020/06/02 22:19:06 [INFO] New Config JSON:
2020/06/02 22:19:06 [INFO] Sending configuration to localhost
{"level":"info","ts":1591136346.3176079,"logger":"admin.api","msg":"received request","method":"POST","host":"localhost:2019","uri":"/config/apps","remote_addr":"127.0.0.1:51466","headers":{"Accept-Encoding":["gzip"],"Content-Length":["4"],"Content-Type":["application/json"],"User-Agent":["Go-http-client/1.1"]}}
{"level":"info","ts":1591136346.3180037,"logger":"admin","msg":"admin endpoint started","address":"tcp/localhost:2019","enforce_origin":false,"origins":["localhost:2019","[::1]:2019","127.0.0.1:2019"]}
2020/06/02 22:19:06 [INFO][cache:0xc000a24eb0] Stopped certificate maintenance routine
{"level":"info","ts":1591136346.3186653,"msg":"autosaved config","file":"/config/caddy/autosave.json"}
{"level":"info","ts":1591136346.3191307,"logger":"admin","msg":"stopped previous server"}
2020/06/02 22:19:06 [INFO] Successfully configured localhost
caddy2_caddy.1.inzwmgwgrkv0@swarm-worker00000S    | {"level":"info","ts":1591136346.7973533,"logger":"admin","msg":"stopped previous server"}
lucaslorentz commented 4 years ago

@piaste I will do some tests and fix the behavior. I will probably change the default behavior as well.

Btw, you don't need to set CADDY_CONTROLLER_NETWORK because you're running all containers in standalone mode. Also, CADDY_CONTROLLER_NETWORK must be defined as a CIDR, like 192.168.0.1/24.

If you want, you can switch to distributed mode: you would then have one controller Caddy instance that connects to the Docker host, generates the Caddyfile, and pushes it to the three server Caddy instances. In that case you need to set CADDY_CONTROLLER_NETWORK, as in this example: https://github.com/lucaslorentz/caddy-docker-proxy/blob/master/examples/distributed.yaml

lucaslorentz commented 4 years ago

@piaste The behavior when CADDY_DOCKER_PROCESS_CADDYFILE=false was fixed, and the default value of process-caddyfile was changed to true. Those changes will be included in the next release; for now they're available only in CI-tagged images.

lucaslorentz commented 4 years ago

Released in https://github.com/lucaslorentz/caddy-docker-proxy/releases/tag/v2.1.0

piaste commented 4 years ago

@lucaslorentz

Thanks for the fix! I will check out v2.1 as soon as I can.