Traefik not updating config

adamgraves-choices commented 7 years ago

Hi,

We've got an intermittent issue where Traefik isn't updating the frontend and backend configuration in our Rancher environment.

New stacks and changes to stacks sometimes don't get reflected in the config. Sometimes it resolves itself within approx. 10-60 minutes, but on some occasions we have to restart the Traefik stack. Sometimes that doesn't help, and we have ended up destroying the environment and rebuilding it from scratch to resolve the issue.

Last time it occurred I tested the rancher-metadata service to ensure that was working, and everything looked fine from there.

Anyone else encountering this?

joshuacox commented 7 years ago

I am indeed noticing this behavior. I have noticed that some of my containers are set with really long health checks, and when those are in play I think it tends to exacerbate the problem.

ghost commented 7 years ago

I have the same problem. When I upgrade a server and the IP address changes, it does not get reflected in the Traefik config. Is there a way to manually regenerate the rules and traefik.toml? Currently I restart the Traefik Docker container and the config is correct again, but this is not suitable for production.

joshuacox commented 7 years ago

@rawmind0 any recommendations on how to fix this in situ? I have tried restarting either rancher-traefik or alpine-traefik, or both, with curious results, one of which was being banned from Let's Encrypt by rate limiting :(

I'd like to know if there is a better method, perhaps a command I can run inside one of the containers to force it to reload its configuration without dropping all the certs.

Another thought is that maybe we could have a version of this that keeps all its configs in a convoy-nfs mount.

I know all of this might be moot as well once traefik begins to natively support rancher.

rawmind0 commented 7 years ago

Hi guys,

Sorry about the issues you are suffering. Could you please provide some more details about them?

BTW, inside the alpine-traefik container you can restart traefik or confd without needing to restart the container:

monit restart traefik
#or
monit restart confd
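For anyone who wants to script this from the host rather than opening a shell in the container, here is a minimal Python sketch. It assumes the Docker CLI is available on the host and that the container is reachable by name; the container name used in the usage line is hypothetical. Only the `monit restart` commands come from the comment above.

```python
import subprocess

def monit_restart_cmd(container, process):
    """Build a `docker exec` argv that restarts one monit-managed process."""
    # Only the two process names mentioned in this thread are accepted.
    if process not in ("traefik", "confd"):
        raise ValueError(f"unexpected process name: {process}")
    return ["docker", "exec", container, "monit", "restart", process]

def monit_restart(container, process):
    """Run the command on the host and return its exit code."""
    result = subprocess.run(monit_restart_cmd(container, process), check=False)
    return result.returncode
```

For example, `monit_restart("traefik-traefik-1", "confd")` would restart only confd and leave the Traefik process, and its certificates, untouched.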

ghost commented 7 years ago

At the beginning everything worked fine, but after some time rancher-traefik did not pick up the new IP after a container upgrade (and the resulting IP change). It still had the old IP address for the backend. I am not sure, but it could be related to an update to Rancher version 1.5.3. Currently I am testing the new native Traefik Rancher backend and it looks promising.

snahelou commented 7 years ago

Hello

I have the same problem with Rancher 1.5.3 and Traefik rawmind/alpine-traefik:1.2.3-1

EDIT:

Maybe it's because confd does not refresh metadata:

bash-4.3$ curl http://rancher-metadata
curl: (6) Couldn't resolve host 'rancher-metadata'

BTW, DNS is working in other containers and metadata works.

Because of https://github.com/rancher/rancher/issues/5041 I tried adding search in the Rancher UI. After the upgrade DNS is now working, but confd is always empty :(

rawmind0 commented 7 years ago

Hi @snahelou ...

This is not the cause of the problem. confd is able to resolve the Rancher URI and connect. The problem is with Alpine's curl, not confd. If you run curl http://rancher-metadata.rancher.internal it should work.

Please post the confd logs from /opt/tools/confd/log/confd.log inside the alpine-traefik containers.

Do your services have health checks configured?

snahelou commented 7 years ago

Hello

Yes sorry, dns was not the problem.

I had the following error

2017-05-02T12:42:28Z traefik-traefik-1 /opt/tools/confd/bin/confd[159]: ERROR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist

              {{- $back_status := getv (printf "/stacks/%s/services/%s/containers/%s/health_state" $stack_name $service_name $container) -}}

I removed 2 stacks and the service became available again. It's strange because the stacks were green.

rawmind0 commented 7 years ago

It seems you didn't have health checks configured. Health checks are mandatory: only healthy backends are added to Traefik.
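The failure mode in the log above can be sketched in Python. This is a toy model, not confd itself: `STORE`, the container names and the IPs are made up, and `getv`/`exists` only imitate the template functions of the same name, assuming a missing key raises an error as the log shows. It illustrates why a single container without a `health_state` key can abort the whole render, and how an `exists` check avoids that.

```python
# Toy key/value store standing in for the metadata backend (made-up data).
STORE = {
    "/stacks/web/services/nginx/containers/nginx-1/health_state": "healthy",
    "/stacks/web/services/nginx/containers/nginx-1/primary_ip": "10.42.0.5",
    # nginx-2 has no health_state key, e.g. because no health check is defined.
    "/stacks/web/services/nginx/containers/nginx-2/primary_ip": "10.42.0.6",
}

def getv(key):
    """Return the value for key; raise KeyError when the key is absent."""
    if key not in STORE:
        raise KeyError(f"key does not exist: {key}")
    return STORE[key]

def exists(key):
    """Report whether key is present without raising."""
    return key in STORE

def healthy_backends(stack, service, containers):
    """Collect IPs of healthy containers, skipping any without health_state."""
    ips = []
    for name in containers:
        base = f"/stacks/{stack}/services/{service}/containers/{name}"
        if not exists(f"{base}/health_state"):
            continue  # a bare getv() here would abort the whole render
        if getv(f"{base}/health_state") == "healthy":
            ips.append(getv(f"{base}/primary_ip"))
    return ips
```

With the data above, `healthy_backends("web", "nginx", ["nginx-1", "nginx-2"])` keeps only `nginx-1`, whereas calling `getv` directly on the missing `nginx-2` key raises before anything is written.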

snahelou commented 7 years ago

OK, strange. Health checks were configured, because I use a Jenkins multibranch pipeline and the other branches work well.

Thanks for your support.

Regards

jjscarafia commented 7 years ago

Hi! I've got an intermittent issue very similar to this one, where Traefik isn't updating the frontend and backend configuration in our Rancher environment on every host (only some hosts are updated).

New stacks and changes to stacks sometimes don't get reflected in every host config.

About our configuration:

I have two hosts running Traefik (http://34.201.12.10:8000 and http://54.210.1.168:8000/)
One has the updated configuration (traefik-1) and the other doesn't (traefik-2)
Running "monit restart confd" solves the issue, but it happens again later if we add new stacks
I'm using Traefik on "nginx" services inside stacks (see the nginx service labels in the attached image)
I'm using Rancher 1.5.10 with Traefik catalog "1.2.3-1" (latest version)
I've tested running "curl http://rancher-metadata.rancher.internal" and it works; it returns a response on both hosts
Attached are the log files from the two Traefik instances, along with /opt/tools/confd/log/confd.log from each
Also attached are the health checks configured on nginx; all stacks and services are green
We are using RancherOS on the hosts (deployed on AWS EC2)

One note: the confd log of traefik-1 shows the error "executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist", but traefik-1 is the one that is configured correctly; traefik-2 is the one that is not (not refreshed). I've also checked every Traefik label on the servers and they are exactly the same as the ones attached.

Anyone else with the same? Thanks! Juan

healthchecks

traefik 2 dashboard where test-portal1-14-06 service is not discovered traefik-2-dashboard

traefik 1 dashboard where test-portal1-14-06 service is discovered traefik-1-dashboard

nginx labels nginx-labels

traefik-1-confd.txt traefik-2.txt traefik-1.txt traefik-2-confd.txt

jjscarafia commented 7 years ago

Some more information: I've checked the file /opt/traefik/etc/rules.toml on traefik-1 and traefik-2, and the "test-portal1-14-06" service configuration is present on both. I don't know why Traefik does not reload; perhaps it is related to this?

jjscarafia commented 7 years ago

@rawmind0 any help on this? Any suggestions? Could you please check my post in this issue?

snahelou commented 7 years ago

Check that all of your stacks are green, even the ones with no Traefik tags. When I have errors on a stack, it makes my confd unstable. In your case, it's very strange that one server works and the other doesn't.

Regards

dbsanfte commented 7 years ago

When a container crashes and restarts itself, Traefik correctly removes the container from the pool but doesn't re-add it once it has restarted. I have to manually scale the stack up and down to get Traefik to pick it up. Any ideas?

Considering abandoning this image and going for the native Rancher support in Traefik 1.3 to see if that resolves it.

jjscarafia commented 7 years ago

@dbsanfte, no idea. I've tried evacuating a host, and Traefik updates correctly when new containers are created on other hosts. @snahelou thanks for the response! All my stacks are green.

Some tests I've done, though I'm not sure they are what makes it work now (just in case it helps someone):

Using Ubuntu 16.04 (Docker 1.12.6) for the hosts instead of RancherOS v1.0.2 (Docker 17.03.1-ce) seems to work better, but this is not conclusive yet
As @snahelou suggests here, it seems that if I stop a stack and create new stacks while it is stopped (red), confd gets confused and the Traefik config is not refreshed
Previously I was adding the label "traefik.alias.fqdn" to every service that used Traefik, with an empty value by default and a real value only on the services that needed one. I've deleted this label and kept it only where it is necessary

With no more red stacks and using Ubuntu 16.04, Traefik seems to have been working OK for at least 24 hours

rawmind0 commented 7 years ago

@jjscarafia, your case is very strange.

In your confd log files, the last update should set the rules.toml file to the same content on both. It's very strange that it works on just one server. Are the infrastructure services working well on both?

traefik-2-confd.txt

2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO /opt/traefik/etc/rules.toml has md5sum bf6b2298be0acf958ad37fac08f7180d should be 73983e979b367f06346659a41726824f
2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO Target config /opt/traefik/etc/rules.toml out of sync
2017-06-14T12:43:59Z traefik-traefik-2 /opt/tools/confd/bin/confd[143]: INFO Target config /opt/traefik/etc/rules.toml has been updated

traefik-1-confd.txt

2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO /opt/traefik/etc/rules.toml has md5sum bf6b2298be0acf958ad37fac08f7180d should be 73983e979b367f06346659a41726824f
2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO Target config /opt/traefik/etc/rules.toml out of sync
2017-06-14T12:44:09Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: INFO Target config /opt/traefik/etc/rules.toml has been updated

Is it working well with Ubuntu and Docker 1.12.6?
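For readers unfamiliar with these log lines, here is a minimal Python sketch of the comparison they appear to describe: the checksum of the file on disk is compared with the checksum of the freshly rendered template, and the file counts as out of sync only when the two differ. The function names are illustrative and are not taken from confd.

```python
import hashlib

def md5sum(data: bytes) -> str:
    """Hex MD5 digest, the same form as the values printed in the log."""
    return hashlib.md5(data).hexdigest()

def out_of_sync(current: bytes, rendered: bytes) -> bool:
    """True when the file on disk differs from the newly rendered output."""
    return md5sum(current) != md5sum(rendered)
```

Both logs above report the same "has" and "should be" checksums, which is why the two instances would be expected to end up with identical rules.toml content.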

jjscarafia commented 7 years ago

Hi @rawmind0 and thanks for the comments!

I've just updated all infrastructure services (they showed an available upgrade).
Yes, it seems that it works OK with Ubuntu 16.04 (Docker 1.12.6), but I will give RancherOS another chance and share the results
The only "red" container I have is "rancher-agent-bootstrap", which is only visible on hosts (image attached). Could this be interfering in any way?

@rawmind0 just in case you are available and willing, I can give you access to the Rancher instance; just send me an email at jjs@adhoc.com.ar

seleccion_055

rawmind0 commented 7 years ago

Hi @jjscarafia ...

The strangest thing is that it works on one server and not on the other. Please upgrade the infrastructure services to the latest version.
Rather than RancherOS versus Ubuntu, the problem could be the Docker version, 1.12.6 vs 17.03-1. Thanks for testing and sharing the results, I really appreciate it.
The only "red" containers that could affect Traefik's confd would be those in stacks with traefik.enable=true; these are the only ones confd looks at.

Best regards....
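As a rough illustration of that last point, the selection can be pictured as a label filter. This Python sketch is not the actual template logic: the service dictionaries are a made-up shape, and only the `traefik.enable` label name comes from the comment above.

```python
def traefik_enabled(services):
    """Return names of services whose labels opt in with traefik.enable=true."""
    return [
        svc["name"]
        for svc in services
        if svc.get("labels", {}).get("traefik.enable") == "true"
    ]
```

Under this reading, a red container in a service without the label would never be looked at, so it should not be able to break the rendered rules.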

jjscarafia commented 7 years ago

I've been playing for a while and I can see that:

I could reproduce the error of the Traefik config not updating by stopping stacks (they become red) and creating new stacks with Traefik labels.

During that period the log looks like:

"/stack...>: error calling getv: key does not exist
2017-06-22T21:05:46Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERROR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:01Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERROR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:16Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERROR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist
2017-06-22T21:06:31Z adhoc-traefik-traefik-3 /opt/tools/confd/bin/confd[23]: ERROR template: rules.toml.tmpl:41:34: executing "rules.toml.tmpl" at <getv (printf "/stack...>: error calling getv: key does not exist

After restarting the stopped stacks, the Traefik config was updated automatically again
This doesn't always happen; sometimes I can stop a stack and new stacks are still auto-discovered (I guess it is related to sorting, stack names or something like that)
I haven't yet been able to replicate the error where one Traefik config was updated and the others were not

dbsanfte commented 7 years ago

Moving over to the native Traefik Rancher support resolved my issue with my crashed/auto-restarted Node.js containers not being picked up by this image.

jjscarafia commented 7 years ago

@dbsanfte good to know that and thanks for sharing. Are you also using acme support with native rancher support?

dbsanfte commented 7 years ago

No, we're just defining a plain old SSL cert/key, no ACME.

lasley commented 7 years ago

I just hit this one too. In my case, a host went down which caused some stacks to migrate to another host.

There were some other stacks that were simply stopped because I didn't want them alive at the moment. Traefik did not start updating until I started those stacks as well, which I could then stop at my leisure.

jjscarafia commented 7 years ago

@lasley moving to Traefik's native Rancher support made it work OK for me. If it helps, this is my very ugly rancher-catalog template

adamgraves-choices commented 7 years ago

@jjscarafia I've built something similar using the native rancher templates: https://github.com/nhsuk/traefik-rancher

Unfortunately I've come across a critical bug which stops us using Traefik for now: https://github.com/containous/traefik/issues/1927

jjscarafia commented 7 years ago

@adamgraves-choices thanks for the feedback. It seems that was the issue I faced yesterday...

lasley commented 7 years ago

Honestly I thought I was just screwing up somehow so I wasn't even going to say anything 😆

percosys commented 7 years ago

I am having a similar issue. I was able to get past the error in the log message by setting the environment variable CONF_PREFIX to /latest, which seems to have made confd look at the latest route in the Rancher metadata service rather than the default of /2015-12-19. However, I am still having an issue with the correct rules being written.
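To make the effect of the prefix concrete, here is a small Python sketch of how a prefix and a template key might combine into a metadata URL. It assumes the metadata service exposes the same key paths over HTTP that the template passes to `getv`; the helper name and the stack, service and container names are hypothetical, while the host name, the two prefixes and `CONF_PREFIX` come from this thread.

```python
import os

DEFAULT_PREFIX = "/2015-12-19"  # the default mentioned above

def metadata_url(key, prefix=None, host="rancher-metadata.rancher.internal"):
    """Join host, version prefix and key into a single URL."""
    if prefix is None:
        # Fall back to the environment variable, then to the default.
        prefix = os.environ.get("CONF_PREFIX", DEFAULT_PREFIX)
    return f"http://{host}/{prefix.strip('/')}/{key.lstrip('/')}"
```

For example, `metadata_url("/stacks/web/services/nginx/containers/nginx-1/health_state", prefix="/latest")` points at the `latest` route instead of the dated one, which is a quick way to check by hand whether a key exists under each prefix.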

When confd completes its interval I do in fact see a new /opt/traefik/etc/rules.toml file, but it is missing the URL and backend params shown in the template.

I believe it is skipping over the following block in the template because rancher-metadata has not yet registered that the container is healthy by the time confd finishes writing the new rules.toml.

{{- if eq $back_status "healthy" }}
    [backends.{{$service_name}}__{{$stack_name}}.servers.{{getv (printf "/stacks/%s/services/%s/containers/%s/name" $stack_name $service_name $container)}}]
                {{- if eq $traefik_protocol "https"}}
      url = "{{$traefik_protocol}}://{{getv (printf "/stacks/%s/services/%s/containers/%s/primary_ip" $stack_name $service_name $container) -}}:
                {{- else}}
      url = "http://{{getv (printf "/stacks/%s/services/%s/containers/%s/primary_ip" $stack_name $service_name $container) -}}:
                {{- end -}}
                {{- if exists (printf "/stacks/%s/services/%s/labels/traefik.port" $stack_name $service_name) -}}
                    {{getv (printf "/stacks/%s/services/%s/labels/traefik.port" $stack_name $service_name)}}
                {{- else -}}
                80
                {{- end}}"
      weight = 0
              {{- end -}}
            {{- end -}}

It seems that when confd is triggered to run, it detects a change in the number of stacks in "latest", but if the container is not "healthy" by the time it writes the new rules file, it skips over that part of the template.

My suspicion is that, since the number of stacks doesn't change by the next interval, rules.toml doesn't get updated until the number of stacks changes in Rancher, which could be a long time or even never.

If my suspicion is correct, is there a better method of updating rules.toml than counting the number of stacks in Rancher?

I do have health checks configured on all my stacks so I am not sure how to move forward.

Once again assuming that confd is only looking for a change in the number of stacks in the environment, I see 3 possible solutions.

Somehow hold back the confd process from completing until all services are healthy. This might not be desirable, since any service in the environment could be unhealthy during an execution, causing the process to never complete.
Have a second "nested" key in the rules.toml.toml file that somehow dynamically checks the individual health of each container before executing rules.toml.tmpl. This also seems like it could break down, similar to option one, if some containers in the environment are never healthy.
Rewrite rules.toml on an interval regardless of changes to the stacks, so that on a predictable timeline rules.toml is updated with every healthy container.
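The hypothesis and the third option can be compared in a small Python simulation. This is a sketch under the assumption stated above, namely that re-rendering is triggered only by a change in the number of stacks; that behaviour is a suspicion in this thread, not confirmed confd behaviour. All names and data below are made up.

```python
import hashlib

def render(state):
    """Emit one backend line per healthy container, skipping the rest."""
    lines = []
    for stack, containers in sorted(state.items()):
        for name, health in sorted(containers.items()):
            if health == "healthy":
                lines.append(f"[backends.{stack}.servers.{name}]")
    return "\n".join(lines)

def run(snapshots, trigger):
    """Feed successive metadata snapshots through a trigger function.

    The file is rewritten only when the trigger token changes; the final
    file content is returned.
    """
    written, last = "", None
    for state in snapshots:
        token = trigger(state)
        if token != last:
            written, last = render(state), token
    return written

def by_stack_count(state):
    """Hypothesised trigger: only the number of stacks matters."""
    return len(state)

def by_rendered_content(state):
    """Alternative: render every interval and compare the output checksum."""
    return hashlib.md5(render(state).encode()).hexdigest()

snapshots = [
    {"web": {"web-1": "initializing"}},  # stack appears, not yet healthy
    {"web": {"web-1": "healthy"}},       # same stack count, now healthy
]
```

With these two snapshots, `run(snapshots, by_stack_count)` leaves the file empty because the count never changes after the first write, while `run(snapshots, by_rendered_content)` picks up the container once it turns healthy.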

lasley commented 7 years ago

@alexisaperez - Regarding confd - I think that it's a dumb implementation & simply rewrites the rules every X units of time.

The reasoning behind this assertion is that when I make the comma change in #51, it's just a few seconds until the rule is updated in Traefik. I'm definitely no confd expert though, so it's possible it's noticing the change in the rules file itself and triggering the update.

percosys commented 7 years ago

@lasley I thought that at first as well, but in my testing it seems that rules.toml only gets updated when the number of stacks in the environment changes. I am also not a confd expert; it is just what I observed. I think one way that might solve the issue, for my environment at least, would be to change the key in the rules.toml.toml from /stacks to /containers, but I will have to report back on whether that's feasible.

adepretis commented 7 years ago

I'm also having the same problem, with frontends/backends not getting updated although everything is green and healthy. The confd.log shows plenty of:

2017-10-12T12:55:53Z traefik-traefik-1 /opt/tools/confd/bin/confd[24]: ERROR template: traefik.crt.tmpl:1:20: executing "traefik.crt.tmpl" at <getv "/traefik/ssl_c...>: error calling getv: key does not exist

rawmind0 commented 7 years ago

Hi all,

As of alpine-traefik release 1.4.0-3, Traefik's built-in Rancher integration is supported, both metadata and API. The community-catalog is already updated as well. Three Rancher integrations are now available: metadata and API (both Traefik built-in), or external (rancher-traefik).

Take into account that the labels are different with Traefik's built-in integration (https://docs.traefik.io/configuration/backends/rancher/#labels-overriding-default-behaviour). Metadata with long polling is the preferred integration; it works really well. :)

Also, I made a PR, now merged, that refactors the Rancher integration and will be included in the next Traefik release: https://github.com/containous/traefik/pull/2291

Best regards...

jjscarafia commented 7 years ago

Great news, great work! Thanks for the update!

rawmind0 commented 6 years ago

Hi all,

rancher-traefik has been updated to use rancher-template instead of confd, to get immediate updates from metadata. The Traefik external integration uses it.

Best regards...
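The difference between interval polling and immediate updates can be sketched as a change-driven loop. This Python sketch is only an illustration of the general idea, not how rancher-template is implemented: `fetch_version` is a hypothetical callable that blocks until the metadata version differs from the one passed in, and `render` stands for whatever rewrites the rules.

```python
def follow_changes(fetch_version, render, rounds):
    """Re-render once per version change reported by a blocking fetch.

    fetch_version(current) is a hypothetical callable: it blocks until the
    metadata version differs from `current` (or a timeout elapses) and then
    returns the version it saw. Returns the number of renders performed.
    """
    current = None
    renders = 0
    for _ in range(rounds):
        new = fetch_version(current)
        if new != current:  # a timeout may hand back the same version
            render(new)
            renders += 1
            current = new
    return renders
```

The design point is that the render is driven by the metadata changing rather than by a timer, so a container turning healthy is picked up as soon as the version moves.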

rawmind0 / rancher-traefik

Traefik not updating config #42