vfarcic / docker-flow-proxy

Docker Flow Proxy
http://proxy.dockerflow.com/

Timeout and Retry? #190

Closed patran closed 7 years ago

patran commented 7 years ago

I cannot fully reproduce it, but at times, if a service takes a long time (several minutes) to start up, the proxy does not always re-configure/re-generate haproxy.cfg. A docker restart proxy does produce a correct haproxy.cfg.

I have observed 2 situations:

  1. Front-end generated, but no back-end.
  2. Neither front-end nor back-end generated.

Is there a default timeout somewhere? If there is a timeout or max retries, would it be possible to have a default but allow user override?

vfarcic commented 7 years ago

First a bit of background...

Docker Swarm creates an entry in its service discovery as soon as a service is created, even if the service will not be operational until minutes later (after the image is pulled and the application inside the container has started). There is no easy way to find out the status of that service from the Docker API and, even if there were, it would fluctuate a lot. So, Swarm Listener picks up information about the new service and sends the reconfigure request to the proxy. Since the service might not be running at that time, it retries until it receives an OK from the proxy or exceeds the maximum number of retries.

I think you're looking for the environment variables DF_RETRY and DF_RETRY_INTERVAL of the Swarm Listener config. They control how many times it retries the reconfigure request and the interval between retries.

Please try it out and let me know if that's what you're looking for.
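
For reference, a minimal sketch of where those variables live, assuming a compose-style Swarm Listener service definition (the image reference and values below are illustrative, not defaults):

```yaml
# Illustrative fragment only; the service name, image reference, and
# values are examples of where DF_RETRY and DF_RETRY_INTERVAL are set.
  swarm-listener:
    image: vfarcic/docker-flow-swarm-listener
    environment:
      - DF_RETRY=100          # maximum number of reconfigure retries
      - DF_RETRY_INTERVAL=5   # seconds to wait between retries
```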

vfarcic commented 7 years ago

The latest DFP release changed the way validations are done. Please pull the latest release and try it out. It should remove this problem without the need to increase retry attempts and/or interval.

I'll close the issue. Feel free to reopen it if the problem persists.

patran commented 7 years ago

@vfarcic, let's reopen this one. It does not work with gitlab, which could take 5+ minutes before being ready to provide service. Well, 5 minutes on my system :)

  1. Start proxy
  2. Start gitlab
  3. Wait for gitlab to come up fully (check with docker ps -a that it is healthy, and check docker logs)
  4. Go to gitlab URL -- server not ready
  5. Start another service A
  6. Service A comes up fine
  7. Go to gitlab URL -- gitlab now works

https://docs.gitlab.com/omnibus/docker/README.html

docker compose file for gitlab below

version: '3.1'

networks:
  qwerty_prod_reverse_proxy:
    external: true
  dvorak_prod_gitlab:
    external: true

volumes:
    gitlab_gitlab_etc_gitlab:
    gitlab_gitlab_log_gitlab:
    gitlab_gitlab_opt_gitlab:

services:
  gitlab_gitlab:
    image: "gitlab/gitlab-ee:9.0.5-ee.0"
    ports:
      - '22'
      - '80'
      - '443'
    networks:
      qwerty_prod_reverse_proxy:
      dvorak_prod_gitlab:
    volumes:
      - gitlab_gitlab_etc_gitlab:/etc/gitlab
      - gitlab_gitlab_log_gitlab:/var/log/gitlab
      - gitlab_gitlab_opt_gitlab:/var/opt/gitlab
    deploy:
      labels:
        com.df.notify: "true"
        com.df.distribute: "true"
        com.df.serviceDomain: "gitlab.abc.def.com"
        com.df.servicePath: "/"
        com.df.port: "80"
        com.df.setHeader: "X-Forwarded-Port %[dst_port]"
        com.df.addHeader: "X-Forwarded-Ssl on if { ssl_fc }, X-Forwarded-Proto https if { ssl_fc }, X-Forwarded-Protocol https if { ssl_fc }, X-Url-Scheme https if { ssl_fc }"

      resources:
        limits:
          cpus: "0.000"
          memory: "16g"
        reservations:
          cpus: "0.000"
          memory: "16g"

      mode: "replicated"
      replicas: 1
      update_config:
        parallelism: 1
        delay: "60s"

      placement:
        constraints:
          - node.labels.dvorak_prod_gitlab_gitlab == yes

    environment:
      GITLAB_OMNIBUS_CONFIG: |
        external_url 'https://gitlab.abc.def.com'
        # Add any other gitlab.rb configuration here, each on its own line
        #nginx['redirect_http_to_https'] = false
        #nginx['ssl_certificate'] = "/etc/gitlab/ssl/notinuse.com.crt"
        #nginx['ssl_certificate_key'] = "/etc/gitlab/ssl/notinuse.com.key"
        nginx['listen_port'] = 80
        nginx['listen_https'] = false
        nginx['proxy_set_headers'] = { "X-Forwarded-Proto" => "https", "X-Forwarded-Ssl" => "on" }
        gitlab_rails['lfs_enabled'] = true
        gitlab_rails['gitlab_email_from'] = "gitlab@gitlab.abc.def.com"

      DOCKER_SERVICE_NAME: "{{.Service.Name}}"
      DOCKER_SERVICE_ID: "{{.Service.ID}}"
      DOCKER_SERVICE_LABELS: "{{.Service.Labels}}"
      DOCKER_NODE_ID: "{{.Node.ID}}"
      DOCKER_TASK_ID: "{{.Task.ID}}"
      DOCKER_TASK_NAME: "{{.Task.Name}}"
      DOCKER_TASK_SLOT: "{{.Task.Slot}}"

vfarcic commented 7 years ago

Sorry for not responding earlier. DockerCon finished and I'm about to go back home. I'll take a look at this issue on Monday. I hope that's not too late.

vfarcic commented 7 years ago

@patran Can you confirm that DF_RETRY and DF_RETRY_INTERVAL are longer (when multiplied) than the time it takes to pull and initialize GitLab?

patran commented 7 years ago

@vfarcic, confirmed.

DF_RETRY_INTERVAL - tested with 5s and 7s; both worked as expected. DF_RETRY - set to 400. I did not count the retries, but the desired end effect -- supporting apps that take 5+ minutes to become ready -- was achieved.

Tested with gitlab getting pulled and initialized.

Btw, I had interpreted one of your comments to mean that apps such as gitlab would be detected by the proxy and work properly even without having to specify DF_RETRY/DF_RETRY_INTERVAL. Just to confirm: without the retry settings, I could not get the proxy to detect gitlab reliably and reconfigure haproxy properly.

vfarcic commented 7 years ago

The proxy has defaults that work correctly in most (but not all) cases. Normally, it should not take more than a couple of seconds to pull an image and create containers. By default, Swarm Listener repeats a request fifty times with a five-second pause between each, which adds up to a little over four minutes. Such a default is more than enough in most cases.

The reason there is a maximum number of retries is to avoid a never-ending loop. One might create a service that never initializes. In such a case, without a maximum number of retries, Swarm Listener would loop forever.

It's not that the proxy could not detect GitLab. The problem is that GitLab (combined with your probably slow bandwidth) takes too long to pull, so the endpoint (DNS entry) created by the Docker overlay network was delayed quite a lot. As a result, the proxy concluded that the service did not exist.

I'm not sure whether I managed to explain the logic behind it. Please let me know if I didn't and I'll try to be more descriptive.

I'd be more than happy to improve the code if you have a suggestion.

patran commented 7 years ago

The DF_RETRY mechanism and the associated algorithm make perfect sense. Along with the ability to instruct the proxy to reload, I think situations such as slow bandwidth, temporary network partitions, longer periods of communication impairment, traffic overload, etc. are well covered.

Btw, for a given instance of the proxy, could you help me understand the design behind how haproxy.cfg gets updated? I am primarily interested in whether there could ever be a situation where the haproxy.cfg of a given proxy instance is updated simultaneously by multiple "threads". Thanks...

vfarcic commented 7 years ago

haproxy.cfg gets updated on every reconfigure or remove request. Each request is handled in a separate goroutine as a way to avoid bottlenecks. However, the function that writes the file is synchronized so that only one write can happen at any given moment, avoiding potential corruption if multiple writes happen at the same time. In other words, request handling runs as multiple goroutines, but writing the config is serialized.

Please let me know if I explained it well. If not, I'll get back to you with a more detailed description and/or relevant parts of the code.

I think this ticket can be closed. Feel free to reopen it if you disagree.

drozzy commented 6 years ago

I'm facing this issue when I deploy a "bad" service. By the time I fix it and try to push the new build through the pipeline, the proxy listener gives me this error:

"Max retries exceeded with url"

In my case, it is the nib0r/docker-flow-proxy-letsencrypt service that gets this error...

vfarcic commented 6 years ago

@drozzy Can you confirm that you're using dockerflow/docker-flow-proxy and not the one from this project? We moved it from vfarcic to dockerflow a while ago.

drozzy commented 6 years ago

@vfarcic this issue has gone away. In general, I found the proxy to be working correctly, so ignore my earlier report.

Yes, I am using the new Docker Flow Proxy now.