hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

[Feature request] Canary-only services #8073

Open urusha opened 4 years ago

urusha commented 4 years ago

It would be nice if we had the ability to create temporary (canary-only) services. Our use case is as follows: we create two services for Traefik 2 routing. One is for canaries (routed by the `X-Canary-Flag` header), the other is for promoted tasks (without the header). This works fine, but after promotion the canary service is no longer needed, yet it stays unrouted until the next deployment and still requires Consul checks (the actual HTTP port in the example below is checked twice, by both the canary and non-canary Consul service checks). In effect, all Consul checks for services deployed this way are doubled. So it would be great to have a `canary` (bool, default `false`) service parameter implementing this.

This is somewhat related to https://github.com/hashicorp/nomad/issues/2920#issuecomment-353386502

    service {
        name = "${NOMAD_META_service_name}"
        port = "http"

        tags = [
          "traefik.enable=true",
          "traefik.http.services.${NOMAD_META_service_name}.loadbalancer.healthcheck.path=/status/",
          "traefik.http.routers.${NOMAD_META_service_name}.entrypoints=web",
          "traefik.http.routers.${NOMAD_META_service_name}.rule=Host(${NOMAD_META_http_hosts})"
        ]
        canary_tags = ["deploy"]

        check {
          type = "http"
          path = "/status/"
        }
      }

      service {
//      canary = true
//      (not implemented: the canary-only service should be destroyed after promotion)
        name = "${NOMAD_META_service_name}-canary"
        port = "http"

        tags = ["deploy"]
        canary_tags = [
          "traefik.enable=true",
          "traefik.http.services.${NOMAD_META_service_name}-canary.loadbalancer.healthcheck.path=/status/",
          "traefik.http.routers.${NOMAD_META_service_name}-canary.entrypoints=web",
          "traefik.http.routers.${NOMAD_META_service_name}-canary.rule=Host(${NOMAD_META_http_hosts})&&Headers(`X-Canary-Flag`,`1`)"
        ]

        check {
          type = "http"
          path = "/status/"
        }
      }
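
For completeness, the `${NOMAD_META_*}` variables in the snippets above assume a `meta` stanza at the job (or group) level. A minimal sketch, with illustrative values that are not from the original job:

```hcl
job "example" {
  meta {
    # Hypothetical values; the real job defines its own.
    service_name = "myapp"
    # Traefik v2 Host() rules expect backtick-quoted domains,
    # so the backticks are included in the meta value itself.
    http_hosts = "`myapp.example.com`"
  }
}
```

Nomad interpolates these as `${NOMAD_META_service_name}` and `${NOMAD_META_http_hosts}` in the service tags.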
breathe commented 2 years ago

I have what I think is a similar issue using nomad, consul, and traefik together. I created a ticket on the traefik repo to describe my problem: https://github.com/traefik/traefik/issues/8987

The issue I have, and I think the same issue here, stems from the fact that traefik sees the canary and non-canary instances as the same 'traefik service'. The routing policy can be defined independently for canary and non-canary instances via `canary_tags`/`tags`, but the instances are registered in such a way that traefik sees only one service consisting of all instances of both types. This means you can't define a routing policy in traefik that routes only to canary or only to non-canary instances.

The feature requested above might provide one workaround, but it seems somewhat inelegant to have to define two separate services. I would rather have a way to persuade nomad/traefik to present the canary instances as a separate 'traefik service' automatically, without having to define two services for this.

urusha commented 2 years ago

@breathe We are using this scheme in production, with one important note: all routing settings (like the `Host()` rule) in `traefik.http.routers.router-name.*` of the main (non-canary) service must stay constant across both versions of the job. If they differ, traefik will drop any router that has more than one unique configuration, e.g. a modified `Host()`. This happens because promotion is not instant, especially with a large number of task instances and nomad nodes; it can actually be reproduced on a single nomad node with nginx tasks, `count=3` and `canary=3`, and traefik's consul `refreshInterval: 1s`.

One way to fix this is to add a job version identifier to the router's name, like this:

    traefik.http.routers.${NOMAD_META_service_name}-${NOMAD_META_deploy_number}.rule=Host(${NOMAD_META_http_hosts})

That way, the old and the new task versions use different routers pointing to the same service.

Regarding https://github.com/traefik/traefik/issues/8987, I think it is a really nice idea to derive the final traefik service name by modifying the consul service name with a suffix from a tag, or even replacing the full service name with the value from the tag.
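
The versioned-router workaround above can be sketched as a service stanza. This is a sketch under the assumption that a `deploy_number` meta key is bumped on every job version; the names follow the earlier example:

```hcl
service {
  name = "${NOMAD_META_service_name}"
  port = "http"

  tags = [
    "traefik.enable=true",
    "traefik.http.services.${NOMAD_META_service_name}.loadbalancer.healthcheck.path=/status/",
    # The router name includes the deploy number (assumed meta key), so the
    # old and new job versions register distinct routers that point at the
    # same traefik service, and neither router ever sees two conflicting
    # configurations during the promotion window.
    "traefik.http.routers.${NOMAD_META_service_name}-${NOMAD_META_deploy_number}.entrypoints=web",
    "traefik.http.routers.${NOMAD_META_service_name}-${NOMAD_META_deploy_number}.rule=Host(${NOMAD_META_http_hosts})"
  ]

  check {
    type = "http"
    path = "/status/"
  }
}
```

Once the old allocations are gone, their versioned routers disappear from consul and traefik drops them automatically.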