mesosphere / marathon-lb

Marathon-lb is a service discovery & load balancing tool for DC/OS
Apache License 2.0

Can marathon-lb be deployed on non-"slave_public" mesos nodes? #343

Closed tomasfse closed 7 years ago

tomasfse commented 7 years ago

Hello,

I am having problems deploying an internal marathon-lb on Mesos slaves that are not slave_public; it always gets stuck in the "Waiting" state. I can see that other people have got it working, and there are docs about this scenario in this GitHub repository and in the DC/OS documentation (https://dcos.io/docs/1.8/usage/service-discovery/marathon-lb/usage/). I can only think that I'm missing some info.

This guy (@dariusjs) has a very similar scenario:

https://github.com/mesosphere/marathon-lb/issues/263

However, he managed to solve the issue with constraints. I suspect that all the nodes where he is deploying marathon-lb have the "slave_public" role, but I'm not sure.

I can deploy marathon-lb the "external" way without problems and it is fully functional; the problem comes when I deploy the package with the "internal" options, with no role (or the "*" role):

$ dcos package install --options=options.json marathon-lb
{
  "marathon-lb":{
    "name":"marathon-lb-internal",
    "haproxy-group":"internal",
    "bind-http-https":false,
    "role":""
  }
}

This won't work in any way. There are two slave nodes: one has the "slave_public" role and the other has the default role ("*"). No other services are running. This has been reproduced by two colleagues in their own DC/OS clusters. We are using a brand-new DC/OS 1.8.5 with marathon-lb 1.4.1.

If I leave the "role" field empty, when I edit the service after installing it I can see that Marathon inserts "acceptedResourceRoles": ["slave_public"] by default.
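
For reference, the relevant fragment of the generated app definition would look something like this (a sketch; the app id follows from the "name" above, and all other fields are omitted):

{
  "id": "/marathon-lb-internal",
  "acceptedResourceRoles": ["slave_public"]
}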

If I set "role" to "*", the marathon-lb-internal app gets stuck in the "Waiting" state forever; there are no deployment attempts or logs, and the Mesos master doesn't seem to register its existence on the resources page.

It seems like some confusion with the slave roles is preventing the app from allocating resources, or maybe I'm trying to use marathon-lb for something it wasn't designed for and I'm not reading the docs correctly.

So, can marathon-lb be deployed on slaves that are not slave_public? Do the non-public Mesos slaves need to be tagged in a special way?

Thanks

brndnmtthws commented 7 years ago

What you're doing should indeed work. Can you confirm that you have enough CPU, memory, and ports available? Here's the list of default ports: https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/M/marathon-lb/13/marathon.json.mustache#L80-L182
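
One way to verify what each agent is actually offering is the master's state summary (a sketch; assumes jq is installed and the Mesos master is reachable directly on port 5050):

$ curl -s http://<mesos-master>:5050/state-summary \
    | jq '.slaves[] | {hostname, resources, used_resources}'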

tomasfse commented 7 years ago

@brndnmtthws

Resources   CPUs   GPUs   Mem      Disk
Total       4      0      9.3 GB   50.7 GB
Used        0      0      0 B      0 B
Offered     0      0      0 B      0 B
Idle        4      0      9.3 GB   50.7 GB

The two nodes are completely free, with 2 CPUs and 6 GB of RAM each.

The slaves' configuration is the default after the DC/OS 1.8.5 installation, and no ports are in use other than the installation defaults.

I also tried destroying marathon-lb-external before deploying marathon-lb-internal, but nothing changed. It seems to be a problem deploying to nodes without the slave_public role, though I'm not sure if that makes sense.

Is there something I can trace to catch the error?
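
One place to start tracing is Marathon's launch queue endpoint, which lists apps waiting for matching offers. A sketch via the DC/OS admin router, with the endpoint path and token handling assumed for DC/OS 1.8:

$ curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
    "$(dcos config show core.dcos_url)/service/marathon/v2/queue" | jq .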

Thanks

dariusjs commented 7 years ago

Can you post the Marathon config, or sections of it? I definitely did not have problems putting it on non-public slaves; I only wanted to pin it to certain hosts with a Mesos attribute, and the cluster option helped me.

tomasfse commented 7 years ago

@dariusjs do you mean the marathon-lb-internal app config, or the global Marathon config?

brndnmtthws commented 7 years ago

I've tested your options.json, and everything works correctly for me. There must be something peculiar about your setup.

Did you at any point run any stateful frameworks like Cassandra or Kafka?

mvanholsteijn commented 7 years ago

I ran into this on DC/OS too. Only the public_master mesos machines offer ports 80 and 443, so that is where marathon-lb can be deployed.

You need to add these ports to the mesos slave resources along the following lines:

MESOS_RESOURCES='ports(*):[80-80,443-443,... other port ranges...]'

I have not found an easy way of doing that in the standard DC/OS installation.
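
A rough sketch of how one might do it on a DC/OS 1.8 agent (paths, unit name, and port ranges are assumptions based on the standard install, not verified):

# Append the extra port ranges to the agent's environment file (assumed location):
echo "MESOS_RESOURCES='ports(*):[80-80,443-443,1025-32000]'" \
    | sudo tee -a /var/lib/dcos/mesos-slave-common
# The agent refuses to start when its checkpointed resources change, so clear the checkpoint:
sudo rm -f /var/lib/mesos/slave/meta/slaves/latest
sudo systemctl restart dcos-mesos-slave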

dariusjs commented 7 years ago

@mvanholsteijn The system is flexible, so you can do many things. But it's not a necessity, at least for DC/OS v1.7.

In our environments I've done this:

$ cat /var/lib/dcos/mesos-slave-common
MESOS_ATTRIBUTES=category:internallb

Then, when deploying marathon-lb-internal, I apply this to the Marathon task definition:

"constraints": [
  [ "category", "LIKE", "internallb" ],
  [ "hostname", "UNIQUE" ]
],

This is on an internal node that has the role file /etc/mesosphere/roles/slave.
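
To confirm the attribute was picked up after restarting the agent, the master's /slaves endpoint shows it (a sketch; assumes jq and direct access to the master on port 5050):

$ curl -s http://<mesos-master>:5050/slaves \
    | jq '.slaves[] | {hostname, attributes}'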

brndnmtthws commented 7 years ago

@mvanholsteijn

I ran into this on DC/OS too. Only the public_master mesos machines offer ports 80 and 443, so that is where marathon-lb can be deployed.

That's not quite correct. While it may solve the problem for you, it's neither required nor considered good practice. Furthermore, there is no such thing as "public_master mesos machines".

There's a flag in the package for accomplishing it: https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/M/marathon-lb/13/marathon.json.mustache#L76-L79

If you follow the instructions, and use the same options.json as above, it should work correctly.
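
For reference, that flag appears in options.json as "bind-http-https"; an external deployment that actually binds 80/443 would use something along these lines (a sketch with commonly used values, not verified package defaults):

{
  "marathon-lb": {
    "name": "marathon-lb",
    "haproxy-group": "external",
    "bind-http-https": true,
    "role": "slave_public"
  }
}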

tomasfse commented 7 years ago

Thanks for all the replies. It is really appreciated.

@mvanholsteijn your info helped me run some more tests changing MESOS_RESOURCES, and I finally know what is happening...

In my scenario it is a combination of the 80/443 port binding and the CPU and memory resource allocation.

This does not work (by default, marathon-lb requests 2 CPUs):

{
  "marathon-lb":{
    "name":"marathon-lb-internal",
    "haproxy-group":"internal",
    "bind-http-https":false,
    "role":""
  }
}

But this works:

{
  "marathon-lb":{
    "name":"marathon-lb-internal",
    "haproxy-group":"internal",
    "bind-http-https":false,
    "role":"",
    "cpus": 1,
    "mem": 512
  }
}
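
Once it deploys, the resulting app definition and its actual resource requests can be inspected with the DC/OS CLI (app id assumed from the "name" above):

$ dcos marathon app show marathon-lb-internal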

So, even though my slave has 2 CPUs available, it seems that is not entirely true, and marathon-lb cannot allocate those 2 CPUs. (Is this something expected?) On the slave_public node this problem does not occur.

On the other hand, in my first tests before posting here, I only reduced the CPU and memory of marathon-lb while "bind-http-https" was still set to true, and that is why my app didn't deploy. When I added the "80-80,443-443" ranges to MESOS_RESOURCES, with appropriate CPU and memory resources defined, marathon-lb-internal started to work.

I still don't know why a service cannot allocate the 2 CPUs of a Mesos slave if they appear to be available, but that is a question for another issue.

Thank you very much for the help.

mvanholsteijn commented 7 years ago

@protheantom Marathon will show your application in the 'Waiting' state when there are insufficient resources to run your load balancer. Most commonly this means there is insufficient CPU or memory, or the required ports are unavailable.

Check your Mesos master and slave configuration and you will probably find that the Mesos slave is offering slightly less than 2 free CPUs.

brndnmtthws commented 7 years ago

Let me clarify some of the misinformation:

When you launch an app on Marathon, the amount of CPU and memory required equals the CPU for the app plus the CPU for the executor, and the memory for the app plus the memory for the executor.

It's like:

total_cpu = cpu(app) + cpu(exec)
total_mem = mem(app) + mem(exec)

By default, Marathon uses 0.1 CPUs and 32 MiB of memory for the executor (if I recall correctly; I can't find the docs).

So, an app configured with 2 CPUs and 2048 MiB of memory will require 2.1 CPUs and 2080 MiB of memory.
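
Applied to the numbers in this thread (2-CPU nodes, and marathon-lb's package default of 2 CPUs):

total_cpu = 2.0 (marathon-lb default) + 0.1 (executor) = 2.1 CPUs
offered   = 2.0 CPUs per node, so no offer ever matches and the app stays "Waiting"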

Happy computering!