tendrilinc / marathon-autoscaler

A simple autoscaler for Marathon applications
https://hub.docker.com/r/tendril/marathon-autoscaler/
Apache License 2.0
41 stars 16 forks source link

Scaler not working #14

Open cookandy opened 7 years ago

cookandy commented 7 years ago

Hi,

Thanks for the project - and thanks for the many enhancements recently. I'm having trouble getting the autoscaler working. I just built the docker image and ran it with:

docker run -e MESOS_URI=http://x.x.x.x:5050 -e MARATHON_URI=http://x.x.x.x:8080 -e AGENT_PORT=5051 marathon-autoscaler

From the startup logs, it looks like everything is running ok:

2016-10-21 17:09:26,543 INFO supervisord started with pid 1
2016-10-21 17:09:27,547 INFO spawned: 'marathon_autoscaler' with pid 7
2016-10-21 17:09:28,142 | INFO | Namespace(agent_port=5051, cpu_fan_out=None, datadog_api_key=None, datadog_app_key=None, datadog_env=None, enforce_version_match=False, log_config='/app/logging_config.json', marathon_pass=None, marathon_uri='http://x.x.x.x:8080', marathon_user=None, mesos_uri='http://x.x.x.x:5050', rules_prefix='mas_rule', sleep_interval=5)  (<module>:92)
2016-10-21 17:09:28,144 | INFO | Mesos and Marathon Connections Established.  (start:157)
2016-10-21 17:09:29,147 INFO success: marathon_autoscaler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-10-21 17:09:34,543 | INFO | Stats differentials collected.  (poll:86)
2016-10-21 17:09:34,544 | INFO | Decision process beginning.  (decide:56)
2016-10-21 17:09:34,545 | INFO | Decisions are completed.  (start:165)
2016-10-21 17:09:45,836 | INFO | Stats differentials collected.  (poll:86)
2016-10-21 17:09:45,837 | INFO | Decision process beginning.  (decide:56)
2016-10-21 17:09:45,837 | INFO | Decisions are completed.  (start:165)
2016-10-21 17:09:57,089 | INFO | Stats differentials collected.  (poll:86)

However, the app never scales under load with the following labels (using small durations for testing):

  "labels": {
  "use_marathon_autoscaler": "true",
  "min_instances": "1",
  "max_instances": "10",
  "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
  "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
  }

Do you see any reason why this wouldn't work?

cookandy commented 7 years ago

I also wanted to mention that I am using Mesos 1.0.1 and Marathon 1.3.2.

kernelpanek-segfault commented 7 years ago

Change your use_marathon_autoscaler:

"labels": {
  "use_marathon_autoscaler": "0.0.3",
  "min_instances": "1",
  "max_instances": "10",
  "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
  "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
}

TL;DR; We originally used use_marathon_autoscaler as just a boolean to check for an application's participation. Later, when we began improving the thresholds capabilities to be aware of the newer format you're using, we look for the version number where we introduced the newer thresholds format. We still accept a boolean value, but the autoscaler assumes you're using the older thresholds format if you use true. This was done so original applications could keep their older format and the autoscaler would be backward compatible.

Just for reference: We have not yet moved our Mesos clusters to that new version you are running, but in the future you could place a modified logging_config.json on your Marathon agent hosts, put a volume in the app definition for the autoscaler that points to this json file and run the autoscaler with the LOG_CONFIG environment variable set to the container path to this modified logging configuration.

In the logging_config.json, set /root/level to DEBUG:

    "root": {
        "level": "DEBUG",
        "handlers": ["console"]
    }

This will give you more insight with the decision making process and I believe we also remark on which version of the thresholds it plans to look for based on the use_marathon_autoscaler value.

cookandy commented 7 years ago

hi @kernelpanek-tendril - thanks for the response.

I enabled the DEBUG logging and changed the labels as you suggested, but I'm not getting much back. Perhaps my version of Mesos?

2016-10-27 20:54:09,694 | INFO | Namespace(agent_port=5051, cpu_fan_out=None, datadog_api_key=None, datadog_app_key=None, datadog_env=None, enforce_version_match=False, log_config='/logging-config.json', marathon_pass=None, marathon_uri='http://10.10.22.239:8080', marathon_user=None, mesos_uri='http://10.10.22.239:5050', rules_prefix='mas_rule', sleep_interval=5)
2016-10-27 20:54:09,695 | INFO | Mesos and Marathon Connections Established.
2016-10-27 20:54:09,811 | DEBUG | (u'10.10.23.59', [])
2016-10-27 20:54:09,847 | DEBUG | (u'10.10.15.87', [])
2016-10-27 20:54:09,914 | DEBUG | (u'10.10.26.172', [])
2016-10-27 20:54:09,935 | DEBUG | (u'10.10.26.178', [])
2016-10-27 20:54:10,938 INFO success: marathon_autoscaler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-10-27 20:54:13,015 | DEBUG | (u'10.10.23.59', [])
2016-10-27 20:54:13,065 | DEBUG | (u'10.10.15.87', [])
2016-10-27 20:54:13,147 | DEBUG | (u'10.10.26.178', [])
2016-10-27 20:54:13,170 | DEBUG | (u'10.10.26.172', [])
2016-10-27 20:54:16,174 | INFO | Stats differentials collected.
2016-10-27 20:54:16,175 | INFO | Decision process beginning.
2016-10-27 20:54:16,175 | INFO | Decisions are completed.
2016-10-27 20:54:21,280 | DEBUG | (u'10.10.23.59', [])
2016-10-27 20:54:21,313 | DEBUG | (u'10.10.15.87', [])
2016-10-27 20:54:21,391 | DEBUG | (u'10.10.26.172', [])
2016-10-27 20:54:21,493 | DEBUG | (u'10.10.26.178', [])
2016-10-27 20:54:24,571 | DEBUG | (u'10.10.23.59', [])
2016-10-27 20:54:24,611 | DEBUG | (u'10.10.15.87', [])
2016-10-27 20:54:24,651 | DEBUG | (u'10.10.26.172', [])
2016-10-27 20:54:24,721 | DEBUG | (u'10.10.26.178', [])

it continues on like that, without ever changing...

kernelpanek-segfault commented 7 years ago

We don't have that combination of Mesos/Marathon to test with, but based on what I see in your log (rather what I don't see) is the agent data... Each agent is coming back with nothing for its statistics. Each of those new DEBUG lines is showing an empty list for the call when it's querying the /monitor/statistics path of each agent.

If you can ssh to the agent where the autoscaler is running, try to curl the path on one of the other agents:

curl http://10.10.23.59:5051/monitor/statistics

Does it timeout, respond with something other than HTTP 200, or simply give back no data? My instincts tell me that we grossly underestimated the number of ways we can all configure Mesos and Marathon and our code does not accommodate for all of those possibilities, yet.

cookandy commented 7 years ago

Looks like I get a 200, but nothing there:

root@03:~# curl -v http://10.10.23.59:5051/monitor/statistics
*   Trying 10.10.23.59...
* Connected to 10.10.23.59 (10.10.23.59) port 5051 (#0)
> GET /monitor/statistics HTTP/1.1
> Host: 10.134.15.87:5051
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 28 Oct 2016 19:11:21 GMT
< Content-Length: 2
< Content-Type: application/json
<
* Connection #0 to host 10.134.15.87 left intact
[]

Just []

kernelpanek commented 7 years ago

Do you run multiple instances of Mesos agents on the same hosts where the path would be?:

curl -v http://10.10.23.59:5051/slave(1)/monitor/statistics
cookandy commented 7 years ago

No, just a single Mesos agent per host.

root@03:~# curl -v "http://10.10.23.59:5051/slave(1)/monitor/statistics"
*   Trying 10.10.23.59...
* Connected to 10.10.23.59 (10.10.23.59) port 5051 (#0)
> GET /slave(1)/monitor/statistics HTTP/1.1
> Host: 10.134.23.59:5051
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sat, 29 Oct 2016 04:05:54 GMT
< Content-Length: 2
< Content-Type: application/json
<
* Connection #0 to host 10.134.23.59 left intact
[]
kernelpanek commented 7 years ago

I'm not sure why your agents are not posting their executor stats, but I'm curious enough to find out how Mesos ends up in the configuration where they stop reporting executor statistics.

Just out of curiousity, does your Mesos master web UI report the CPU and memory usage of executors? When I wrote this app, I used the same approach the (web UI) Javascript takes to calculate CPU and memory usage.

cookandy commented 7 years ago

hi @kernelpanek

I'm happy to try and help troubleshoot... looks like the Mesos agents are reporting the correct number of CPUs and Memory:

screen shot 2016-10-28 at 9 42 26 pm
kernelpanek commented 7 years ago

Click on one of the agent ID links and that should bring up the page that I'm thinking does dynamic updates on CPU/memory utilization. The utilization shows on the left-hand side of the page.

cookandy commented 7 years ago

oh you're right - weird! the resource usage shows 0

screen shot 2016-10-28 at 9 56 04 pm
kernelpanek commented 7 years ago

Are you running on your own physical hardware or VMs? I'm grasping at straws here, but I'm not sure how the new resource isolators work in Mesos. The older versions of Mesos don't show the network statistics in the /monitor/statistics endpoint unless the isolator is enabled. The difference is that the basic cpu and memory metrics still show. I've seen screenshots of other Mesos setups where the number of individual metrics outnumbers what is shown in our Mesos clusters. The last time I tried to hunt down how we could get our Mesos cluster to report those better details, I found myself slumped over in my couch from staying up all night reading through cgroup C++ code.. Various installations can have wildly inconsistent statistics pages.

cookandy commented 7 years ago

I'm running on VMs in a a VPS. It's strange that I can't get anything out of the /monitor/statistics endpoint, but I am able to get metrics out of /metrics/snapshot. Would this endpoint provide enough detail to use the autoscaler?

I have checked with other Mesos users and have heard similar reports of empty responses using the /monitor/statistics endpoint.

kernelpanek-segfault commented 7 years ago

Unfortunately, /metrics/snapshot only gives the high-level stats of the agent instance as a whole and in order to know a marathon application's usage or impact on that agent, we want the application performance details which is in the /monitor/statistics endpoint. The /state endpoint would be good if it actually gave utilization information and not just allocation information.

Even though your endpoint gives back an HTTP 200, are you using authentication on your agents? I'm wondering if this story was implemented in such a way that the endpoint still responds with a HTTP 200, but gives up nothing if no authentication header info is present. If that's the case, I'll gladly accept your PR if you implement auth for agent endpoint querying.

Ultimately, I'm going to open an issue to incorporate testing with multiple mesos/marathon versions using Dell/Emc's vagrant repo and see if we can squeeze this into TravisCI somehow.

cookandy commented 7 years ago

Hi @kernelpanek-tendril. I was able to get to the bottom of my problem...

I forgot to mention that I am running Mesos from within a Docker container. I started looking at the logs when I made the API request and saw errors like:

Failed to get resource statistics for executor 'sysdig-agent.1951556a-9ad9-11e6-a24a-0242aec327c9' of framework efaaca88-e937-4288-b929-2a0bd940e70a-0000: Failed to collect cgroup stats: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/2653/cgroup: Failed to open file: No such file or directory

...even though I had /proc mapped to /host/proc. I was under the impression that Mesos could read from /host/proc, but that's not the case. In order to get Mesos to find the process IDs, I had to run the Mesos container with --pid=host.

Thanks for the help! I will start digging into the autoscaler now. :)

cookandy commented 7 years ago

@kernelpanek-tendril:

I got the /monitor/statistics endpoint working, however, the app never scales under load with the following labels (using small durations for testing):

  "labels": {
  "use_marathon_autoscaler": "0.0.3",
  "min_instances": "1",
  "max_instances": "10",
  "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
  "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
  }

I can see in the logs that mem usage and cpu usage is above what I specified.

2016-11-01 23:23:46,205 | INFO | Stats differentials collected.
2016-11-01 23:23:46,207 | INFO | Decision process beginning.
2016-11-01 23:23:46,208 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-01 23:23:46,208 | DEBUG | content-providers: participating
2016-11-01 23:23:46,208 | INFO | content-providers: metrics: {'mem': 28.426930581835386, 'cpu': 49.20513632498708}
2016-11-01 23:23:46,208 | INFO | content-providers: last_triggered_rule set to: None
2016-11-01 23:23:46,208 | INFO | content-providers: vote: 0 ; scale_factor requested: 0
2016-11-01 23:23:46,209 | INFO | content-providers: application ready: True
2016-11-01 23:23:46,209 | INFO | content-providers: instances: min:1, running:1, max:10
2016-11-01 23:23:46,211 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-01 23:23:46,212 | DEBUG | content-providers: participating
2016-11-01 23:23:46,213 | INFO | Decisions are completed.

Any ideas?

kernelpanek-segfault commented 7 years ago

Your metric:

2016-11-01 23:23:46,208 | INFO | content-providers: metrics: {'mem': 28.426930581835386, 'cpu': 49.20513632498708}

Your fastscaleup rule:

  "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
  "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"

What your rule is set up for is: "If CPU utilization is greater than 10% and memory utilization is greater than 50% for 1 minute, 30 seconds; then scale the application +3 instances." However, your memory usage has not reached the >50 threshold, but your CPU utilization has surpassed the threshold.

If you are wanting the application to scale on either memory or CPU, then split the rule by providing different rule names, like so:

  "mas_rule_fastcpuscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
  "mas_rule_fastmemscaleup_1": "memory | >50 | PT10S | 3 | PT1M30S"
cookandy commented 7 years ago

Thanks @kernelpanek-tendril

I adjusted the rules as you suggested, and was able to cause the autoscaler to trigger. However, I never see new instances, despite seeing scale_factor requested: 3 in the logs..

2016-11-02 04:24:46,137 | INFO | Decision process beginning.
2016-11-02 04:24:46,138 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-02 04:24:46,138 | DEBUG | content-providers: participating
2016-11-02 04:24:46,138 | INFO | content-providers: metrics: {'mem': 16.858591532639547, 'cpu': 36.4259688760453}
2016-11-02 04:24:46,138 | INFO | content-providers: last_triggered_rule set to: [{'ruleInfo': {'rulePart': u'1', 'ruleName': u'fastcpuscaleup'}, 'ruleValue': {'scale_factor': u'3', 'weight': 1.0, 'threshold': {'val': u'10', 'op': u'>'}, 'metric': u'cpu', 'tolerance': u'PT10S', 'backoff': u'PT1M30S'}}]
2016-11-02 04:24:46,138 | INFO | content-providers: vote: 1 ; scale_factor requested: 3
2016-11-02 04:24:46,139 | INFO | content-providers: application ready: True
2016-11-02 04:24:46,139 | INFO | content-providers: instances: min:1, running:1, max:10
2016-11-02 04:24:46,139 | INFO | content-providers: tolerance window filled: True / 04:24:36.139353
2016-11-02 04:24:46,139 | DEBUG | content-providers: tolerance window filled: True / 04:24:36.139353
2016-11-02 04:24:46,139 | INFO | content-providers: within backoff window: False / 04:23:16.139694 - 04:24:46.139694
2016-11-02 04:24:46,139 | DEBUG | content-providers: []
2016-11-02 04:24:46,140 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-02 04:24:46,140 | DEBUG | content-providers: participating
2016-11-02 04:24:46,140 | INFO | Decisions are completed.

What happens here? DEBUG | content-providers: []

kernelpanek-segfault commented 7 years ago

That line with the empty list is the past decisions in the backoff window (or timeframe). No items should result in returning False with regard to checking if the system is within the backoff period. However, that isn't the problem. I'm concerned with there not being any log entry for "tolerance reached: {True/False}" before the backoff window check, but after a positive "tolerance window filled: True" entry. I see the issue with the code on why this happens, but the tolerance has not yet been met because the performance tail has nothing in it. I'll submit a fix for this: #19.

cookandy commented 7 years ago

hi @kernelpanek-tendril

Looks like tendrilinc/marathon-autoscaler/pull/21 was merged, but I'm still having trouble with my application scaling.

Here's my rule:

  "labels": {
      "use_marathon_autoscaler": "0.0.3",
      "min_instances": "1",
      "max_instances": "10",
      "mas_rule_fastcpuscaleup_1": "cpu | >10 | PT10S | 3 | PT30S"
  }

And here is the log from the autoscaler when I've got load on itL

2016-11-28 19:52:41,534 | INFO | Stats differentials collected.
2016-11-28 19:52:41,536 | INFO | Decision process beginning.
2016-11-28 19:52:41,537 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-28 19:52:41,538 | DEBUG | content-providers: participating
2016-11-28 19:52:41,538 | INFO | content-providers: metrics: {'mem': 25.595185250946972, 'cpu': 112.3698662867287}
2016-11-28 19:52:41,539 | INFO | content-providers: last_triggered_rule set to: [{'ruleInfo': {'rulePart': u'1', 'ruleName': u'fastcpuscaleup'}, 'ruleValue': {'scale_factor': u'3', 'weight': 1.0, 'threshold': {'val': u'10', 'op': u'>'}, 'metric': u'cpu', 'tolerance': u'PT10S', 'backoff': u'PT30S'}}]
2016-11-28 19:52:41,540 | INFO | content-providers: vote: 1 ; scale_factor requested: 3
2016-11-28 19:52:41,540 | INFO | content-providers: application ready: True
2016-11-28 19:52:41,540 | INFO | content-providers: instances: min:1, running:1, max:10
2016-11-28 19:52:41,541 | INFO | content-providers: tolerance window filled: True / 19:52:31.541472
2016-11-28 19:52:41,542 | DEBUG | content-providers: tolerance window filled: True / 19:52:31.541472
2016-11-28 19:52:41,542 | INFO | content-providers: tolerance reached: False / 19:52:31.541472 - 19:52:41.541472
2016-11-28 19:52:41,542 | DEBUG | content-providers: vote_list: []
2016-11-28 19:52:41,543 | INFO | content-providers: within backoff window: False / 19:52:11.543266 - 19:52:41.543266
2016-11-28 19:52:41,543 | DEBUG | content-providers: []
2016-11-28 19:52:41,544 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-28 19:52:41,544 | DEBUG | content-providers: participating
2016-11-28 19:52:41,545 | INFO | Decisions are completed.

I am now seeing the Tolerance reached message, but it is reporting False. Any ideas on why?

kernelpanek-segfault commented 7 years ago

@cookandy Can you send a longer log of activity via a gist?

cookandy commented 7 years ago

hi @kernelpanek-tendril

Here is a longer log... hopefully it's helpful:

https://gist.github.com/cookandy/a03562e41e3eeec11bafe13f2076f0dc

cookandy commented 7 years ago

Any ideas @kernelpanek-tendril?

kernelpanek-segfault commented 7 years ago

@cookandy I think I found my mistake in one of my prior commits in the history_manager, where I was attempting to remove an early return statement. The history_manager did not have any good unit tests, so the change was not flagged. I've been working on corrections for all of this... a fix, and tests for history_manager.

cookandy commented 7 years ago

brilliant, thanks @kernelpanek-tendril! looking forward to the fixes :)

jmparraguez-haulmer commented 7 years ago

@kernelpanek-tendril when you say

Your fastscaleup rule:

"mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
"mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"

What your rule is set up for is: "If CPU utilization is greater than 10% and memory utilization is greater than 50% for 1 minute, 30 seconds; then scale the application +3 instances." However, your memory usage has not reached the >50 threshold, but your CPU utilization has surpassed the threshold.

how fits PT10S on the sentence

cookandy commented 7 years ago

any update @kernelpanek-tendril?

kernelpanek-segfault commented 7 years ago

@jmparra The PT10S is the backoff timespan. Once a rule is triggered and acted upon, the same rule should not be acted upon for the duration of the backoff timespan after the last action of the rule. So, when the mas_rule_fastscaleup rule is used to scale the cluster up by 3 at 12:00:00am, it cannot be used again until 12:00:10am.

kernelpanek-segfault commented 7 years ago

@cookandy Sorry for the delay. I got terribly sick after re:Invent this year and had a project with an EOCY deadline to finish. However, I am now on holiday break, so I am picking up where I left off.