cookandy opened 7 years ago
I also wanted to mention that I am using Mesos 1.0.1 and Marathon 1.3.2.
Change your `use_marathon_autoscaler` label:

```
"labels": {
    "use_marathon_autoscaler": "0.0.3",
    "min_instances": "1",
    "max_instances": "10",
    "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
    "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
}
```
TL;DR:
We originally used `use_marathon_autoscaler` as just a boolean to check for an application's participation. Later, when we improved the thresholds capability to be aware of the newer format you're using, we began looking for the version number at which we introduced the newer thresholds format. We still accept a boolean value, but the autoscaler assumes you're using the older thresholds format if you use `true`. This was done so original applications could keep their older format and the autoscaler would remain backward compatible.
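A rough sketch of that label handling (illustrative only; the function name, return values, and version-cutoff check are mine, not the autoscaler's actual code):

```python
def thresholds_format(label_value):
    """Decide which thresholds format applies to an app, based on its
    use_marathon_autoscaler label. Booleans keep the legacy format for
    backward compatibility; a version string at or above the release
    that introduced the new rules format selects the new format."""
    if label_value.lower() in ("true", "yes", "1"):
        return "legacy"  # boolean opt-in -> original thresholds format
    try:
        version = tuple(int(part) for part in label_value.split("."))
    except ValueError:
        return "none"  # unrecognized value -> not participating
    # "0.0.3" stands in for the release that introduced the new format
    return "new" if version >= (0, 0, 3) else "legacy"
```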
Just for reference:
We have not yet moved our Mesos clusters to the new version you are running, but in the meantime you could place a modified logging_config.json on your Marathon agent hosts, add a volume to the autoscaler's app definition that points to this JSON file, and run the autoscaler with the `LOG_CONFIG` environment variable set to the container path of this modified logging configuration.
In logging_config.json, set `/root/level` to `DEBUG`:

```
"root": {
    "level": "DEBUG",
    "handlers": ["console"]
}
```
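That fragment drops into a standard Python `logging` dictConfig; a minimal complete configuration might look like the sketch below (the formatter name and format string are illustrative, and the actual logging_config.json shipped with the autoscaler may define more handlers):

```python
import logging
import logging.config

# A minimal, complete logging configuration containing the fragment above.
LOG_CONFIG = {
    "version": 1,
    "formatters": {
        "plain": {"format": "%(asctime)s | %(levelname)s | %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "plain",
        },
    },
    "root": {
        "level": "DEBUG",        # the change that enables DEBUG output
        "handlers": ["console"],
    },
}

logging.config.dictConfig(LOG_CONFIG)
logging.getLogger(__name__).debug("debug logging enabled")
```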
This will give you more insight into the decision-making process, and I believe we also log which version of the thresholds format it plans to look for based on the `use_marathon_autoscaler` value.
hi @kernelpanek-tendril - thanks for the response.
I enabled DEBUG logging and changed the labels as you suggested, but I'm not getting much back. Perhaps it's my version of Mesos?
```
2016-10-27 20:54:09,694 | INFO | Namespace(agent_port=5051, cpu_fan_out=None, datadog_api_key=None, datadog_app_key=None, datadog_env=None, enforce_version_match=False, log_config='/logging-config.json', marathon_pass=None, marathon_uri='http://10.10.22.239:8080', marathon_user=None, mesos_uri='http://10.10.22.239:5050', rules_prefix='mas_rule', sleep_interval=5)
2016-10-27 20:54:09,695 | INFO | Mesos and Marathon Connections Established.
2016-10-27 20:54:09,811 | DEBUG | (u'10.10.23.59', [])
2016-10-27 20:54:09,847 | DEBUG | (u'10.10.15.87', [])
2016-10-27 20:54:09,914 | DEBUG | (u'10.10.26.172', [])
2016-10-27 20:54:09,935 | DEBUG | (u'10.10.26.178', [])
2016-10-27 20:54:10,938 INFO success: marathon_autoscaler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-10-27 20:54:13,015 | DEBUG | (u'10.10.23.59', [])
2016-10-27 20:54:13,065 | DEBUG | (u'10.10.15.87', [])
2016-10-27 20:54:13,147 | DEBUG | (u'10.10.26.178', [])
2016-10-27 20:54:13,170 | DEBUG | (u'10.10.26.172', [])
2016-10-27 20:54:16,174 | INFO | Stats differentials collected.
2016-10-27 20:54:16,175 | INFO | Decision process beginning.
2016-10-27 20:54:16,175 | INFO | Decisions are completed.
2016-10-27 20:54:21,280 | DEBUG | (u'10.10.23.59', [])
2016-10-27 20:54:21,313 | DEBUG | (u'10.10.15.87', [])
2016-10-27 20:54:21,391 | DEBUG | (u'10.10.26.172', [])
2016-10-27 20:54:21,493 | DEBUG | (u'10.10.26.178', [])
2016-10-27 20:54:24,571 | DEBUG | (u'10.10.23.59', [])
2016-10-27 20:54:24,611 | DEBUG | (u'10.10.15.87', [])
2016-10-27 20:54:24,651 | DEBUG | (u'10.10.26.172', [])
2016-10-27 20:54:24,721 | DEBUG | (u'10.10.26.178', [])
```
It continues on like that, without ever changing...
We don't have that combination of Mesos/Marathon to test with, but what stands out in your log (rather, what's missing) is the agent data. Each agent is coming back with nothing for its statistics: each of those new DEBUG lines shows an empty list from the call that queries the `/monitor/statistics` path of each agent.
If you can ssh to the agent where the autoscaler is running, try to curl the path on one of the other agents:

```
curl http://10.10.23.59:5051/monitor/statistics
```
Does it time out, respond with something other than HTTP 200, or simply give back no data? My instinct tells me that we grossly underestimated the number of ways Mesos and Marathon can be configured, and our code does not yet accommodate all of those possibilities.
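The manual curl check can also be scripted across all agents; a stdlib-only sketch (the helper names are mine, and the agent addresses passed in would be the ones from your cluster):

```python
import json
import urllib.request

AGENT_PORT = 5051  # default Mesos agent port, as in this thread

def stats_url(host, port=AGENT_PORT):
    return f"http://{host}:{port}/monitor/statistics"

def agent_stats(host, port=AGENT_PORT, timeout=5):
    """Fetch an agent's executor statistics. An empty list ([]) matches
    the empty DEBUG entries the autoscaler is logging."""
    with urllib.request.urlopen(stats_url(host, port), timeout=timeout) as resp:
        return json.load(resp)

def check_agents(hosts):
    """Print one line per agent: unreachable, empty, or executor count."""
    for host in hosts:
        try:
            stats = agent_stats(host)
        except OSError as exc:
            print(f"{host}: unreachable ({exc})")
            continue
        status = "no executor stats!" if not stats else f"{len(stats)} executors"
        print(f"{host}: {status}")

# e.g. check_agents(["10.10.23.59", "10.10.15.87"])
```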
Looks like I get a `200`, but nothing there:
```
root@03:~# curl -v http://10.10.23.59:5051/monitor/statistics
* Trying 10.10.23.59...
* Connected to 10.10.23.59 (10.10.23.59) port 5051 (#0)
> GET /monitor/statistics HTTP/1.1
> Host: 10.134.15.87:5051
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 28 Oct 2016 19:11:21 GMT
< Content-Length: 2
< Content-Type: application/json
<
* Connection #0 to host 10.134.15.87 left intact
[]
```
Just `[]`.
Do you run multiple instances of Mesos agents on the same hosts, where the path would be:

```
curl -v http://10.10.23.59:5051/slave(1)/monitor/statistics
```
No, just a single Mesos agent per host.
```
root@03:~# curl -v "http://10.10.23.59:5051/slave(1)/monitor/statistics"
* Trying 10.10.23.59...
* Connected to 10.10.23.59 (10.10.23.59) port 5051 (#0)
> GET /slave(1)/monitor/statistics HTTP/1.1
> Host: 10.134.23.59:5051
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sat, 29 Oct 2016 04:05:54 GMT
< Content-Length: 2
< Content-Type: application/json
<
* Connection #0 to host 10.134.23.59 left intact
[]
```
I'm not sure why your agents are not posting their executor stats, but I'm curious enough to find out how Mesos ends up in a configuration where agents stop reporting executor statistics.
Just out of curiosity, does your Mesos master web UI report the CPU and memory usage of executors? When I wrote this app, I used the same approach the web UI's Javascript takes to calculate CPU and memory usage.
hi @kernelpanek
I'm happy to try and help troubleshoot... it looks like the Mesos agents are reporting the correct number of CPUs and amount of memory:
Click on one of the agent ID links and that should bring up the page that I'm thinking does dynamic updates on CPU/memory utilization. The utilization shows on the left-hand side of the page.
Oh, you're right - weird! The resource usage shows 0.
Are you running on your own physical hardware or VMs? I'm grasping at straws here, but I'm not sure how the new resource isolators work in Mesos. Older versions of Mesos don't show network statistics in the `/monitor/statistics` endpoint unless the isolator is enabled; the difference is that the basic CPU and memory metrics still show. I've seen screenshots of other Mesos setups where the number of individual metrics outnumbers what is shown in our Mesos clusters. The last time I tried to hunt down how we could get our Mesos cluster to report those better details, I found myself slumped over on my couch from staying up all night reading through cgroup C++ code. Various installations can have wildly inconsistent statistics pages.
I'm running on VMs in a VPS. It's strange that I can't get anything out of the `/monitor/statistics` endpoint, but I am able to get metrics out of `/metrics/snapshot`. Would this endpoint provide enough detail to use the autoscaler?
I have checked with other Mesos users and have heard similar reports of empty responses from the `/monitor/statistics` endpoint.
Unfortunately, `/metrics/snapshot` only gives the high-level stats of the agent instance as a whole; to know a Marathon application's usage or impact on that agent, we want the application performance details, which are in the `/monitor/statistics` endpoint. The `/state` endpoint would be good if it actually gave utilization information and not just allocation information.
Even though your endpoint gives back an HTTP 200, are you using authentication on your agents? I'm wondering if this feature was implemented in such a way that the endpoint still responds with an HTTP 200 but gives up nothing if no authentication header is present. If that's the case, I'll gladly accept your PR if you implement auth for agent endpoint querying.
Ultimately, I'm going to open an issue to incorporate testing with multiple Mesos/Marathon versions using Dell EMC's Vagrant repo and see if we can squeeze this into Travis CI somehow.
Hi @kernelpanek-tendril. I was able to get to the bottom of my problem...
I forgot to mention that I am running Mesos from within a Docker container. I started looking at the logs when I made the API request and saw errors like:

```
Failed to get resource statistics for executor 'sysdig-agent.1951556a-9ad9-11e6-a24a-0242aec327c9' of framework efaaca88-e937-4288-b929-2a0bd940e70a-0000: Failed to collect cgroup stats: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/2653/cgroup: Failed to open file: No such file or directory
```
...even though I had `/proc` mapped to `/host/proc`. I was under the impression that Mesos could read from `/host/proc`, but that's not the case. In order for Mesos to find the process IDs, I had to run the Mesos container with `--pid=host`.
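The failure mode is easy to reproduce: Mesos reads cgroup membership from the fixed path `/proc/<pid>/cgroup`, so bind-mounting the host's `/proc` somewhere else doesn't help; the container must share the host PID namespace for the task PIDs to exist under `/proc` at all. A small sketch of that lookup (the helper is mine, for illustration):

```python
def cgroup_info(pid):
    """Read /proc/<pid>/cgroup the way Mesos does when collecting
    executor statistics. Without --pid=host, the container's PID
    namespace hides the task PIDs, so this read fails with the same
    'No such file or directory' seen in the Mesos log above."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            return f.read()
    except FileNotFoundError:
        return None  # PID not visible in this PID namespace

# A PID that cannot exist yields None, mirroring the failed read.
print(cgroup_info(2 ** 22 + 12345) is None)  # True
```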
Thanks for the help! I will start digging into the autoscaler now. :)
@kernelpanek-tendril: I got the `/monitor/statistics` endpoint working; however, the app never scales under load with the following labels (using small durations for testing):
```
"labels": {
    "use_marathon_autoscaler": "0.0.3",
    "min_instances": "1",
    "max_instances": "10",
    "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
    "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
}
```
I can see in the logs that `mem` usage and `cpu` usage are above what I specified.
```
2016-11-01 23:23:46,205 | INFO | Stats differentials collected.
2016-11-01 23:23:46,207 | INFO | Decision process beginning.
2016-11-01 23:23:46,208 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-01 23:23:46,208 | DEBUG | content-providers: participating
2016-11-01 23:23:46,208 | INFO | content-providers: metrics: {'mem': 28.426930581835386, 'cpu': 49.20513632498708}
2016-11-01 23:23:46,208 | INFO | content-providers: last_triggered_rule set to: None
2016-11-01 23:23:46,208 | INFO | content-providers: vote: 0 ; scale_factor requested: 0
2016-11-01 23:23:46,209 | INFO | content-providers: application ready: True
2016-11-01 23:23:46,209 | INFO | content-providers: instances: min:1, running:1, max:10
2016-11-01 23:23:46,211 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-01 23:23:46,212 | DEBUG | content-providers: participating
2016-11-01 23:23:46,213 | INFO | Decisions are completed.
```
Any ideas?
Your metric:

```
2016-11-01 23:23:46,208 | INFO | content-providers: metrics: {'mem': 28.426930581835386, 'cpu': 49.20513632498708}
```

Your `fastscaleup` rule:

```
"mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
"mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
```
What your rule is set up for is: "If CPU utilization is greater than 10% AND memory utilization is greater than 50% for 1 minute, 30 seconds, then scale the application up by 3 instances." However, your memory usage has not reached the `>50` threshold, even though your CPU utilization has surpassed its threshold, so the combined rule never fires.
If you want the application to scale on either memory or CPU, then split the rule by providing different rule names, like so:

```
"mas_rule_fastcpuscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
"mas_rule_fastmemscaleup_1": "memory | >50 | PT10S | 3 | PT1M30S"
```
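For reference, a rule string can be unpacked like this (a sketch, not the autoscaler's own parser; the field names follow the parsed-rule entries visible in the autoscaler's log output elsewhere in this thread):

```python
def parse_rule(rule_string):
    """Split a 'metric | threshold | duration | scale_factor | duration'
    rule label such as 'cpu | >10 | PT10S | 3 | PT1M30S'."""
    metric, threshold, tolerance, scale_factor, backoff = (
        part.strip() for part in rule_string.split("|")
    )
    return {
        "metric": metric,                   # "cpu" or "memory"
        "op": threshold[0],                 # comparison operator, e.g. ">"
        "val": float(threshold[1:]),        # threshold value (percent)
        "tolerance": tolerance,             # ISO-8601 duration
        "scale_factor": int(scale_factor),  # instances to add when triggered
        "backoff": backoff,                 # ISO-8601 duration
    }

rule = parse_rule("cpu | >10 | PT10S | 3 | PT1M30S")
```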
Thanks @kernelpanek-tendril. I adjusted the rules as you suggested and was able to cause the autoscaler to trigger. However, I never see new instances, despite seeing `scale_factor requested: 3` in the logs:
```
2016-11-02 04:24:46,137 | INFO | Decision process beginning.
2016-11-02 04:24:46,138 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-02 04:24:46,138 | DEBUG | content-providers: participating
2016-11-02 04:24:46,138 | INFO | content-providers: metrics: {'mem': 16.858591532639547, 'cpu': 36.4259688760453}
2016-11-02 04:24:46,138 | INFO | content-providers: last_triggered_rule set to: [{'ruleInfo': {'rulePart': u'1', 'ruleName': u'fastcpuscaleup'}, 'ruleValue': {'scale_factor': u'3', 'weight': 1.0, 'threshold': {'val': u'10', 'op': u'>'}, 'metric': u'cpu', 'tolerance': u'PT10S', 'backoff': u'PT1M30S'}}]
2016-11-02 04:24:46,138 | INFO | content-providers: vote: 1 ; scale_factor requested: 3
2016-11-02 04:24:46,139 | INFO | content-providers: application ready: True
2016-11-02 04:24:46,139 | INFO | content-providers: instances: min:1, running:1, max:10
2016-11-02 04:24:46,139 | INFO | content-providers: tolerance window filled: True / 04:24:36.139353
2016-11-02 04:24:46,139 | DEBUG | content-providers: tolerance window filled: True / 04:24:36.139353
2016-11-02 04:24:46,139 | INFO | content-providers: within backoff window: False / 04:23:16.139694 - 04:24:46.139694
2016-11-02 04:24:46,139 | DEBUG | content-providers: []
2016-11-02 04:24:46,140 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-02 04:24:46,140 | DEBUG | content-providers: participating
2016-11-02 04:24:46,140 | INFO | Decisions are completed.
```
What happens here? `DEBUG | content-providers: []`
That line with the empty list is the past decisions in the backoff window (or timeframe). No items should result in returning `False` when checking whether the system is within the backoff period. However, that isn't the problem. I'm concerned that there is no log entry for "tolerance reached: {True/False}" before the backoff window check, but after a positive "tolerance window filled: True" entry. I see why this happens in the code: the tolerance has not yet been met because the performance tail has nothing in it. I'll submit a fix for this: #19.
hi @kernelpanek-tendril
Looks like tendrilinc/marathon-autoscaler/pull/21 was merged, but I'm still having trouble with my application scaling.
Here's my rule:
```
"labels": {
    "use_marathon_autoscaler": "0.0.3",
    "min_instances": "1",
    "max_instances": "10",
    "mas_rule_fastcpuscaleup_1": "cpu | >10 | PT10S | 3 | PT30S"
}
```
And here is the log from the autoscaler when I've got load on it:
```
2016-11-28 19:52:41,534 | INFO | Stats differentials collected.
2016-11-28 19:52:41,536 | INFO | Decision process beginning.
2016-11-28 19:52:41,537 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-28 19:52:41,538 | DEBUG | content-providers: participating
2016-11-28 19:52:41,538 | INFO | content-providers: metrics: {'mem': 25.595185250946972, 'cpu': 112.3698662867287}
2016-11-28 19:52:41,539 | INFO | content-providers: last_triggered_rule set to: [{'ruleInfo': {'rulePart': u'1', 'ruleName': u'fastcpuscaleup'}, 'ruleValue': {'scale_factor': u'3', 'weight': 1.0, 'threshold': {'val': u'10', 'op': u'>'}, 'metric': u'cpu', 'tolerance': u'PT10S', 'backoff': u'PT30S'}}]
2016-11-28 19:52:41,540 | INFO | content-providers: vote: 1 ; scale_factor requested: 3
2016-11-28 19:52:41,540 | INFO | content-providers: application ready: True
2016-11-28 19:52:41,540 | INFO | content-providers: instances: min:1, running:1, max:10
2016-11-28 19:52:41,541 | INFO | content-providers: tolerance window filled: True / 19:52:31.541472
2016-11-28 19:52:41,542 | DEBUG | content-providers: tolerance window filled: True / 19:52:31.541472
2016-11-28 19:52:41,542 | INFO | content-providers: tolerance reached: False / 19:52:31.541472 - 19:52:41.541472
2016-11-28 19:52:41,542 | DEBUG | content-providers: vote_list: []
2016-11-28 19:52:41,543 | INFO | content-providers: within backoff window: False / 19:52:11.543266 - 19:52:41.543266
2016-11-28 19:52:41,543 | DEBUG | content-providers: []
2016-11-28 19:52:41,544 | DEBUG | Version matching is enforced. Version: 0.0.3
2016-11-28 19:52:41,544 | DEBUG | content-providers: participating
2016-11-28 19:52:41,545 | INFO | Decisions are completed.
```
I am now seeing the `tolerance reached` message, but it is reporting `False`. Any ideas why?
@cookandy Can you send a longer log of activity via a gist?
hi @kernelpanek-tendril
Here is a longer log... hopefully it's helpful:
https://gist.github.com/cookandy/a03562e41e3eeec11bafe13f2076f0dc
Any ideas @kernelpanek-tendril?
@cookandy I think I found my mistake in one of my prior commits to the history_manager, where I was attempting to remove an early return statement. The history_manager did not have any good unit tests, so the change was not flagged. I've been working on corrections for all of this: a fix, plus tests for history_manager.
brilliant, thanks @kernelpanek-tendril! looking forward to the fixes :)
@kernelpanek-tendril, when you say:

> Your fastscaleup rule:
>
> ```
> "mas_rule_fastscaleup_1": "cpu | >10 | PT10S | 3 | PT1M30S",
> "mas_rule_fastscaleup_2": "memory | >50 | PT10S | 3 | PT1M30S"
> ```
>
> What your rule is set up for is: "If CPU utilization is greater than 10% and memory utilization is greater than 50% for 1 minute, 30 seconds; then scale the application +3 instances." However, your memory usage has not reached the >50 threshold, but your CPU utilization has surpassed the threshold.

how does `PT10S` fit into that sentence?
any update @kernelpanek-tendril?
@jmparra The `PT10S` is the backoff timespan. Once a rule is triggered and acted upon, the same rule should not be acted upon for the duration of the backoff timespan after the last action of the rule. So, when the `mas_rule_fastscaleup` rule is used to scale the cluster up by 3 at 12:00:00am, it cannot be used again until 12:00:10am.
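That backoff arithmetic can be sketched as follows (a hand-rolled parser for the simple PT...S durations used in these labels, not a full ISO-8601 implementation):

```python
import re
from datetime import datetime, timedelta

def parse_duration(value):
    """Parse the simple PT<minutes>M<seconds>S durations used in these
    labels, e.g. "PT10S" or "PT1M30S". Not a general ISO-8601 parser."""
    match = re.fullmatch(r"PT(?:(\d+)M)?(?:(\d+)S)?", value)
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(minutes=minutes, seconds=seconds)

# Rule acted on at 12:00:00am; with a PT10S backoff it may not act
# again until 12:00:10am.
triggered_at = datetime(2016, 11, 2, 0, 0, 0)
next_allowed = triggered_at + parse_duration("PT10S")
print(next_allowed.time())  # 00:00:10
```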
@cookandy Sorry for the delay. I got terribly sick after re:Invent this year and had a project with an EOCY deadline to finish. However, I am now on holiday break, so I am picking up where I left off.
Hi,
Thanks for the project - and thanks for the many enhancements recently. I'm having trouble getting the autoscaler working. I just built the docker image and ran it with:
```
docker run -e MESOS_URI=http://x.x.x.x:5050 -e MARATHON_URI=http://x.x.x.x:8080 -e AGENT_PORT=5051 marathon-autoscaler
```
From the startup logs, it looks like everything is running ok:
However, the app never scales under load with the following labels (using small durations for testing):
Do you see any reason why this wouldn't work?