openfaas / faas-netes

Serverless Functions For Kubernetes
https://www.openfaas.com
MIT License

Investigate auto-scaling with multiple functions #241

Closed: dkozlov closed this issue 6 years ago

dkozlov commented 6 years ago

Expected Behaviour

Two functions should be scaled simultaneously

Current Behaviour

NAME                             READY     STATUS    RESTARTS   AGE
function1-79cdfc5c56-2dxlq   1/1       Running   0          20s
function1-79cdfc5c56-2j5tm   1/1       Running   0          20s
function1-79cdfc5c56-4xctc   1/1       Running   0          1m
function1-79cdfc5c56-7wkzc   1/1       Running   0          26m
function1-79cdfc5c56-8kzrz   1/1       Running   0          1m
function1-79cdfc5c56-9jrfm   1/1       Running   0          20s
function1-79cdfc5c56-g9hfr   1/1       Running   0          1m
function1-79cdfc5c56-jcts5   1/1       Running   0          26m
function1-79cdfc5c56-lkpqg   1/1       Running   0          1m
function1-79cdfc5c56-mmlw8   1/1       Running   0          1m
function1-79cdfc5c56-pn8n7   1/1       Running   0          20s
function1-79cdfc5c56-sz9zq   1/1       Running   0          1m
function1-79cdfc5c56-w8xc7   1/1       Running   0          20s
function1-79cdfc5c56-zmnkp   1/1       Running   0          20s
function1-79cdfc5c56-zs542   1/1       Running   0          26m
function2-64cd5b486d-ptzm2      1/1       Running   0          26m
function2-64cd5b486d-rhm52      1/1       Running   0          26m
function2-64cd5b486d-tl64m      1/1       Running   0          26m

Only the first function was scaled during a concurrent performance test with https://github.com/locustio/locust

Possible Solution

Steps to Reproduce (for bugs)

  1. Create two functions with the following labels: "com.openfaas.scale.min": 3, "com.openfaas.scale.max": 10, "com.openfaas.scale.factor": 20 (see the sketch after this list).
  2. I performed concurrent performance testing of the two functions with locust (https://github.com/locustio/locust); the default OpenFaaS auto-scaling appears to work only for the first function, while the second function is not scaled. The load was uniform across both functions.
  3. Both functions use of-watchdog (https://github.com/openfaas-incubator/of-watchdog/releases/download/0.2.3/of-watchdog).
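As a sketch of step 1 (the gateway address, credentials, and image name below are placeholders, not the ones actually used), the two functions could be deployed with identical scaling labels via the gateway REST API like this:

# Sketch only: deploy two functions with the same scaling labels through the
# OpenFaaS gateway API. Gateway URL, basic-auth credentials, and the image
# name are assumptions, not taken from the original report.
import requests

GATEWAY = "http://127.0.0.1:31112"   # assumed NodePort gateway address
AUTH = ("admin", "password")         # assumed basic-auth credentials

labels = {
    "com.openfaas.scale.min": "3",
    "com.openfaas.scale.max": "10",
    "com.openfaas.scale.factor": "20",
}

for name in ("function1", "function2"):
    payload = {
        "service": name,
        "image": "example/of-watchdog-function:latest",  # hypothetical image
        "labels": labels,
    }
    r = requests.post(f"{GATEWAY}/system/functions", json=payload, auth=AUTH)
    r.raise_for_status()
    print(f"deployed {name}: HTTP {r.status_code}")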

Context

Your Environment

Server:
 Engine:
  Version:       17.12.1-ce
  API version:   1.35 (minimum version 1.12)
  Go version:    go1.9.4
  Git commit:    7390fc6
  Built:         Tue Feb 27 22:16:13 2018
  OS/Arch:       linux/amd64
  Experimental:  false

alexellis commented 6 years ago

Hi @dkozlov we're looking into this with @ivanayov helping.

Please provide us with your function as an image on the Docker Hub, and if you can't share the code then push the equivalent of hello world. Thanks

Alex

ivanayov commented 6 years ago

Hi @dkozlov ,

I did a simple test, running two functions together asynchronously and they both scaled. I'm trying to reproduce the issue.

dkozlov commented 6 years ago

Hi @ivanayov, I was running the two functions synchronously, distributing load across both functions.

ivanayov commented 6 years ago

@dkozlov

I did a test with locustfile.py:

from locust import HttpLocust, TaskSet

def test1(l):
    l.client.get("http://127.0.0.1:8080/function/test1")

def test2(l):
    l.client.get("http://127.0.0.1:8080/function/test2")

class UserBehavior(TaskSet):
    tasks = {test1: 1, test2: 1}

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 1000
    max_wait = 1000

and labels:

    labels:
      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 10
      com.openfaas.scale.factor: 20

of-watchdog: https://github.com/openfaas-incubator/of-watchdog/releases/download/0.2.5/of-watchdog

Both scale up to 10.

I tried replacing the GET requests with POST and scaling back to 0. This was the only case where I reproduced the issue.

Then I went back to GET, scaled to 0, and then got 10 replicas for both after a few seconds.

Is there any specific use case in your functions? Any related information you can share would be useful.

dkozlov commented 6 years ago

Actually I am using the POST method in locust, so can you reproduce this issue?
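For reference, a POST-based variant of the locustfile shown earlier might look like the sketch below (the request body and the run command are assumptions):

# Sketch of a POST-based locustfile, using the same pre-1.0 locust API
# (HttpLocust/TaskSet) as the earlier example. The request body is assumed.
from locust import HttpLocust, TaskSet

def test1(l):
    l.client.post("/function/test1", data="test")

def test2(l):
    l.client.post("/function/test2", data="test")

class UserBehavior(TaskSet):
    tasks = {test1: 1, test2: 1}   # equal weights keep the load uniform

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 1000
    max_wait = 1000

# Run against the gateway, e.g.: locust -f locustfile_post.py --host=http://127.0.0.1:8080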

alexellis commented 6 years ago

@dkozlov

There are too many variables; please publish your function code, or a comparable function where you still observe the behaviour.

ivanayov commented 6 years ago

These are the tests I did with post:

1) func1:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 10

func2:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 10

Result: only func1 is invoked

2) func1:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 3

func2:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 3

Result: only func1 is invoked

3) func1:

      com.openfaas.scale.min: 1
      com.openfaas.scale.max: 1

func2:

      com.openfaas.scale.min: 1
      com.openfaas.scale.max: 1

Result: both are invoked and scale

4) func1:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 4

func2:

      com.openfaas.scale.min: 1
      com.openfaas.scale.max: 1

Result: both are invoked and scale

5) func1:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 4

func2:

      com.openfaas.scale.min: 2
      com.openfaas.scale.max: 5

Result: both are invoked and scale

6) func1:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 4

func2:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 4

Result: both are invoked and scale

This makes me think it's more likely a locust issue - I find no strong relation between scaling parameters and final results.

@dkozlov did you reproduce with another tool?

If there is an issue with scaling, we would like to find and fix it. With the current test this can't be verified.
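One locust-independent check (a sketch only; the gateway URL, function names, and request counts are assumptions) is to drive both functions concurrently from plain Python, which takes locust out of the picture entirely:

# Sketch: concurrent POSTs against both functions without locust, so any
# difference in scaling cannot be attributed to the load-testing tool.
import concurrent.futures
import requests

GATEWAY = "http://127.0.0.1:8080"          # assumed gateway address
FUNCTIONS = ["function1", "function2"]
REQUESTS_PER_WORKER = 2000                 # keeps the test running well past 30 seconds

def invoke(name):
    ok = 0
    with requests.Session() as session:
        for _ in range(REQUESTS_PER_WORKER):
            r = session.post(f"{GATEWAY}/function/{name}", data="test")
            ok += r.status_code == 200
    return name, ok

with concurrent.futures.ThreadPoolExecutor(max_workers=2 * len(FUNCTIONS)) as pool:
    # two worker threads per function, started together, keep the load roughly uniform
    futures = [pool.submit(invoke, name) for name in FUNCTIONS for _ in range(2)]
    for f in concurrent.futures.as_completed(futures):
        name, ok = f.result()
        print(f"{name}: {ok} successful invocations from one worker")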

dkozlov commented 6 years ago

@ivanayov I will try other tools, but it will not be soon. Obviously locust does not know anything about the OpenFaaS scaling params. Have you noticed different scaling behavior when reproducing the following test cases several times?

func1:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 10

func2:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 10

Result: only func1 is invoked

func1:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 3

func2:

      com.openfaas.scale.min: 3
      com.openfaas.scale.max: 3

Result: only func1 is invoked

alexellis commented 6 years ago

What do you mean by "invoked", do you mean "scaled"?

alexellis commented 6 years ago

We need your specific function or something similar to reproduce the issue. I've asked 3 times in the thread now (maybe 4); please make this your next action item. https://github.com/openfaas/faas-netes/issues/241#issuecomment-404597005

dkozlov commented 6 years ago

I cannot share my function code, sorry. But why do I need to? @ivanayov reproduced this issue without my code.

alexellis commented 6 years ago

There are two options I gave: the function image, or if that wasn't possible, a function built with the same template. You've not even specified which language runtime is being used. Are you using a standard template from the OpenFaaS project? How long does the function take to run? What is the latency/round-trip?

Can you share the JSON alerts from your gateway logs in a Gist (don't paste them into the GitHub issue)?

I can't see Ivana say anywhere in this issue that she reproduced the scaling. I can see a comment about being "invoked", but I don't know what this means.

alexellis commented 6 years ago

Seeing as it's hard to obtain the specific information needed to reproduce any issue accurately, let's switch things around.

Please use this container image: alexellis2/scale-op:0.1.0

It's written in Golang, uses of-watchdog / http and is hello world.

Please deploy twice.

export OPENFAAS_URL=http://127.0.0.1:31112
faas-cli deploy --image alexellis2/scale-op:0.1.0 --name scale-op-1
faas-cli deploy --image alexellis2/scale-op:0.1.0 --name scale-op-2

Then trigger with two concurrent hey instances:

export GOPATH=$HOME/go/
go get -u github.com/rakyll/hey
export PATH=$PATH:$GOPATH/bin

These parameters keep the QPS per worker around 10: high enough to trigger scaling, low enough not to throttle the CPU.

Terminal 1

hey -d="test" -m=POST -q 10 -c 10 -n 5000 $OPENFAAS_URL/function/scale-op-1

Terminal 2

hey -d="test" -m=POST -q 10 -c 10 -n 5000 $OPENFAAS_URL/function/scale-op-2

Increase the -n amount to adjust the duration; the test needs to run for over 30 seconds to trigger scaling.

Now open AlertManager and paste the full alert data into a new Gist. Do the same for the gateway logs.

You can also monitor this query on Prometheus:

sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]))
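Replica counts can also be sampled directly from the gateway while the test runs; the sketch below assumes the gateway address, basic-auth credentials, and the shape of the /system/functions listing:

# Sketch: sample replica counts from the gateway's function list every few
# seconds while the hey test runs, to see both functions scale (or not).
import time
import requests

GATEWAY = "http://127.0.0.1:31112"   # assumed gateway address
AUTH = ("admin", "password")         # assumed basic-auth credentials
WATCH = {"scale-op-1", "scale-op-2"}

for _ in range(60):                  # roughly three minutes at a three-second interval
    resp = requests.get(f"{GATEWAY}/system/functions", auth=AUTH)
    resp.raise_for_status()
    replicas = {fn["name"]: fn.get("replicas") for fn in resp.json() if fn["name"] in WATCH}
    print(time.strftime("%H:%M:%S"), replicas)
    time.sleep(3)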

Thanks,

Alex

alexellis commented 6 years ago

My findings: auto-scaling worked correctly (this was using Docker Swarm).

Two separate alerts firing

[Screenshot: 2018-07-12 21:38:19]

Monitoring the query:

[Screenshot: 2018-07-12 21:38:12]

So 10 QPS per worker across 10 concurrent workers = 10x10 = 100 QPS over the 10s window, as shown/expected. The alerting rule fires at 5 QPS, so we can lower the traffic if needed and still trigger the behaviour.

Seeing the scaling for both functions independently:

func_gateway.1.xi8gqsi7wbu6@nuc    | 2018/07/12 20:38:21 Alert received.
func_gateway.1.xi8gqsi7wbu6@nuc    | 2018/07/12 20:38:21 {"receiver":"scale-up","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"APIHighInvocationRate","function_name":"scale-op-1","monitor":"faas-monitor","service":"gateway","severity":"major","value":"100"},"annotations":{"description":"High invocation total on ","summary":"High invocation total on "},"startsAt":"2018-07-12T20:38:16.176126585Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://ec48985f44d4:9090/graph?g0.expr=sum+by%28function_name%29+%28rate%28gateway_function_invocation_total%7Bcode%3D%22200%22%7D%5B10s%5D%29%29+%3E+5\u0026g0.tab=1"},{"status":"firing","labels":{"alertname":"APIHighInvocationRate","function_name":"scale-op-2","monitor":"faas-monitor","service":"gateway","severity":"major","value":"96.58"},"annotations":{"description":"High invocation total on ","summary":"High invocation total on "},"startsAt":"2018-07-12T20:38:16.176126585Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://ec48985f44d4:9090/graph?g0.expr=sum+by%28function_name%29+%28rate%28gateway_function_invocation_total%7Bcode%3D%22200%22%7D%5B10s%5D%29%29+%3E+5\u0026g0.tab=1"}],"groupLabels":{"alertname":"APIHighInvocationRate","service":"gateway"},"commonLabels":{"alertname":"APIHighInvocationRate","monitor":"faas-monitor","service":"gateway","severity":"major"},"commonAnnotations":{"description":"High invocation total on ","summary":"High invocation total on "},"externalURL":"http://c7e6cea35d07:9093","version":"4","groupKey":"{}:{alertname=\"APIHighInvocationRate\", service=\"gateway\"}"}
func_gateway.1.xi8gqsi7wbu6@nuc    | 
func_gateway.1.xi8gqsi7wbu6@nuc    | 2018/07/12 20:38:21 [Scale] function=scale-op-1 1 => 4.
func_gateway.1.xi8gqsi7wbu6@nuc    | 2018/07/12 20:38:21 [Scale] function=scale-op-2 1 => 4.

alexellis commented 6 years ago

I just replayed the same test on GKE with Kubernetes v1.10.5-gke.0 and the steps above. Both functions scaled independently.

[Screenshots: 2018-07-12 21:49:39 and 21:49:36]

So I can't reproduce the issue with the information given, but I'd like to understand what is different about your code, template, environment etc. Is this strictly related to the overridden min/max labels? Do you have connectivity issues with AlertManager? How do the logs look?

alexellis commented 6 years ago

Scaling with custom min/max:

2018/07/12 20:53:13 Alert received.
2018/07/12 20:53:13 {"receiver":"scale-up","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"APIHighInvocationRate","function_name":"scale-op-1","monitor":"faas-monitor","service":"gateway","severity":"major","value":"26.026026026026027"},"annotations":{"description":"High invocation total on \"\"","summary":"High invocation total on \"\""},"startsAt":"2018-07-12T20:53:08.324404523Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus-7b5ff68859-5q7g4:9090/graph?g0.expr=sum+by%28function_name%29+%28rate%28gateway_function_invocation_total%7Bcode%3D%22200%22%7D%5B10s%5D%29%29+%3E+5\u0026g0.tab=1"},{"status":"firing","labels":{"alertname":"APIHighInvocationRate","function_name":"scale-op-2","monitor":"faas-monitor","service":"gateway","severity":"major","value":"14.014014014014014"},"annotations":{"description":"High invocation total on \"\"","summary":"High invocation total on \"\""},"startsAt":"2018-07-12T20:53:08.324404523Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus-7b5ff68859-5q7g4:9090/graph?g0.expr=sum+by%28function_name%29+%28rate%28gateway_function_invocation_total%7Bcode%3D%22200%22%7D%5B10s%5D%29%29+%3E+5\u0026g0.tab=1"}],"groupLabels":{"alertname":"APIHighInvocationRate","service":"gateway"},"commonLabels":{"alertname":"APIHighInvocationRate","monitor":"faas-monitor","service":"gateway","severity":"major"},"commonAnnotations":{"description":"High invocation total on \"\"","summary":"High invocation total on \"\""},"externalURL":"http://alertmanager-68fb464789-5qjfl:9093","version":"4","groupKey":"{}:{alertname=\"APIHighInvocationRate\", service=\"gateway\"}"}

2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.002942 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.003210 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.004360 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.005033 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.006735 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.007108 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.007903 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.007600 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.007750 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-1 - [200] - 0.008380 seconds
2018/07/12 20:53:13 [Scale] function=scale-op-1 3 => 5.
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.002564 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.003464 seconds
2018/07/12 20:53:13 error scaling HTTP code 202, http://127.0.0.1:8081/system/scale-function/scale-op-1
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.007358 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.008132 seconds
2018/07/12 20:53:13 [Scale] function=scale-op-2 3 => 5.
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.007401 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.008908 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.010662 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.010636 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.010718 seconds
2018/07/12 20:53:13 Forwarded [POST] to /function/scale-op-2 - [200] - 0.012177 seconds
2018/07/12 20:53:13 error scaling HTTP code 202, http://127.0.0.1:8081/system/scale-function/scale-op-2
2018/07/12 20:53:13 [error scaling HTTP code 202, http://127.0.0.1:8081/system/scale-function/scale-op-1 error scaling HTTP code 202, http://127.0.0.1:8081/system/scale-function/scale-op-2]
[Screenshots: 2018-07-12 21:53:42 and 21:53:39]

Alex

ivanayov commented 6 years ago

@alexellis Every time a function is invoked from locust, it scales. When I say "only func1 is invoked" in this comment, I mean that it's invoked and scales, but the other function is not invoked at all.

I may have missed those details in the issue, as the discussion is also going on in Slack.

In the comment above, in cases 1 and 2 I reproduce the issue with simple python3 functions (more details about the configuration here), while in cases 3, 4, 5 and 6 I don't. I find no strong relation between the scaling parameters and the test results.

@dkozlov as Alex asked above, the JSON alerts from your gateway logs can help us localize the problem.

dkozlov commented 6 years ago

I need time to recreate my environment with Kubernetes and create similar functions.

@ivanayov Let's consider cases 1 and 2 in more detail. Why do you think the second function is not invoked at all? Is it a connectivity problem, internal OpenFaaS errors, a Kubernetes issue, or a locust issue?

ivanayov commented 6 years ago

Derek add label: support

alexellis commented 6 years ago

We're going to pause our work on this issue until you've completed the work asked for here - https://github.com/openfaas/faas-netes/issues/241#issuecomment-404642631

Thanks,

Alex

alexellis commented 6 years ago

Given that we cannot reproduce the issue in testing and can reproduce the desired behaviour consistently, I'm going to close this issue. Please feel free to keep commenting and to open an issue with the Locust team if you need to.

Alex

dkozlov commented 6 years ago

So I cannot understand why you think @ivanayov reproduced wrong locust behavior. Have you analyzed the traffic from locust? Was locust unable to make concurrent requests?