uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine for executing asynchronous long-running business logic in a resilient way.
https://cadenceworkflow.io
MIT License

Error of Persistence Max QPS Reached for List Operations #3900

Open · JosefWN opened 3 years ago

JosefWN commented 3 years ago

See misplaced issue: https://github.com/uber/cadence-web/issues/227

longquanzheng commented 3 years ago

Hi, we increased the limit in https://github.com/uber/cadence/pull/3753 and released it in https://github.com/uber/cadence/releases/tag/v0.17.0.

Please check it out and let us know if it meets your expectations. :D

JosefWN commented 3 years ago

Ah, looks good, missed that one. I can close this issue then!

frtelg commented 3 years ago

Unfortunately, this issue is still causing problems for me. I am running Cadence locally using the docker-compose.yml auto setup. I have recently upgraded all my Docker images to the latest versions. After running a workflow and then navigating to my domain's workflows in the Cadence GUI, I get the following error: Persistence Max QPS Reached for List Operations. This is fixed by creating a custom dynamic config file, mounting the folder it is located in into the container, and changing the DYNAMIC_CONFIG_FILE_PATH environment variable of the Cadence server container to point to that file.

I would have expected this to no longer be necessary?

longquanzheng commented 3 years ago

@frtelg I have tested with my master-auto-setup image (updated two days ago) and it works fine. Here is my image info:

$docker images ubercadence/server:master-auto-setup
REPOSITORY           TAG                 IMAGE ID       CREATED      SIZE
ubercadence/server   master-auto-setup   b68016029480   2 days ago   351MB

So make sure you upgrade the ubercadence/server:master-auto-setup image by running the command: docker pull ubercadence/server:master-auto-setup

As we said in https://github.com/uber/cadence/tree/master/docker#using-a-released-image (we probably should make it clearer), ubercadence/server:master-auto-setup is a constantly changing image, rebuilt from the latest commit on our master branch. You can use a released image if you want something stable.

frtelg commented 3 years ago

@longquanzheng I had pulled the latest version of the image before testing it. I will test it again tomorrow; maybe I was still using an older version after all. I'll let you know.

longquanzheng commented 3 years ago

@frtelg I just updated my image to the current latest and it still works:

$docker images ubercadence/server:master-auto-setup
REPOSITORY           TAG                 IMAGE ID       CREATED        SIZE
ubercadence/server   master-auto-setup   7890845d4a29   17 hours ago   351MB

I only ran the helloworld sample, so can you also check whether your workflow is calling the List API?

frtelg commented 3 years ago

I have tested again, using the following steps:

  1. docker pull ubercadence/server:master-auto-setup
  2. docker-compose up
  3. Then I start my application. It is a basic Spring Boot application, and my workflow is a HelloWorld-style workflow in this case:

    public interface GreetingWorkflow {
        String TASK_LIST = "Example";

        @WorkflowMethod(executionStartToCloseTimeoutSeconds = 360, taskList = TASK_LIST)
        void greet();

        @SignalMethod
        void changeName(String name);

        @SignalMethod
        void terminate();

        @QueryMethod
        String getCurrentName();
    }
  4. The application initializes the WorkflowService, WorkflowClient, and WorkerFactory using Spring Beans and then starts the Workers.
  5. The workflow is then started through a REST call. I don't think the List API is involved in any of this?
  6. Then I check the Cadence GUI and unfortunately get the error: Persistence Max QPS Reached for List Operations.

This is the Cadence server container:

376996c5e71c   ubercadence/server:master-auto-setup   "/docker-entrypoint.…"   7 minutes ago   Up 7 minutes   0.0.0.0:7933-7935->7933-7935/tcp, 0.0.0.0:7939->7939/tcp                                                                                     cadence_cadence_1

After this, I stop the containers and the application, and I add the following to the development.yaml file:

frontend.visibilityListMaxQPS:
- value: 10000
frontend.esVisibilityListMaxQPS:
- value: 10000

When I then retest, the GUI works as expected.

You can check out the application if you want; it is on my GitHub: https://github.com/frtelg/cadence-spring-boot.

longquanzheng commented 3 years ago

I can't reproduce it. For a while I saw it and thought it was an issue in the WebUI, but then I couldn't reproduce it anymore...

longquanzheng commented 3 years ago

@frtelg do the released docker-compose files help?

frtelg commented 3 years ago

@longquanzheng it is not really clear to me which files you are referring to. I have used the default docker-compose from the Cadence project. My docker-compose file looks like this:

version: '3'
services:
  cassandra:
    image: cassandra:3.11
    ports:
      - "9042:9042"
  statsd:
    image: graphiteapp/graphite-statsd
    ports:
      - "8080:80"
      - "2003:2003"
      - "8125:8125"
      - "8126:8126"
  cadence:
    image: ubercadence/server:master-auto-setup
    ports:
     - "7933:7933"
     - "7934:7934"
     - "7935:7935"
     - "7939:7939"
    environment:
      - "CASSANDRA_SEEDS=cassandra"
      - "STATSD_ENDPOINT=statsd:8125"
      - "DYNAMIC_CONFIG_FILE_PATH=custom-config/development.yaml"
    depends_on:
      - cassandra
      - statsd
    volumes:
      - "./config:/etc/cadence/custom-config"
  cadence-web:
    image: ubercadence/web:latest
    environment:
      - "CADENCE_TCHANNEL_PEERS=cadence:7933"
    ports:
      - "8088:8088"
    depends_on:
      - cadence

longquanzheng commented 3 years ago

@frtelg I understand this is annoying. I have opened PR https://github.com/uber/cadence/pull/4138 and also built an image so that you can test before the PR lands: ubercadence/qlong-server:master-04-15-2021-auto-setup. Can you try using it and set the log level to debug to see why the requests are rate limited? (The default log level is info: https://github.com/uber/cadence/blob/d0a8f7e6a9297bd898ea5b10ded20f0277a8980f/docker/config_template.yaml#L3)

And let me know whether you see debug logs like

{"level":"debug","ts":"2021-04-15T23:30:50.086-0700","msg":"List API request consumed QPS token","service":"cadence-frontend","wf-domain-name":"samples-domain","name":"github.com/uber/cadence/common/persistence.(*visibilitySamplingClient).ListClosedWorkflowExecutions","logging-call-at":"visibilitySamplingClient.go:328"}

and

{"level":"debug","ts":"2021-04-15T19:00:21.956-0700","msg":"List API request is being sampled","service":"cadence-frontend","wf-domain-name":"samples-domain","name":"github.com/uber/cadence/common/persistence.(*visibilitySamplingClient).ListClosedWorkflowExecutions","logging-call-at":"visibilitySamplingClient.go:326"}

If they are not from your application, we will have a clue how to fix it.
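
For reference, the decision behind those two log lines is a plain token-bucket check. The following is a minimal standalone sketch in Go using golang.org/x/time/rate; it only illustrates the pattern (tryList is a made-up helper, and the messages echo the logs above), not Cadence's actual visibilitySamplingClient:

    package main

    import (
        "log"

        "golang.org/x/time/rate"
    )

    // tryList mirrors the decision behind the two log lines: a List request
    // either consumes a token from the bucket or is "sampled" (rejected)
    // before it reaches persistence.
    func tryList(limiter *rate.Limiter, api string) bool {
        if limiter.Allow() {
            log.Printf("List API request consumed QPS token: %s", api)
            return true
        }
        log.Printf("List API request is being sampled: %s", api)
        return false
    }

    func main() {
        // 10 tokens/second refill, bucket size 1.
        limiter := rate.NewLimiter(rate.Limit(10), 1)
        tryList(limiter, "ListOpenWorkflowExecutions")   // consumes the only token
        tryList(limiter, "ListClosedWorkflowExecutions") // bucket empty: sampled
    }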

frtelg commented 3 years ago

@longquanzheng the supplied image is not working:

cadence_1      | 2021/04/19 06:55:02 gocql: unable to dial control conn 172.24.0.2: dial tcp 172.24.0.2:9042: connect: connection refused
cadence_1      | 2021/04/19 06:55:02 cassandra schema version compatibility check failed: unable to create CQL Client: gocql: unable to create session: control: unable to connect to initial hosts: dial tcp 172.24.0.2:9042: connect: connection refused

This is my docker-compose.yml:

version: '3'
services:
  cassandra:
    image: cassandra:3.11
    ports:
      - "9042:9042"
  statsd:
    image: graphiteapp/graphite-statsd
    ports:
      - "8080:80"
      - "2003:2003"
      - "8125:8125"
      - "8126:8126"
  cadence:
    image: ubercadence/qlong-server:master-04-15-2021-auto-setup
    ports:
     - "7933:7933"
     - "7934:7934"
     - "7935:7935"
     - "7939:7939"
    environment:
      - "CASSANDRA_SEEDS=cassandra"
      - "STATSD_ENDPOINT=statsd:8125"
#      - "DYNAMIC_CONFIG_FILE_PATH=custom-config/development.yaml"
      - "LOG_LEVEL=debug"
    depends_on:
      - cassandra
      - statsd
    volumes:
      - "./config:/etc/cadence/custom-config"
  cadence-web:
    image: ubercadence/web:latest
    environment:
      - "CADENCE_TCHANNEL_PEERS=cadence:7933"
    ports:
      - "8088:8088"
    depends_on:
      - cadence

longquanzheng commented 3 years ago

@frtelg Sorry, that error was entirely my fault when building the customized image: I forgot to add the auto-setup argument. I happened to have a local Cassandra running on my laptop at the time, so I didn't catch it.

Can you try this one: ubercadence/qlong-server:master-04-20-2021-auto-setup

LMK. Thanks

longquanzheng commented 3 years ago

@frtelg I finally reproduced this stably myself. Will work on fixing it.

[Two screenshots attached: "Screen Shot 2021-04-24 at 3.42.32 PM" and "Screen Shot 2021-04-24 at 3.42.42 PM"]

longquanzheng commented 3 years ago

^ I think I have root-caused the issue. I believe I got it to repro because I updated my web image.

TL;DR

There is a change in the WebUI that makes the default page always issue 2 requests, so that it can show both open and closed workflows. However, our rate limiting uses a bucket size of 1, even though the refill rate is 10, so it rejects simultaneous requests very quickly. Note that this is mostly only an issue in local docker-compose. To mitigate, users can select the closed or open view themselves and ignore the error for now.


There is a change in the WebUI such that by default it tries to fetch both open and closed workflows, so the default page has to make at least two List requests.

However, it looks like the rate limiting doesn't work as we expected, or we didn't configure it correctly. Even though MaxQPS defaults to 10, that is only the refill rate; the bucket holds a single token, so it doesn't allow 2 requests at the same time. There are a couple of ways to fix this:

To mitigate, users can select the closed or open view themselves and ignore the error for now.

@just-at-uber Do you think we can implement retry logic in the WebUI? I think it's useful in many ways. Even if we add some initial bucket size configuration for rate limiting, it's still good to have some retry in the WebUI when talking to the Cadence Frontend.
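
To make the bucket-size arithmetic concrete, here is a minimal standalone Go sketch using golang.org/x/time/rate (an assumption about the limiter semantics, not Cadence's exact code): with a refill rate of 10 but a bucket of size 1, the second of two simultaneous requests is rejected, while a bucket of size 2 admits both.

    package main

    import (
        "fmt"

        "golang.org/x/time/rate"
    )

    func main() {
        // Refill rate 10 tokens/second, bucket size (burst) 1.
        limiter := rate.NewLimiter(rate.Limit(10), 1)
        fmt.Println("open:", limiter.Allow())   // true: takes the only token
        fmt.Println("closed:", limiter.Allow()) // false: next token is ~100ms away

        // With a bucket size of 2, the default page's two List requests
        // would both be admitted.
        burstier := rate.NewLimiter(rate.Limit(10), 2)
        fmt.Println("open:", burstier.Allow())   // true
        fmt.Println("closed:", burstier.Allow()) // true
    }

This also explains why the workaround of raising frontend.visibilityListMaxQPS helps: a huge refill rate makes the single token reappear almost instantly, while a bucket size above 1 would address the simultaneous-request case directly.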

frtelg commented 3 years ago

Thanks! Great that you managed to find the bug. I have not yet found the time to retest it.

just-at-uber commented 3 years ago

I think retry logic is good to have for this screen anyway, in case the API fails. Ideally the server should handle a higher load by default.

longquanzheng commented 3 years ago

@just-at-uber Yeah, I agree that the server should also improve. I took a look, but currently none of the rate limiting in the server allows any bursting. It may take more effort to introduce it (and new configuration).
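
For illustration, a sketch of what introducing burst support could look like, again assuming a golang.org/x/time/rate-style limiter; newListLimiter and the maxBurst knob are hypothetical and do not exist in Cadence's configuration today:

    package example

    import "golang.org/x/time/rate"

    // newListLimiter is a hypothetical constructor: maxQPS is the existing
    // refill rate, while maxBurst is a made-up second knob that would size
    // the bucket so simultaneous requests (like the WebUI's open + closed
    // List calls) can both be admitted.
    func newListLimiter(maxQPS, maxBurst int) *rate.Limiter {
        if maxBurst < 1 {
            maxBurst = 1 // never allow a zero-capacity bucket
        }
        return rate.NewLimiter(rate.Limit(maxQPS), maxBurst)
    }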