Open JosefWN opened 3 years ago
Hi, we increased it in https://github.com/uber/cadence/pull/3753 and it was released in https://github.com/uber/cadence/releases/tag/v0.17.0
Please check it out and let us know if that meets your expectations. :D
Ah, looks good, missed that one. I can close this issue then!
Unfortunately, this issue is still causing problems for me. I am running Cadence locally using the docker-compose.yml auto setup, and I have recently upgraded all my Docker images to the latest versions. After running a workflow and then navigating to my domain's workflows in the Cadence GUI, I get the following error: Persistence Max QPS Reached for List Operations. This can be fixed by creating a custom dynamic config file, mounting the folder it is located in into the container, and changing the DYNAMIC_CONFIG_FILE_PATH environment variable of the Cadence server container to point at the corresponding file.
I would have expected this to no longer be necessary?
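For context, the workaround dynamic config file is just a small YAML file mounted into the server container. A minimal sketch (the file name matches the DYNAMIC_CONFIG_FILE_PATH value mentioned above; the limit values are illustrative):

```yaml
# ./config/development.yaml — mounted into the container and referenced via
# DYNAMIC_CONFIG_FILE_PATH=custom-config/development.yaml
frontend.visibilityListMaxQPS:
- value: 10000
frontend.esVisibilityListMaxQPS:
- value: 10000
```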
@frtelg I have tested with my master-auto-setup image (updated two days ago) and it works fine.
here is my image info:
$docker images ubercadence/server:master-auto-setup
REPOSITORY TAG IMAGE ID CREATED SIZE
ubercadence/server master-auto-setup b68016029480 2 days ago 351MB
So make sure you upgrade the ubercadence/server:master-auto-setup
image by running the command:
docker pull ubercadence/server:master-auto-setup
Like we said in https://github.com/uber/cadence/tree/master/docker#using-a-released-image (we probably should make it clearer), ubercadence/server:master-auto-setup
is an image that changes all the time (every minute/hour), based on the commits on our master branch. You can use a released image if you want something stable.
@longquanzheng I have pulled the latest version of the image before testing it. I will test it again tomorrow, maybe I was still using an older version after all. I'll let you know.
@frtelg I just updated my image to the current latest and it still works:
$docker images ubercadence/server:master-auto-setup
REPOSITORY TAG IMAGE ID CREATED SIZE
ubercadence/server master-auto-setup 7890845d4a29 17 hours ago 351MB
I only ran the helloworld sample. So can you also check if your workflow is calling the List API?
I have tested again, using the following steps:
Then I start my application. It is a basic Spring Boot application, and my workflow is a HelloWorld kind of workflow in this case:
public interface GreetingWorkflow {
    String TASK_LIST = "Example";

    @WorkflowMethod(executionStartToCloseTimeoutSeconds = 360, taskList = TASK_LIST)
    void greet();

    @SignalMethod
    void changeName(String name);

    @SignalMethod
    void terminate();

    @QueryMethod
    String getCurrentName();
}
This is the cadence server docker:
376996c5e71c ubercadence/server:master-auto-setup "/docker-entrypoint.…" 7 minutes ago Up 7 minutes 0.0.0.0:7933-7935->7933-7935/tcp, 0.0.0.0:7939->7939/tcp cadence_cadence_1
After this, I stop the containers and the application, and I add the following to the deployment.yaml file:
frontend.visibilityListMaxQPS:
- value: 10000
frontend.esVisibilityListMaxQPS:
- value: 10000
When I then retest, the GUI works as expected.
You can check out the application if you want, it is in my github: https://github.com/frtelg/cadence-spring-boot.
I can't reproduce it. For a while I saw it and thought it was an issue in the WebUI, but then I could not reproduce it anymore...
@frtelg do the released docker-compose files help?
@longquanzheng it is not really clear to me what files you are referring to. I have used the default docker-compose from the cadence project. My docker-compose file looks like this:
version: '3'
services:
  cassandra:
    image: cassandra:3.11
    ports:
      - "9042:9042"
  statsd:
    image: graphiteapp/graphite-statsd
    ports:
      - "8080:80"
      - "2003:2003"
      - "8125:8125"
      - "8126:8126"
  cadence:
    image: ubercadence/server:master-auto-setup
    ports:
      - "7933:7933"
      - "7934:7934"
      - "7935:7935"
      - "7939:7939"
    environment:
      - "CASSANDRA_SEEDS=cassandra"
      - "STATSD_ENDPOINT=statsd:8125"
      - "DYNAMIC_CONFIG_FILE_PATH=custom-config/development.yaml"
    depends_on:
      - cassandra
      - statsd
    volumes:
      - "./config:/etc/cadence/custom-config"
  cadence-web:
    image: ubercadence/web:latest
    environment:
      - "CADENCE_TCHANNEL_PEERS=cadence:7933"
    ports:
      - "8088:8088"
    depends_on:
      - cadence
@frtelg I understand this is annoying. I have opened a PR: https://github.com/uber/cadence/pull/4138
and also built an image so that you can test before the PR lands:
ubercadence/qlong-server:master-04-15-2021-auto-setup
Can you try using it and set the log level to debug
to see why the requests are rate limited?
(default log level is info: https://github.com/uber/cadence/blob/d0a8f7e6a9297bd898ea5b10ded20f0277a8980f/docker/config_template.yaml#L3 )
And let me know when you see the debug logs like
{"level":"debug","ts":"2021-04-15T23:30:50.086-0700","msg":"List API request consumed QPS token","service":"cadence-frontend","wf-domain-name":"samples-domain","name":"github.com/uber/cadence/common/persistence.(*visibilitySamplingClient).ListClosedWorkflowExecutions","logging-call-at":"visibilitySamplingClient.go:328"}
and
{"level":"debug","ts":"2021-04-15T19:00:21.956-0700","msg":"List API request is being sampled","service":"cadence-frontend","wf-domain-name":"samples-domain","name":"github.com/uber/cadence/common/persistence.(*visibilitySamplingClient).ListClosedWorkflowExecutions","logging-call-at":"visibilitySamplingClient.go:326"}
If they are not from your application, that will give us a clue about how to fix it.
@longquanzheng the supplied image version is not working:
cadence_1 | 2021/04/19 06:55:02 gocql: unable to dial control conn 172.24.0.2: dial tcp 172.24.0.2:9042: connect: connection refused
cadence_1 | 2021/04/19 06:55:02 cassandra schema version compatibility check failed: unable to create CQL Client: gocql: unable to create session: control: unable to connect to initial hosts: dial tcp 172.24.0.2:9042: connect: connection refused
This is my docker-compose.yml:
version: '3'
services:
  cassandra:
    image: cassandra:3.11
    ports:
      - "9042:9042"
  statsd:
    image: graphiteapp/graphite-statsd
    ports:
      - "8080:80"
      - "2003:2003"
      - "8125:8125"
      - "8126:8126"
  cadence:
    image: ubercadence/qlong-server:master-04-15-2021-auto-setup
    ports:
      - "7933:7933"
      - "7934:7934"
      - "7935:7935"
      - "7939:7939"
    environment:
      - "CASSANDRA_SEEDS=cassandra"
      - "STATSD_ENDPOINT=statsd:8125"
      # - "DYNAMIC_CONFIG_FILE_PATH=custom-config/development.yaml"
      - "LOG_LEVEL=debug"
    depends_on:
      - cassandra
      - statsd
    volumes:
      - "./config:/etc/cadence/custom-config"
  cadence-web:
    image: ubercadence/web:latest
    environment:
      - "CADENCE_TCHANNEL_PEERS=cadence:7933"
    ports:
      - "8088:8088"
    depends_on:
      - cadence
@frtelg Sorry, that error was totally my bad when building the customized image: I forgot to add the auto-setup argument. I happened to have a local Cassandra running on my laptop at the time, so I didn't catch it.
Can you try this one:
ubercadence/qlong-server:master-04-20-2021-auto-setup
LMK. Thanks
@frtelg I finally reproduced this stably myself. Will work on fixing it.
^ I think I have root-caused the issue. I got the repro because I updated my web image.
There is a change in the WebUI which always makes 2 requests on the default page, so that it can show both open and closed workflows. However, our rate limiting has a bucket size of only 1, even though the refill rate is 10. So it rejects requests at a very fast rate. Note that this is mostly an issue only in local docker-compose. To mitigate, users can select the closed or open view themselves and ignore the error for now.
There is a change in the WebUI such that by default it tries to fetch both open and closed workflows, so the default page has to make at least two List requests.
However, it looks like the rate limiting doesn't work as we expected, or we didn't configure it correctly. Even though MaxQPS defaults to 10, that is only the refill rate; it doesn't allow 2 requests at the same time, because the bucket size is numOfPriority, which is only 1 for the List API: https://github.com/uber/cadence/blob/70031dedfef4e13679a09ff8628c395b37c3d37b/common/persistence/visibilitySamplingClient.go#L178
In other words, we are using a token bucket as a leaky bucket. There are a couple of ways to fix this. To mitigate in the meantime, users can select the closed or open view themselves and ignore the error for now.
@just-at-uber Do you think we can implement retry logic in the WebUI? I think it's useful in many ways. Even if we could add some initial bucket size configuration for the rate limiting, it would still be good to have some retry in the WebUI when talking to the Cadence frontend.
Thanks! Great that you managed to find the bug. I did not find the time yet to retest it.
I think retry logic is good to have anyway for this screen, in case the API fails. Ideally the server should handle a higher load by default.
@just-at-uber yeah, I agree that the server should also improve. I took a look, but currently none of the rate limiting in the server allows any bursting. It may take more effort to introduce it (and new configuration as well).
See misplaced issue: https://github.com/uber/cadence-web/issues/227