Add possibility to configure retry policy

vponomaryov commented 2 years ago

For the moment we cannot tell 'scylla-bench' to make retries if some request fails. So, add such possibility reusing 'RetryPolicy' mechanism from the 'gocql' package.

Following new options are added:

-retry-number (int), default values is '0' - no retries (status quo)
-retry-interval (str), default value is '1s'. Examples: a) '100ms' - linear retrying with the 100 millisecond interval. b) '1ms,2s' (same as '1ms,2') - exponential retrying starting with 1 millisecond and growing to 2 seconds at max if number of retries allows it.

fruch commented 2 years ago

@vponomaryov, you think we take this for a ride in SCT ? change the install to be from your branch ? there are use cases we will want to see the impact of this.

vponomaryov commented 2 years ago

@vponomaryov, you think we take this for a ride in SCT ? change the install to be from your branch ? there are use cases we will want to see the impact of this.

Why my branch? If no concerns, then merge PR, create new tag/version and update the scylla_bench_version config option in SCT. That's all. Then we will be able to use these new config options in the SCT.

fruch commented 2 years ago

@vponomaryov, you think we take this for a ride in SCT ? change the install to be from your branch ? there are use cases we will want to see the impact of this.

Why my branch? If no concerns, then merge PR, create new tag/version and update the scylla_bench_version config option in SCT. That's all. Then we will be able to use these new config options in the SCT.

Mainly cause i'm not an admin on this repo, plus we want to make sure it's would do the effect we anticipate for it.

try taking longevity-large-partition-8h or longevity-large-partition-asymmetric-cluster-3h for a run with this:

diff --git i/sdcm/cluster.py w/sdcm/cluster.py
index 0e2f349a..459e56ac 100644
--- i/sdcm/cluster.py
+++ w/sdcm/cluster.py
@@ -494,7 +494,7 @@ class BaseNode(AutoSshContainerMixin, WebDriverContainerMixin):  # pylint: disab
             echo 'export GOPATH=$HOME/go' >> $HOME/.bash_profile
             echo 'export PATH=$PATH:/usr/local/go/bin' >> $HOME/.bash_profile
             source $HOME/.bash_profile
-            curl -Lo sb.zip https://github.com/scylladb/scylla-bench/archive/refs/{sb_version}.zip
+            curl -Lo sb.zip https://github.com/vponomaryov/scylla-bench/archive/refs/heads/add-retry.zip
             unzip sb.zip
             cd ./scylla-bench-*
             GO111MODULE=on go install .

vponomaryov commented 2 years ago

Test job where this PR was applied: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/longevity-large-partition-asymmetric-cluster-3h/2/console

roydahan commented 2 years ago

@vponomaryov if I understand the code correctly, it supports now the retry but it doesn't prevent the current errors we get for timeouts. What we want is that the current timeouts will be "warnings" but if all retries fails, we want a error (that SCT may treat it as a critical event - like we have in c-s).

vponomaryov commented 2 years ago

@vponomaryov if I understand the code correctly, it supports now the retry but it doesn't prevent the current errors we get for timeouts. What we want is that the current timeouts will be "warnings" but if all retries fails, we want a error (that SCT may treat it as a critical event - like we have in c-s).

There is separate option for timeout -> --timeout. Just configure values you need.

roydahan commented 2 years ago

@vponomaryov if I understand the code correctly, it supports now the retry but it doesn't prevent the current errors we get for timeouts. What we want is that the current timeouts will be "warnings" but if all retries fails, we want a error (that SCT may treat it as a critical event - like we have in c-s).

There is separate option for timeout -> --timeout. Just configure values you need.

This is not enough to answer the requirement.

vponomaryov commented 2 years ago

@vponomaryov if I understand the code correctly, it supports now the retry but it doesn't prevent the current errors we get for timeouts. What we want is that the current timeouts will be "warnings" but if all retries fails, we want a error (that SCT may treat it as a critical event - like we have in c-s).

There is separate option for timeout -> --timeout. Just configure values you need.

This is not enough to answer the requirement.

If you mean non-0 exit of a loader than it is different kind of a task - add one option about behavior on such an error. Right now it just logs errors and moves forward.

roydahan commented 2 years ago

I'm not sure I'm following, let's discuss it.

roydahan commented 2 years ago

@vponomaryov if needed help with defining a simple test to test it, please let me know.

vponomaryov commented 2 years ago

@vponomaryov if needed help with defining a simple test to test it, please let me know.

It was tested lots of times in multiple test runs. The only nuance here is that there are no means to make sure it was really applied by gocql. The details about the nuance are here: https://github.com/scylladb/gocql/issues/101

From my side - we can merge this one, because here we utilize the gocql functionality and if retry doesn't work then it is bug of gocql and not scylla-bench.

vponomaryov commented 2 years ago

Recent update: First commit makes retry logic be working correctly. Second commit enables the possibility to use retry logic.

Full workability of the retry logic requires also following fix to the gocql project: https://github.com/scylladb/gocql/pull/104 It means that, after it's merge, we should wait for the release with this fix and then switch scylla-bench to it.

roydahan commented 2 years ago

Do we want for the gocql PR to be merged before we merge this one?

vponomaryov commented 2 years ago

Do we want for the gocql PR to be merged before we merge this one?

No need in it. The fix of gocql will be applied by the update of the gocql version used by scylla-bench. And it will be separate change.

fruch commented 2 years ago

@roydahan, this is tested and confirmed working, and can be merged from my POV, both me and and @vponomaryov tested it.

fruch commented 2 years ago

@roydahan can you grant us permissions, we can move forward with some of the other fixes, and release/tag a new version ? so you can take them to SCT ?

scylladb / scylla-bench

Add possibility to configure retry policy #96