pkolaczk / latte

Latency Tester for Apache Cassandra
Apache License 2.0

Retries don't get applied #100

Closed. vponomaryov closed this issue 2 months ago

vponomaryov commented 2 months ago

With the latest latte version, execution stops on the first failed query even though retries are configured:

...
         Threads                    1                                                                 
     Connections                    1                                                                 
     Concurrency     [req]        128                                                                 
        Max rate    [op/s]                                                                            
          Warmup       [s]                                                                            
              └─      [op]          1                                                                 
        Run time       [s]       20.0                                                                 
              └─      [op]                                                                            
        Sampling       [s]        1.0                                                                 
              └─      [op]                                                                            
 Request timeout       [s]          5                                                                 
         Retries                   10                                                                 
    ├─ min delay      [ms]        100                                                                 
    └─ max delay      [ms]       5000                                                                 

LOG ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
    Time    Cycles    Errors    Thrpt.     ────────────────────────────────── Latency [ms/op] ──────────────────────────────
     [s]      [op]      [op]    [op/s]             Min        25        50        75        90        99      99.9       Max
   1.001     16169         0     16161             2.5       5.1       6.7       9.6      12.8      16.4      19.8      21.5
   2.002     15580         0     15565             2.0       5.4       7.2       9.8      12.6      16.5      17.8      19.1
   3.001     15908         0     15922             2.1       5.2       6.9       9.8      12.4      16.6      18.7      19.1
   4.000     14899         0     14903             2.2       5.4       7.5      10.5      13.5      19.9      26.7      28.6
   5.001      4812         0      4807             2.3       7.8      13.7      35.1      68.6     114.2     168.3     169.3
   6.002      2958         0      2956             2.7      14.6      29.6      67.2      88.2     175.1     177.3     177.5
   7.001      5190         0      5197             2.3       7.0      11.8      29.5      64.1      98.7     122.8     124.8
   7.691      2610         1      3782             2.1       8.7      17.0      60.2      82.2     179.2     186.3     186.4
error: Cassandra error: Failed to execute query "INSERT INTO data.property(name, value) VALUES (:name, :value)  USING TIMESTAMP :client_ts" with params [Text("gdvgvpzlndxvqfpjdocj"), Text("gdvgvpzlndxvqfpjdocjmqkbdjijgu"), BigInt(1723026298)]: Invalid message: Frame error: early eof

In the example above, I brought down 1 of 3 DB nodes while running with CL=ALL. With 10 retries configured, I expected all of them to be applied (and to last long enough to survive until that node comes back to the UN state). Instead, the latte stress run crashed.

This looks like a direct consequence of the following change: https://github.com/pkolaczk/latte/commit/8cbbe2b510903155bb01f55525ec8c4a402bcac8

pkolaczk commented 2 months ago

The problem with the previous behavior was that it kept retrying even on obvious user errors, such as an invalid query string in the script. And because the default retry count was high, latte looked as if it had frozen.

I think in this case we need to add a parameter for controlling retries to select whether we want to retry only on timeout / overload errors (current behavior) or on all errors.
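
A minimal sketch of what such a parameter could look like, in Rust since that is the language latte is written in. Every name here (`ErrorKind`, `RetryOn`, `should_retry`) is hypothetical and made up for illustration; none of it is latte's actual code or CLI. Classifying the "early eof" frame error from the report as a broken connection is likewise an assumption.

```rust
/// Hypothetical error categories. latte's real error type is not shown
/// in this thread, so these variants are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    Timeout,
    Overloaded,
    /// Assumed category for errors such as "Frame error: early eof".
    ConnectionBroken,
    /// User errors, e.g. an invalid query string in the script.
    InvalidQuery,
}

/// Hypothetical setting that selects which errors are retried.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RetryOn {
    /// Retry only on timeout / overload errors (the behavior described as current).
    TimeoutAndOverload,
    /// Retry on every error, including user errors.
    AllErrors,
}

/// Returns true if a failed query should be attempted again.
/// `attempt` is the number of retries already performed for this query.
fn should_retry(policy: RetryOn, kind: ErrorKind, attempt: u32, max_retries: u32) -> bool {
    if attempt >= max_retries {
        return false; // retry budget exhausted
    }
    match policy {
        RetryOn::AllErrors => true,
        RetryOn::TimeoutAndOverload => {
            matches!(kind, ErrorKind::Timeout | ErrorKind::Overloaded)
        }
    }
}
```

Under these assumptions, `should_retry(RetryOn::TimeoutAndOverload, ErrorKind::ConnectionBroken, 0, 10)` is false, which corresponds to the immediate stop in the report, while the same call with `RetryOn::AllErrors` is true until the 10 retries are used up. The trade-off is the one described above: `AllErrors` also retries `InvalidQuery`, so a script bug would again be retried up to the configured count.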