scylladb / cassandra-stress

Error handling on thread level #29

Open dkropachev opened 2 days ago

dkropachev commented 2 days ago

When the test starts, c-s spins up threads to generate load; you can control the number of threads with -rate threads=X. Unless -errors ignore is provided, a thread is killed after it reaches a certain number of errors.

The problem is that c-s does not stop the test when a thread is killed, even if all of them are killed.

We recently merged https://github.com/scylladb/cassandra-stress/pull/26, which changed c-s behavior to the following:

  1. If -errors fail-fast is provided, the whole run fails after the first thread is killed.
  2. Otherwise, it keeps waiting until all the threads are killed.
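To make the two rules above concrete, here is a minimal sketch of the stop decision as a pure function. The class and method names (`StopDecisionSketch`, `shouldStopRun`) are hypothetical and are not taken from the c-s code base or from the PR; this only models the behavior as described in this comment.

```java
// Hypothetical sketch of the stop decision described above; not the actual c-s code.
final class StopDecisionSketch {

    /**
     * Decides whether the whole run should stop.
     *
     * @param failFast      true when "-errors fail-fast" was provided
     * @param killedThreads number of load threads killed so far
     * @param totalThreads  number of load threads started (-rate threads=X)
     */
    static boolean shouldStopRun(boolean failFast, int killedThreads, int totalThreads) {
        if (killedThreads <= 0) {
            return false; // nothing has been killed, keep running
        }
        if (failFast) {
            return true; // rule 1: the first killed thread fails the run
        }
        // rule 2: keep waiting until every thread has been killed
        return killedThreads >= totalThreads;
    }

    public static void main(String[] args) {
        System.out.println(shouldStopRun(true, 1, 4));  // fail-fast, one thread killed
        System.out.println(shouldStopRun(false, 3, 4)); // no fail-fast, some threads still alive
        System.out.println(shouldStopRun(false, 4, 4)); // no fail-fast, all threads killed
    }
}
```

Written this way, the in-between state is easy to see: without fail-fast, a run with 3 of 4 threads killed keeps going at a quarter of the requested concurrency.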

Now, in my book there are only three correct behaviors with regard to thread handling:

  1. Fail the test when the first thread is killed. Currently works only with -errors fail-fast.
  2. Do not kill a thread when an error is encountered; keep it running. Currently works only if -errors ignore is provided.
  3. Fail all threads at once when they reach the error state, but keep them running until that point. Not implemented currently.
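As a starting point for the discussion, the three options could be modeled as a policy enum, roughly as sketched below. All names here (`ThreadErrorPolicySketch`, `Policy`, `Action`, `onErrorLimitReached`) are hypothetical and do not exist in c-s. The handling of option 3 is only one possible reading, namely that threads keep running after hitting their error limit and the run fails once every thread has hit it; that interpretation is an assumption and is open to correction.

```java
// Hypothetical sketch of the three options above; not the actual c-s code.
final class ThreadErrorPolicySketch {

    enum Policy { FAIL_ON_FIRST, KEEP_RUNNING, FAIL_TOGETHER }

    enum Action { STOP_RUN, CONTINUE }

    /**
     * Called when a thread reaches its error limit.
     *
     * @param policy         the selected error-handling policy
     * @param threadsAtLimit number of threads that have reached the error limit, including this one
     * @param totalThreads   number of load threads started
     */
    static Action onErrorLimitReached(Policy policy, int threadsAtLimit, int totalThreads) {
        switch (policy) {
            case FAIL_ON_FIRST:
                return Action.STOP_RUN; // option 1: first failing thread fails the test
            case KEEP_RUNNING:
                return Action.CONTINUE; // option 2: errors never stop a thread
            case FAIL_TOGETHER:
                // option 3 (assumed reading): keep every thread running until all
                // of them have reached the error state, then fail them together
                return threadsAtLimit >= totalThreads ? Action.STOP_RUN : Action.CONTINUE;
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }
}
```

In none of the three policies does a single thread stop on its own while the others continue, which is the in-between state this issue is trying to rule out.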

Anything in between I see as unwanted behavior that can lead to unexpected, unpredictable, or inconsistent results. With this issue I want to start a discussion to decide what the correct behavior is, and eventually implement it.

dkropachev commented 2 days ago

@CodeLieutenant , @Bouncheck , @fruch, please take a look

CodeLieutenant commented 20 hours ago

3. Fail all threads at once when they reach the error state, but keep them running until that point. Not implemented currently.

I don't really understand this point; to me it looks like fail-fast: kill all other threads when one thread reaches the error state. Am I reading it correctly, or does it mean to keep running and kill threads one by one as they encounter errors?

We also need to decide what the default behaviour should be when -errors is not passed in, and what the current state of the system is.