amoskong closed this issue 7 years ago
@lmr can you help review this issue? It looks like a framework (Avocado/SCT) bug or a c-s issue, not a Scylla issue.
Avocado version on the slave machine: 36.0 (installed via pip).
I just upgraded it to 36.3 and reported an issue: https://github.com/avocado-framework/avocado/issues/1801 ("PIP: pip doesn't has latest 36.3lts" #1801)
In another job, the c-s client raised 30 java.io.IOException errors (Operation x10 on key(s) ...) between 2017-02-19 01:47:40,704 and 2017-02-19 01:47:40,813.
[stdout] java.io.IOException: Operation x10 on key(s) [363433384f4c37363830]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)
[stdout]
[stdout] at org.apache.cassandra.stress.Operation.error(Operation.java:216)
[stdout] at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:188)
[stdout] at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:99)
[stdout] at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:107)
[stdout] at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:259)
[stdout] at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:309)
It also raised thousands of UnavailableExceptions: [stdout] com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)
The c-s client stopped writing data to the cluster at that time; the Loader per server and Served Requests per server graphs went down to zero. The c-s process might have exited at that point.
In this job, the Nemesis process worked well; it kept switching between different Monkeys.
@kongove, I don't think the c-s exception is related to SCT. I've seen it several times during 1TB longevity runs and it appears to be c-s behaviour (although I haven't verified that on Cassandra yet).
It happens when c-s gets timeouts from Scylla: once 10 timeouts per key are reached, we get this exception. Usually it happens when two nodes are down at the same time while cl=QUORUM and rf=3.
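For context, a minimal sketch of the quorum arithmetic behind that error message (illustrative only, not SCT or c-s code): with RF=3, QUORUM needs floor(RF/2)+1 = 2 live replicas, so a single surviving node can't satisfy the consistency level.

```python
# Quorum arithmetic behind the UnavailableException (illustrative only).
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for CL=QUORUM."""
    return replication_factor // 2 + 1

rf = 3
alive = 1              # two of the three replicas are down
required = quorum(rf)  # 3 // 2 + 1 == 2
print(f"required={required}, alive={alive}, ok={alive >= required}")
# required=2, alive=1, ok=False -> "2 required but only 1 alive"
```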
@roydahan Yes, in the latest job (with avocado 36.3) we got the same errors from c-s, while Avocado & SCT worked well.
Do we need to keep watching the c-s client output, wait for some time, and reset/restart the c-s client in the error case?
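If we go that way, a minimal watchdog sketch could look like the following (hypothetical, not SCT code; the stress command line and node IP are placeholders):

```python
# Hypothetical watchdog (not SCT code): restart cassandra-stress if it exits
# with an error. The stress command and node IP below are placeholders.
import subprocess
import time

STRESS_CMD = ["cassandra-stress", "write", "cl=QUORUM", "-node", "10.0.0.1"]

def run_with_restart(max_restarts=3, backoff=60):
    restarts = 0
    while True:
        ret = subprocess.call(STRESS_CMD)   # blocks until c-s exits
        if ret == 0 or restarts >= max_restarts:
            return ret
        restarts += 1
        print(f"c-s exited with {ret}, restarting in {backoff}s "
              f"({restarts}/{max_restarts})")
        time.sleep(backoff)
```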
SSH connections might have problems after the longevity test has run for 6+ hours. There is also a big rsync timeout in send_files, or rather no timeout at all on command execution in send_files(). This blocked the test and might have caused c-s to exit.
After fixing some issues with master ssh and the timeout in send_files(), c-s no longer exits: https://github.com/scylladb/scylla-cluster-tests/pull/251/
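For reference, a minimal sketch of the idea of putting a hard timeout on the file transfer (illustrative only, not the actual send_files() implementation; paths are placeholders):

```python
# Illustrative only, not the real send_files(): run rsync with a hard timeout
# so a hung SSH connection cannot block the whole test. Paths are placeholders.
import subprocess

def send_files(src, dst, timeout=600):
    cmd = ["rsync", "-az", "-e", "ssh -o ServerAliveInterval=30", src, dst]
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # fail loudly instead of hanging the job
        raise RuntimeError(f"rsync to {dst} timed out after {timeout}s")
```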
Current problem: we can't see c-s output in the jenkins console or log after the job has run for 6+ hours, but grafana data displays well and the c-s client is still alive on the loader instance. It is probably caused by a broken pipe on the master ssh connection; I'm troubleshooting it.
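One thing worth checking for the broken-pipe symptom is SSH keepalives on the long-lived session. A minimal sketch of the idea (not SCT's remote layer; host, user and key path are placeholders):

```python
# Illustrative only: keep a long-lived SSH session alive with keepalive packets.
# Host, user and key path are placeholders; SCT's own remote layer may differ.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("loader-host", username="centos",
               key_filename="/home/centos/.ssh/id_rsa")

# Send an SSH keepalive every 30 seconds so an idle control pipe isn't dropped.
client.get_transport().set_keepalive(30)

stdin, stdout, stderr = client.exec_command("tail -n 100 /tmp/cassandra-stress.log")
print(stdout.read().decode())
```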
I will close this issue and track the other problems elsewhere.
This issue is solved by:
- a9b3c58 DecommissionMonkey: fix fail to create instance
- 3472d9c fix return value of add_nodes(): return added nodes
Jenkins job 68
Description: the test hung at wait_db_up() (line 884 and line 885), but actually the DB was already UP. The process on the jenkins slave didn't exit; it was in 'S' status. The c-s client exited unexpectedly: I couldn't find it on the loader, but there is no error in the job log. Checking in grafana, no data was being written to the cluster, so I guess c-s exited at that point.
I manually started a c-s client on the loader, so you can see some writes at the end.
I manually stopped/started db-node-004 twice, so you can see some RPC errors at the end of the job.
I aborted the job, and some errors were output (no detail):
Time (reference the grafana snapshot: Loader per server, Served Requests):
- start: around 22:00
- hang at wait_db_up(): around 2:30
- c-s exited: around 3:00
- started a c-s manually: around 7:30