scylladb / scylla-cluster-tests

Tests for Scylla Clusters

Longevity GCE: Job hung at disrupt_stop_wait_start_scylla_server()->wait_db_up(), c-s client exit after 30mins #242

Closed. amoskong closed this issue 7 years ago.

amoskong commented 7 years ago

Jenkins job 68

Description:

  1. The longevity test ran the disrupt_stop_wait_start_scylla_server() nemesis and hung while waiting for the DB to come up. Detail: it hangs between line 884 and line 885 of the snippet quoted below (a sketch of a possible timeout guard follows the snippet).

Actually, the DB is already UP.

 875     def _report_housekeeping_uuid(self, verbose=True):
 876         """
 877         report uuid of test db nodes to ScyllaDB
 878         """
 879         uuid_path = '/var/lib/scylla-housekeeping/housekeeping.uuid'
 880         mark_path = '/var/lib/scylla-housekeeping/housekeeping.uuid.marked'
 881         cmd = 'curl "https://i6a5h9l1kl.execute-api.us-east-1.amazonaws.com/prod/check_version?uu=%s&mark=scylla"'
 882 
 883         uuid_result = self.remoter.run('test -e %s' % uuid_path,
 884                                        ignore_status=True, verbose=verbose)
 885         mark_result = self.remoter.run('test -e %s' % mark_path,
 886                                        ignore_status=True, verbose=verbose)
 887         if uuid_result.exit_status == 0 and mark_result.exit_status != 0:
 888             result = self.remoter.run('cat %s' % uuid_path, verbose=verbose)
 889             self.remoter.run(cmd % result.stdout.strip())
 890             self.remoter.run('sudo -u scylla touch %s' % mark_path,
 891                              verbose=verbose)
 892 
 893     def wait_db_up(self, verbose=True):
 894         text = None
 895         if verbose:
 896             text = '%s: Waiting for DB services to be up' % self
 897         wait.wait_for(func=self.db_up, step=60,
 898                       text=text)
 899         self._report_housekeeping_uuid()
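
As mentioned above, a minimal sketch of a timeout guard on the test-runner side, assuming the hang is the blocking remoter.run() call inside _report_housekeeping_uuid(). This is illustrative only and not the fix that was actually applied (see the PR linked later in this thread):

    import threading

    def run_with_timeout(func, timeout, *args, **kwargs):
        # Run func in a daemon thread and stop waiting after `timeout` seconds.
        # The thread may keep running in the background; we only stop blocking on it.
        worker = threading.Thread(target=func, args=args, kwargs=kwargs)
        worker.daemon = True
        worker.start()
        worker.join(timeout)
        return not worker.is_alive()   # True if func finished in time

    # Hypothetical usage inside wait_db_up(): give the housekeeping report
    # 5 minutes, then warn instead of hanging the whole nemesis.
    # if not run_with_timeout(self._report_housekeeping_uuid, 300):
    #     self.log.warning('housekeeping uuid report timed out')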
  2. The SSH process that launched the c-s client on the Jenkins slave didn't exit and is in 'S' (sleeping) state. The c-s client exited unexpectedly; I didn't find it on the loader, and there is no error in the job log.
 1406  8050  1240  1240 ?           -1 S     1010   0:00  |                           \_ /bin/ssh -tt -a -x -o ControlPath=
/var/tmp/ssh-mastertpljsc/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/var/tmp/tmpG59aNn -o BatchMode=yes -o C
onnectTimeout=300 -o ServerAliveInterval=300 -l scylla-test -p 22 -i /jenkins/.ssh/scylla-test 35.185.60.228 echo TAG: load
er_idx:0-cpu_idx:0-keyspace_idx:1; /home/scylla-test/run_cassandra_stress.sh 0 1
  3. Checking in Grafana, there is no data written to the cluster. I guess c-s exited at that point.

  4. I manually started a c-s client on the loader, so you can see some writes at the end.

  5. I manually stopped/started db-node-004 twice, so you can see some RPC errors at the end of the job.

  6. I aborted the job, and some errors were output (no detail):

    2017-02-18 00:12:56,434 stacktrace       L0036 ERROR| 
    2017-02-18 00:12:56,434 stacktrace       L0039 ERROR| Reproduced traceback from: /usr/lib/python2.7/site-packages/avocado/core/test.py:435
    NO detail

Timeline (see the Grafana snapshot: Loader per server, Served Requests):

- start: around 22:00
- hang at wait_db_up(): around 2:30
- c-s exited: around 3:00
- started a c-s manually: around 7:30

[Grafana snapshot: Scylla per-server metrics dashboard]

amoskong commented 7 years ago

@lmr can you help review this issue? It looks like a framework (Avocado/SCT) bug or a c-s issue, not a Scylla issue.

job-log-from-StopWaitStartService-nemesis.txt

amoskong commented 7 years ago

Avocado version on the slave machine: 36.0 (installed by pip).

I just upgraded it to 36.3 and reported an issue: https://github.com/avocado-framework/avocado/issues/1801 ("PIP: pip doesn't has latest 36.3lts" #1801)

amoskong commented 7 years ago

In another job, the c-s client raised 30 java.io.IOException (Operation x10 on key(s) ...) errors between 2017-02-19 01:47:40,704 and 2017-02-19 01:47:40,813.

[stdout] java.io.IOException: Operation x10 on key(s) [363433384f4c37363830]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)
[stdout] 
[stdout]         at org.apache.cassandra.stress.Operation.error(Operation.java:216)
[stdout]         at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:188)
[stdout]         at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:99)
[stdout]         at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:107)
[stdout]         at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:259)
[stdout]         at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:309)

And it raised thousands of UnavailableExceptions:

[stdout] com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)

The c-s client stopped writing data to the cluster at that time; the Loader per server and Served Requests per server metrics went down to zero. The c-s process might have exited at that point.

In this job, the Nemesis process worked well; it kept switching between different Monkeys.

loader.log.txt

roydahan commented 7 years ago

@kongove, I don't think the c-s exception is related to SCT. I've seen it several times during 1TB longevity runs, and it looks like c-s behaviour (although I haven't verified that on Cassandra yet).

It happens when c-s gets timeouts from Scylla; after 10 timeouts per key are reached, we get this exception. Usually it happens when two nodes are down at the same time with CL=QUORUM and RF=3.
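
To make the arithmetic behind that error message explicit, here is a tiny illustrative snippet (not actual driver code) of the quorum availability check the client is failing:

    def quorum_available(replication_factor, alive_replicas):
        # QUORUM needs floor(RF/2) + 1 replicas to acknowledge the operation.
        required = replication_factor // 2 + 1
        return alive_replicas >= required

    # With RF=3 a quorum needs 2 replicas; with two nodes down only 1 replica is
    # alive, which matches the "2 required but only 1 alive" exception above.
    assert quorum_available(3, 2) is True
    assert quorum_available(3, 1) is False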

amoskong commented 7 years ago

@roydahan Yes, in the latest job (with Avocado 36.3) the errors came from c-s; Avocado & SCT worked well.

Do we need to keep watching the c-s client output, wait some time, and reset/restart the c-s client in the error case?
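
If we go that way, a rough sketch of such a watchdog, assuming the loader node exposes a remoter.run() like the cluster code quoted earlier (loader_node and cs_cmd are illustrative names, not the actual SCT API):

    import time

    def keep_cs_alive(loader_node, cs_cmd, check_interval=300):
        # Poll the loader: if no cassandra-stress process is left, relaunch it.
        while True:
            result = loader_node.remoter.run('pgrep -f cassandra-stress',
                                             ignore_status=True)
            if result.exit_status != 0:
                # c-s died (e.g. after repeated UnavailableExceptions); restart it
                loader_node.remoter.run(cs_cmd, ignore_status=True)
            time.sleep(check_interval)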

amoskong commented 7 years ago

SSH connections might have problems after the longevity test has run for 6+ hours. There is also a very long rsync timeout in send_files(), or no timeout at all when executing the command in send_files(); this blocked the test and might have caused c-s to exit.

After fixing some issues with the master ssh and the timeout in send_files(), c-s no longer exits: https://github.com/scylladb/scylla-cluster-tests/pull/251/

Current problem: we can't see c-s output in the Jenkins console or log after the job has run for 6+ hours, but the Grafana data displays fine and the c-s client is still alive on the loader instance. It is probably caused by the master ssh pipe being broken; I'm troubleshooting it.
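
One quick way to confirm whether the master connection is still up is to ask the ControlMaster process directly with `ssh -O check`. A small sketch; the ControlPath, user, and host below are taken from the ps output earlier in this issue and are only examples:

    import subprocess

    def control_master_alive(control_path, user, host):
        # 'ssh -O check' asks the ControlMaster behind `control_path` whether it
        # is still running; exit status 0 means the multiplexed connection is up.
        cmd = ['ssh', '-O', 'check', '-o', 'ControlPath=%s' % control_path,
               '-l', user, host]
        return subprocess.call(cmd) == 0

    # Example (values from the process listing above):
    # control_master_alive('/var/tmp/ssh-mastertpljsc/socket',
    #                      'scylla-test', '35.185.60.228')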

I will close this issue and track the other problem elsewhere.

amoskong commented 7 years ago

This issue is solved by:

- a9b3c58 DecommissionMonkey: fix fail to create instance
- 3472d9c fix return value of add_nodes(): return added nodes
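
For context, a rough sketch of what the add_nodes() return-value fix (3472d9c) amounts to; the helper names below are illustrative, not the exact SCT code:

    def add_nodes(self, count):
        # Return only the nodes created by this call, so callers such as the
        # DecommissionMonkey can act on the instances they just added.
        new_nodes = []
        for _ in range(count):
            node = self._create_node()   # hypothetical per-instance helper
            self.nodes.append(node)
            new_nodes.append(node)
        return new_nodes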