u2 could die on any tap/click command, making test script vulnerable

gamesguru commented 5 years ago

Another issue on my rack of Allwinner d7s, sometimes one or two are dying over the weekend. Not a big deal but worth looking into.

Here we died at a swipe. It should be possible to seamlessly init --serial, but only after you accept the pull requests to fix #206


Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 445, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 367, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=37333): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/testlib/src/ocast3/testlogic.py", line 105, in main_parallel_test
    gridtest(d, s)
  File "/home/testlib/src/ocast3/testlogic.py", line 161, in gridtest
    swipeup(d)
  File "/home/testlib/src/ocast3/testlogic.py", line 60, in swipeup
    d.u2.swipe(x * d.x, y1 * d.y, x * d.x, y2 * d.y, 0.5)
  File "/usr/local/lib/python3.6/site-packages/uiautomator2/__init__.py", line 1441, in swipe
    return self.jsonrpc.swipe(fx, fy, tx, ty, steps)
  File "/usr/local/lib/python3.6/site-packages/uiautomator2/__init__.py", line 432, in __call__
    params, http_timeout)
  File "/usr/local/lib/python3.6/site-packages/uiautomator2/__init__.py", line 439, in jsonrpc_retry_call
    return self.jsonrpc_call(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/uiautomator2/__init__.py", line 482, in jsonrpc_call
    data=data)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 559, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/uiautomator2/__init__.py", line 269, in request
    method, url, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 526, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=37333): Read timed out. (read timeout=60)```

gamesguru commented 5 years ago

Attaching the atx log. Same thing happened today at a swipe up

atx-agent.log

mingyuan-xia commented 5 years ago

interesting point, need

[ ] check if a device is dead (atx-agent not responding??, maybe an overkill)
[ ] reboot the device
[ ] reinit/restart the agent to see if it is back I believe this is more about device farm control (atx-server functionality) than the automation framework's responsibility.

gamesguru commented 5 years ago

what is annoying about this (and it really only happens on a few devices out of 40 at the end of a week), is that it can happen with any command. dump_hierarchy(), swipedown(), tap(), I would assume also adb_shell() and others.

Therefore, to avoid general exceptions in the test we MUST wrap every ui automation command in a try except block. Not possible for the average person to fix (without greatly re-working their project), and anyways it makes more sense to fix it on the server side.

The device can be easily re-inited without rebooting, it is as if the issue is just a brief loss in usb. Only happens once or twice out of 40 devices running round the clock for weeks at at time. But currently, my project would not support it without lots of changes, try wrapper blocks mostly

I'm curious what you will think

mingyuan-xia commented 5 years ago

Yes I understand your feeling. I encountered this before. My strategy is to make tests stateless (no left-over on disk). Then do the big try-catch at test-level, with a test scheduler. Basically, the scheduler runs one test and harvests its result. If it detects failure because of the driver(u2), simply mark the device as temporarily down and reboot the device, re-init it to bring the device back again. Think of the driver failure like network connectivity interruption, which can happen randomly. And it is fair for the driver failure to happen more often, because a device environment is more complicated (dead wifi/network down/some stupid app has a service leaking memory/some automation accidentally cripples the OS, etc). Recovery (retry here) is inevitable. By the end of the day, the scheduler reports a per-device down-time-to-service-time ratio and overall metric for an operator (and the boss) to keep an eye on the health and efficiency of the entire device farm.

mingyuan-xia commented 5 years ago

BTW if your tests are dependent, it is still OK to use this design. The added complexity it for the scheduler to know about the dependency and restart necessary preceding tests for a failed one.

gamesguru commented 5 years ago

True, the end-user should be able to handle this without too much work on their end. Just re-init and pick up where left off.

If the automator dies, I would restart the whole test suite. If that fails 5 times in a row, I would assume some bug in uiautomator, to abort all, and mark tests as inconclusive.

Are you using xUnit or pytest or just your own implementation?

Note that this is a rare occurrence. Seeing this more in stress-tests that run continuously, than in quick unit tests. Then only, I see it out of 40 devices running a week on maybe one or two?

mingyuan-xia commented 5 years ago

The scheduler is a plain Python script, ~200LOC, using multiprocess one per device, and metrics updated in the console. Tests are u2 separate scripts.

gamesguru commented 5 years ago

The scheduler, you made it yourself? Are results saved to file or only printed to console?

mingyuan-xia commented 5 years ago

Test results persistent to a remote DB and metrics (the health of devices, effective-test-time-per-device, etc) shown in the console.

codeskyblue commented 5 years ago

Move to https://github.com/openatx/uiautomator2/issues/324

openatx / uiautomator2

u2 could die on any tap/click command, making test script vulnerable #224