Spanner Unit Test b0rken on AppVeyor

dhermes commented 7 years ago

https://ci.appveyor.com/project/GoogleCloudPlatform/google-cloud-python/build/1.0.1072.master/job/qvj0q69feleskbpu

dhermes commented 7 years ago

@tseaver Having a not-green build is making me itchy. Are you planning on checking this out?

tseaver commented 7 years ago

I will start on this issue today. I have zero clue why the following would fail on Python 3.4 on Windows but not on Linux:

____________________ TestTransactionPingingPool.test_bind _____________________
self = <unit_tests.test_pool.TestTransactionPingingPool testMethod=test_bind>
    def test_bind(self):
        pool = self._makeOne()
        database = _Database('name')
        SESSIONS = [_Session(database) for _ in range(10)]
        database._sessions.extend(SESSIONS)

>       pool.bind(database)
unit_tests\test_pool.py:567: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox\py34\lib\site-packages\google\cloud\spanner\pool.py:413: in bind
    self.begin_pending_transactions()
.tox\py34\lib\site-packages\google\cloud\spanner\pool.py:440: in begin_pending_transactions
    super(TransactionPingingPool, self).put(session)
.tox\py34\lib\site-packages\google\cloud\spanner\pool.py:340: in put
    self._sessions.put_nowait((_NOW() + self._delta, session))
c:\python34\Lib\queue.py:187: in put_nowait
    return self.put(item, block=False)
c:\python34\Lib\queue.py:146: in put
    self._put(item)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <queue.PriorityQueue object at 0x043BEF30>
item = (datetime.datetime(2017, 2, 15, 0, 14, 41, 439103), <unit_tests.test_pool._Session object at 0x043AED50>)
    def _put(self, item):
>       heappush(self.queue, item)
E       TypeError: unorderable types: _Session() < _Session()
c:\python34\Lib\queue.py:230: TypeError

(There is an identical failure in unit_tests.test_pool.TestTransactionPingingPool testMethod=test_put_full).

tseaver commented 7 years ago

OK, my theory is that the clock resolution on the Appveyor Windows host is weird, such that successive calls to datetime.datetime.utcnow() are returning the same time, which makes the priority queue switch to the mock session to distinguish between entries.

I can work around the test failure by making _Session total-orderable: likely I need to ensure that the "real" Session class can be ordered, too.

dhermes commented 7 years ago

Victory: https://ci.appveyor.com/project/GoogleCloudPlatform/google-cloud-python/build/1.0.1079.master

dhermes commented 7 years ago

A "sporadic" failure: https://ci.appveyor.com/project/GoogleCloudPlatform/google-cloud-python/build/1.0.1081.master/job/rl8ms6nhw1gibeb1

@tseaver It seems very possible that the culprit is a unit test that talks to the backend on accident

tseaver commented 7 years ago

It seems very possible that the culprit is a unit test that talks to the backend on accident

I don't think so in this case: the database and spanner_api objects are completely mocked out. I think what is happening is that the Appveyor machine is slow / loaded enough such that the very low retry timeout (0.01 seconds) is cutting the retry loop off before the first retry. We can try raising that timeout (maybe 50 ms vs. 10?)

dhermes commented 7 years ago

Why don't we also mock the retry?

tseaver commented 7 years ago

The whole point of that test is to ensure that run_with_transaction does retry on a gRPC ABORT.

dhermes commented 7 years ago

No worries. Just want to make sure the test is deterministic.

tseaver commented 7 years ago

run_with_transaction(callback) sets up a transaction and passes it to callback, and afterwards attempts to commit the transaction. While the commit back-end raises ABORT, and the user-supplied timeout has not expired, it retries.

In order to exercise the retry loop without adding too much delay, I set the timeout to 10 msec. I didn't think it would be possible for the following to take longer than that:

        def unit_of_work(txn):
            called_with.append((txn, args, kw))
            txn.insert(TABLE_NAME, COLUMNS, VALUES)

(along with the call to txn.commit(), which then delegates to the mocked-out spanner_api, raising an Abort).

I didn't count on the fact that a loaded CI host might make that assumption.

dhermes commented 7 years ago

Why does the timeout play any role?

tseaver commented 7 years ago

The run_with_transaction method keeps retrying until the user-supplied timeout expires. In the failing test, the timeout expired unexpectedly during the first run of the callback, which led to the assertion failure: there was no retry.

dhermes commented 7 years ago

Got it. Can we mock out the part that decides when to retry so it doesn't rely on elapsed time (which is what makes it nondeterministic)?

tseaver commented 7 years ago

Have a look at the MUT (Session.run_in_transaction) is the bit that makes the decision.

dhermes commented 7 years ago

@tseaver Maybe that's an indication that a re-factor would be useful?

googleapis / google-cloud-python

Spanner Unit Test b0rken on AppVeyor #3011