perfact / zodbsync

Dump and restore objects between a ZODB and a serialized file system structure
GNU General Public License v2.0

Watcher sometimes recording old state #73

Closed viktordick closed 3 years ago

viktordick commented 3 years ago

This is a hard-to-reproduce error, as it occurs only sporadically. I have not been able to pin down the exact circumstances under which it occurs.

As far as I have been able to understand, the wrong behavior is as follows:

  1. I use zodbsync pick to pick some commits, which modifies some objects and adds others. The changed or added objects are played back into the ZODB.
  2. The watcher notices that the objects have been changed and starts to record them.
  3. For the changed objects, the watcher records their old state instead of the new one, overwriting the state that was created by the pick.

The correct new state of the objects is present in Zope, but not on the file system. Further sweeps do not pick up any additional transactions and do not correct this.

This should not happen. The watcher should always see the objects in the state after the transaction currently stored in self.last_visible_txn.

This might be related to the watcher tests failing sporadically, which is also hard to pin down. One example traceback is:

===================================================== FAILURES ======================================================
____________________________________________ TestSync.test_watch_change _____________________________________________

self = <perfact.zodbsync.tests.test_sync.TestSync object at 0x7fa274097e80>
conn = <perfact.zodbsync.helpers.Namespace object at 0x7fa2740d1eb0>

    def test_watch_change(self, conn):
        """
        Start the watcher, change something using the second connection without
        commiting yet, do a step on the watcher, make sure the change is not
        yet visible, then commit the change and do another step, making sure
        that it is now present.
        """
        fname = self.repo.path + '/__root__/__meta__'
        watcher = self.runner('watch')
        watcher.setup()
        conn.tm.begin()
        conn.app._addRole('TestRole')
        watcher.step()
        assert 'TestRole' not in open(fname).read()
        conn.tm.commit()
        watcher.step()
>       assert 'TestRole' in open(fname).read()
E       assert 'TestRole' in "[\n    ('is_root', True),\n    ('perms', [\n        ('Access contents information', False, ['Anonymous', 'Manager']),...bDAV access', False, ['Authenticated', 'Manager']),\n        ]),\n    ('title', 'Zope'),\n    ('type', 'Folder'),\n]\n"
E        +  where "[\n    ('is_root', True),\n    ('perms', [\n        ('Access contents information', False, ['Anonymous', 'Manager']),...bDAV access', False, ['Authenticated', 'Manager']),\n        ]),\n    ('title', 'Zope'),\n    ('type', 'Folder'),\n]\n" = <built-in method read of _io.TextIOWrapper object at 0x7fa2740e7380>()
E        +    where <built-in method read of _io.TextIOWrapper object at 0x7fa2740e7380> = <_io.TextIOWrapper name='/tmp/tmp0ox63v_h/__root__/__meta__' mode='r' encoding='UTF-8'>.read
E        +      where <_io.TextIOWrapper name='/tmp/tmp0ox63v_h/__root__/__meta__' mode='r' encoding='UTF-8'> = open('/tmp/tmp0ox63v_h/__root__/__meta__')

perfact/zodbsync/tests/test_sync.py:356: AssertionError
viktordick commented 3 years ago

Another error in the tests that sometimes happens:

===================================================== FAILURES ======================================================
_____________________________________________ TestSync.test_watch_move ______________________________________________

self = <perfact.zodbsync.tests.test_sync.TestSync object at 0x7f66180bcbe0>
conn = <perfact.zodbsync.helpers.Namespace object at 0x7f66180a3dc0>

    def test_watch_move(self, conn):
        """
        Create a Page Template, record it using the watcher, rename it and make
        sure the watcher notices. Then add a second one and do a
        three-way-rename in one transaction, making sure the watcher keeps
        track.
        """
        watcher = self.runner('watch')
        watcher.setup()
        root = self.repo.path + '/__root__/'
        src = '/__source-utf8__.html'
        app = conn.app

        add = app.manage_addProduct['PageTemplates'].manage_addPageTemplate
        rename = app.manage_renameObject

        with conn.tm:
            add(id='test1', text='test1')
        watcher.step()
        assert os.path.isdir(root + 'test1')

        # Not sure how to apply this specifically to the secondary connection
        # and why it is only needed for the rename and not the adding, but it
        # seems to do the job
        userfolder = conn.app.acl_users
        user = userfolder.getUser('perfact').__of__(userfolder)
        newSecurityManager(None, user)

        with conn.tm:
            rename('test1', 'test2')
        watcher.step()
        assert os.path.isdir(root + 'test2')
        assert not os.path.isdir(root + 'test1')

        with conn.tm:
            add(id='test1', text='test2')
        watcher.step()
>       assert os.path.isdir(root + 'test1')
E       AssertionError: assert False
E        +  where False = <function isdir at 0x7f661bf42160>(('/tmp/tmp6tuvwh1b/__root__/' + 'test1'))
E        +    where <function isdir at 0x7f661bf42160> = <module 'posixpath' from '/usr/lib/python3.9/posixpath.py'>.isdir
E        +      where <module 'posixpath' from '/usr/lib/python3.9/posixpath.py'> = os.path

perfact/zodbsync/tests/test_sync.py:395: AssertionError
viktordick commented 3 years ago

Some new information. Some of the failures are simply a timing issue: after a commit on one connection, I cannot expect the other connection to see the result immediately. I could fix this by retrying with a delay at the corresponding positions (PR will be created soon). This reduced the number of failures; the remaining cases are those that actually run into the failure mode, where the change is not noticed even by all subsequent steps.
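
A hedged sketch of what such a retry could look like in the tests (the helper name and its exact placement are illustrative, not necessarily what the PR does):

import time

def retry(check, step=None, attempts=10, delay=0.1):
    """Re-run a check, optionally re-stepping the watcher, until it passes."""
    for _ in range(attempts):
        if check():
            return True
        if step is not None:
            step()
        time.sleep(delay)
    return check()

# Possible use in test_watch_change after the commit:
#   conn.tm.commit()
#   assert retry(lambda: 'TestRole' in open(fname).read(), step=watcher.step)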

I then added a debugging trace to Watch._read_changed_oids to check which transaction IDs are actually checked, and set a breakpoint at the point where the actual failure occurs.

The function was called with the following arguments (base64-encoded):

[(b'A99w2L4NU80=', b'A99w2L6Xosw='), (b'A99w2L6Xos0=', b'A99w2L6kpGY=')]
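
For orientation, these are base64-encoded 8-byte ZODB transaction IDs. A small ad-hoc helper (not part of zodbsync) can decode them into timestamps that line up with the times in the undoInfo output below:

import base64
from ZODB.TimeStamp import TimeStamp

def tid_time(encoded):
    # A transaction ID is an 8-byte TimeStamp; undoInfo shows it base64-encoded.
    return TimeStamp(base64.b64decode(encoded)).timeTime()

print(tid_time(b'A99w2L6kpGY='))  # roughly 1618667084.68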

The undoInfo contains the following entries:

[{'description': '',
  'id': b'A99w2L6kpGY=',
  'size': 1348,
  'time': 1618667084.6819844,
  'user_name': ''},
 {'description': '',
  'id': b'A99w2L6Xosw=',
  'size': 1338,
  'time': 1618667084.6700768,
  'user_name': ''},
 {'description': '',
  'id': b'A99w2L4NU8w=',
  'size': 397,
  'time': 1618667084.5434513,
  'user_name': ''},
 {'description': 'Added default view for root object',
  'id': b'A99w2L4J/7s=',
  'size': 1676,
  'time': 1618667084.5404043,
  'user_name': ''},
 {'description': 'Added virtual_hosting',
  'id': b'A99w2L3pKt0=',
  'size': 829,
  'time': 1618667084.510346,
  'user_name': ''},
 {'description': '',
  'id': b'A99w2L3ez4g=',
  'size': 717,
  'time': 1618667084.500864,
  'user_name': ''},
 {'description': 'initial database creation',
  'id': b'A99w2LzemyI=',
  'size': 154,
  'time': 1618667084.2663016,
  'user_name': ''}]

This is at test_watch_move after renaming from test1 to test2. watcher.object_tree still contains test1 while watcher.app.objectIds() contains test2.

This means: There was a call to _read_changed_oids that checked the transaction A99w2L6kpGY= and should have seen the rename, but it did not see it.

The bug seems to be in _set_last_visible_txn: it yields a transaction ID that is not actually visible yet.

viktordick commented 3 years ago

Another clue: this time, the test failed after creating /test1 and not finding it on disk:

Found 2 changed objects
Found 2 changed objects
Found 2 changed objects
OIDs: [b'\x00\x00\x00\x00\x00\x00\x00\x01', b'\x00\x00\x00\x00\x00\x00\x00e']
OIDs: [b'\x00\x00\x00\x00\x00\x00\x00\x01', b'\x00\x00\x00\x00\x00\x00\x00e']
OIDs: [b'\x00\x00\x00\x00\x00\x00\x00\x01', b'\x00\x00\x00\x00\x00\x00\x00e']
Recording /
Recording /
Recording /

The OID with \x00e at the end is indeed that of /test1. Somehow the watcher saw the object but only recorded /!
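
For reference, this mapping can be checked in a debugger session with the conn fixture from the tests above; an object's _p_oid is its raw 8-byte storage OID:

print(conn.app.test1._p_oid)  # expected: b'\x00\x00\x00\x00\x00\x00\x00e'
print(conn.app._p_oid)        # presumably the OID ending in \x01 seen above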

viktordick commented 3 years ago

All right, I think I found the origin of the problem:

import transaction
import ZEO

class Conn():
    def __init__(self):
        self.tm = transaction.TransactionManager()
        self.db = ZEO.DB('zeo.sock')
        self.conn = self.db.open(self.tm)
        self.app = self.conn.root.Application

c1 = Conn()
c2 = Conn()

# Clean up property if it exists
with c1.tm:
    if c1.app.hasProperty('testprop'):
        c1.app.manage_delProperties(ids=['testprop'])

# Open a transaction to try and see a consistent snapshot
c1.tm.begin()
# We see the current transaction
print(c1.db.lastTransaction())
# We see that the property is not set
print(c1.app.getProperty('testprop'))

# Using the second connection, set the property and commit
with c2.tm:
    c2.app.manage_addProperty('testprop', 'test', 'string')

# We do not see the property on the first connection - we are in the middle
# of an older transaction
print(c1.app.getProperty('testprop'))
# BUT lastTransaction() is not the transaction we are looking at, it is the
# newer one created by c2
print(c1.db.lastTransaction())
c1.tm.abort()

with c1.tm:
    print(c1.app.getProperty('testprop'))

The code in watch.py uses undoInfo() rather than lastTransaction(), but the problem is the same: it does not give us the transaction we are currently looking at, it gives us the newest transaction that has been committed on the ZEO server, which might be newer than our snapshot.
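
The same effect can be illustrated with undoInfo() by extending the snippet above (a hedged sketch; note that undoInfo() returns its IDs base64-encoded, while lastTransaction() returns raw 8-byte tids):

# If placed right after c2's commit in the snippet above, while c1 still sits
# in its older snapshot, undoInfo() already lists c2's transaction as the
# newest entry:
print(c1.db.undoInfo()[0]['id'])
# ... even though c1.app.getProperty('testprop') still shows nothing here.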

viktordick commented 3 years ago

It seems there is another way to access the transaction ID, one that actually reflects the transaction we are looking at: app._p_serial. This might be the fix.

viktordick commented 3 years ago

Unfortunately, this does not work. app._p_serial is not updated by new transactions unless they actually touch the app object.
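
A minimal illustration of that limitation, reusing the Conn instances from the reproduction above (the acl_users write is just an arbitrary change that does not touch app; an assumption for illustration, not code from the issue):

# Commit a change on the second connection that touches only a sub-object.
with c2.tm:
    c2.app.acl_users._p_changed = True  # re-save acl_users; app stays untouched

c1.tm.begin()                   # fresh snapshot on the first connection
print(c1.app._p_serial)         # still the tid of the last write to app itself
print(c1.db.lastTransaction())  # the newer tid of the acl_users transaction
c1.tm.abort()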