google-code-export / objectify-appengine

Automatically exported from code.google.com/p/objectify-appengine
MIT License

memcache not properly updated in some cases #71

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I ran into a situation where an entity was updated in the datastore but not in
the memcache. Only one unique spot in my code can make the edit that was
present in the datastore, but that specific edit was not present in the
memcache. The class has the @Cached annotation and everything has worked
flawlessly until now. I haven't made any code changes related to this for a
while.

I am wondering: is this some sort of bug in Objectify, or can this happen by
design?
The documentation says 'Transactional reads bypass the cache.'
Does this mean that regular reads served from the memcache can return results
that are not up to date? And I don't even mean seconds here - I was still
getting the old data from memcache 40 minutes after the actual write, at which
point I stopped waiting and updated the cache manually.
Or can concurrent writes get interleaved somehow, so that write A is last for
memcache but first for the datastore, while write B is the opposite - so that
memcache holds data A while the datastore holds data B?

The non-verbose version of actual code:

Objectify ofyTxn = ObjectifyService.beginTransaction();
try {
    // Transactional reads of the Game entity and its child GamePlayer entities
    Game game = ofyTxn.find(Game.class, gameId);
    List<GamePlayer> gamePlayers = ofyTxn.query(GamePlayer.class).ancestor(game).list();

    // Do some edits

    ofyTxn.put(gamePlayers);
    ofyTxn.put(game);

    ofyTxn.getTxn().commit();
}
finally {
    // Roll back if the commit above never happened (e.g. an exception was thrown)
    if (ofyTxn.getTxn().isActive()) {
        ofyTxn.getTxn().rollback();
    }
}

Objectify 2.2.3, GAE SDK 1.4.0, around 14:50 UTC 2011-02-03 - no errors in the
logs or other visible warning signs

Original issue reported on code.google.com by xxStrom on 3 Feb 2011 at 4:17

GoogleCodeExporter commented 9 years ago
Do you have a lot of contention on these objects?

There is no reason why this should not work just fine with the memcache.
However, keep in mind that memcache is not transactionally linked to the
datastore - due to CPU scheduling, network lag within GAE, etc., it's entirely
possible for this to happen:

TxnA commits to datastore
TxnB commits to datastore
TxnB updates memcache
TxnA updates memcache

Scenarios like this are also possible (but you'd see errors in the logs):

TxnA commits to datastore
TxnA makes a call to update memcache... but gets cut off by a 30s timeout or a
network error first.

If you have a lot of contention and you need solid transactional behavior, you
probably shouldn't involve the memcache. Or, if stale data is OK for some
duration, put a time limit on the @Cached annotation.
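
For example, a minimal sketch of that annotation-level time limit (assuming the
expirationSeconds element on the 2.x @Cached annotation; the Game fields here
are just placeholders):

import javax.persistence.Id;
import com.googlecode.objectify.annotation.Cached;

// Cached copies expire after 60 seconds, so a stale value left behind by an
// interleaved memcache write can only survive that long.
@Cached(expirationSeconds = 60)
public class Game {
    @Id Long id;
    String state;
}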

If there wasn't a lot of contention, I'd be more concerned. All the unit
tests pass with caching turned on as well as off, and this case is covered.

Original comment by lhori...@gmail.com on 3 Feb 2011 at 7:20

GoogleCodeExporter commented 9 years ago
There is basically zero contention over this specific entity type.

In fact, the only way a double-write can happen is when a previous request gets
delayed by about 25-30 seconds, and even that can happen only in extraordinary
cases, because by design these requests are usually at least a minute apart.
I'm fairly certain there was no contention at the moment of that write, but
it's not 100% guaranteed to be outside the realm of possibilities ;)

This specific case can handle stale data, because any code that actually makes
changes based on state uses a transactional read anyway - which bypasses the
memcache, as we established. The reason I'm here is to understand whether I
should keep this possibility in mind for future design or whether it's a bug.
Based on your answer it seems that in very rare cases the TxnA/TxnB
interleaving can happen, which is good to know.

As an interesting turn of events, the pre-release of SDK 1.4.2 has this in its 
changelog:
- Added a putIfUntouched() method to the Memcache API. 
http://code.google.com/p/googleappengine/issues/detail?id=2139

I haven't investigated it in more detail yet, but could it perhaps be the 
solution to this interleaving race condition?
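
For illustration, a rough sketch of how a compare-and-set cache update with
that API might look (getIdentifiable()/putIfUntouched() are the low-level
Memcache calls; the helper and key names are just placeholders, not anything
Objectify does today):

import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheService.IdentifiableValue;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class CacheCasSketch {
    // Update a cached value only if nobody else touched it since we read it.
    static void updateIfUntouched(String cacheKey, Object newValue) {
        MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

        // Read the current cached value together with a CAS token.
        IdentifiableValue current = memcache.getIdentifiable(cacheKey);
        if (current == null) {
            return; // nothing cached; let the next cache miss repopulate it
        }

        // putIfUntouched() writes only if the entry is unchanged since
        // getIdentifiable(); false means another writer raced us.
        if (!memcache.putIfUntouched(cacheKey, current, newValue)) {
            // Simplest recovery: drop the entry and let the next read repopulate it.
            memcache.delete(cacheKey);
        }
    }
}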

Original comment by xxStrom on 3 Feb 2011 at 9:45

GoogleCodeExporter commented 9 years ago
It sounds like this might be a bug.  I'll have to expand the test coverage of 
caching + transactions.

To make this even more fun, all the caching code was changed pretty radically 
for 3.0 due to the async stuff.

The approach to transactional integrity described in that issue would require a
GUID field in the data model. It's probably possible to implement, but not by
me anytime soon.

Original comment by lhori...@gmail.com on 3 Feb 2011 at 10:15

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I see a similar cache out-of-sync issue when performing a 'put' followed by a
'query', using a simple entity annotated with @Cached.

The pattern is a GWT RequestFactory client making calls in a typical 'user 
updates entry within list' operation.

request 1:
- findEntries(from:0,max:10), an objectify query

client: updates entry[0] from the list

request 2: persist
- find entry by id, objectify find.
- apply changes
- persist entry, objectify put
- find entry, objectify find (RequestFactory isLive check). 
- *1
- *2

client: responds to 'success', reloads list.

request 3
- findEntries(from:0,max:10)

If everything works, request 3 contains the updated entity.

What I see is that the cache appears to be out of sync occasionally when a
'query' is performed. My impression from the documentation, though, is that
'query' results are not cached.

*1 Adding an additional 'find' operation shows that find returns correctly.
*2 Adding an additional 'query' confirms the query is out-of-sync with find 
within the same request.

(Removing the @Cached annotation makes everything work correctly.)

objectify-2.2.3 and objectify-3.0 exhibit similar behaviour.

OSX 10.7.1
java 1.6.0_26
GAE 1.5.3

Original comment by markh.ne...@gmail.com on 24 Aug 2011 at 10:47

GoogleCodeExporter commented 9 years ago
Following up on Comment 5…

The issue with the above relates to the High Replication Datastore, not the
cache; see:
http://code.google.com/appengine/docs/java/datastore/hr/overview.html

My example is flawed in that 'request 3' attempts to return
'eventually-consistent' results immediately, through a 'query'. This hits the
datastore, which is still in an inconsistent state following the 'put'.

The cache is correctly updated with the put.

Why this issue is only present with the @Cached annotation is more subtle. It's
possible that it is related to the way in which the dev app server emulates
inconsistency.
It seems that, following a 'put':
- A 'find'/'get' will always return correctly.
- If you 'put then find then query', you'll get consistent data in the query.
- If you 'put then query', you'll get inconsistent data (the dev app server's
emulation of high replication).

Thus adding the @Cached annotation in the above example caused all find
operations to be served by the cache, leaving only the 'request 3' queries to
hit the datastore - a 'put then query' pattern - and see eventually-consistent
data.

Re-running the query does 'eventually' result in consistent data; it seems the
dev app server spends ~15-20 seconds emulating the inconsistent state (or I
have another issue).
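
For what it's worth, when reproducing this in a unit test rather than against
the running dev app server, the local test framework appears to expose this
emulation directly - a rough sketch, assuming the standard appengine-testing
jar (the 100% unapplied-job setting is only for demonstration):

import com.google.appengine.tools.development.testing.LocalDatastoreServiceTestConfig;
import com.google.appengine.tools.development.testing.LocalServiceTestHelper;

public class HrdEmulationSketch {
    // Keep every write 'unapplied' at first, so non-ancestor queries reliably
    // see stale data and the eventual-consistency path gets exercised.
    private final LocalServiceTestHelper helper = new LocalServiceTestHelper(
            new LocalDatastoreServiceTestConfig()
                    .setDefaultHighRepJobPolicyUnappliedJobPercentage(100));

    public void setUp()    { helper.setUp(); }
    public void tearDown() { helper.tearDown(); }
}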

Original comment by markh.ne...@gmail.com on 25 Aug 2011 at 3:18

GoogleCodeExporter commented 9 years ago
The cache code was rewritten in Objectify 3.1 to use CAS operations.  It should 
be more or less transactionally linked to the datastore now - the only 
synchronization problem should be hard deadline exceeded exceptions.

Original comment by lhori...@gmail.com on 31 Dec 2011 at 5:05