mecheverri opened this issue 3 years ago (status: Open)
@mecheverri This is interesting. Is there a put happening during the same request that might hit the same key? Could it happen before the get?
I added some logging to better understand what is happening. It appears that within a request, there aren't any calls to put() for the item whose key is mentioned in the RuntimeError.
Fortunately, on the chance that this exception might result from a race condition, I tried catching the exception and retrying the tasklet, and that seems to work.
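A minimal sketch of that catch-and-retry workaround, using ndb.get_multi for simplicity where the comment above retried the tasklet itself; the helper name and retry count are illustrative, not from this issue:

from google.cloud import ndb

def get_multi_with_retry(keys, attempts=3):
    # Assumes an ndb context is already active for the current request.
    # Retry only on the spurious "already been set" batch error; re-raise
    # anything else, and the last failure once retries are exhausted.
    last_exc = None
    for _ in range(attempts):
        try:
            return ndb.get_multi(keys)
        except RuntimeError as exc:
            if "Key has already been set in this batch" not in str(exc):
                raise
            last_exc = exc
    raise last_exc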
@mecheverri Okay, that's helpful. Does your logging give you any insight into what calls are happening in a particular request context? I'm not entirely sure how to reproduce this. Any pointers you could give would be very helpful.
We have a high volume of requests and this appears to occur at random, which suggested the race condition theory. It occurs infrequently enough that I'm not able to identify a pattern regarding its cause. I have seen this so far with two different stack traces.
Race condition is certainly a possibility. The resource that would have that condition is attached to the context, which shouldn't be shared across threads, and should generally be created per request. Would your code that creates the context be something you could share?
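For reference, the usual pattern is one ndb context per request, for example via WSGI middleware; a minimal sketch, assuming a Flask app (the names here are illustrative, not from this thread):

from flask import Flask
from google.cloud import ndb

client = ndb.Client()

def ndb_wsgi_middleware(wsgi_app):
    # Wrap every request in its own ndb context, so the caches and batches
    # attached to the context are never shared between threads.
    def middleware(environ, start_response):
        with client.context():
            return wsgi_app(environ, start_response)
    return middleware

app = Flask(__name__)
app.wsgi_app = ndb_wsgi_middleware(app.wsgi_app)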
I have run into this while trying to finish porting a fairly large Python 2.7 App Engine project to Python 3.x. The following was with Python 2.7.18 and python-ndb 1.11.2.
I've been able to reproduce this consistently in a test case, using the _InProcessGlobalCache.
I haven't tested with newer versions yet.
from google.cloud import ndb

client = ndb.Client("test")


class Foo(ndb.Model):
    a = ndb.StringProperty()


with client.context(global_cache=ndb.global_cache._InProcessGlobalCache()):
    print(ndb.get_context().global_cache.cache)
    # => {}
    ent = Foo(a="Foo")
    ent.put()
    print(ndb.get_context().global_cache.cache)
    # => {}
    ent.put()
    print(ndb.get_context().global_cache.cache)
    # => {'NDB30\n\x06\x12\x04test\x12\x07\n\x03Foo\x10\x01': ('', 1675475589.391862)}
    # ndb.get_context().global_cache.clear()
    ndb.get_context().cache.clear()
    try:
        ndb.get_multi([ent.key, ent.key])
        # => RuntimeError("Key has already been set in this batch: ...
    finally:
        print(ndb.get_context().global_cache.cache)
        # {'NDB30\n\x06\x12\x04test\x12\x07\n\x03Foo\x10\x01': ('\n\x11\n\x06\x12\x04test\x12\x07\n\x03Foo\x10\x01\x1a\x0b\n\x01a\x12\x06\x8a\x01\x03Foo', None)}
@chrisrossi any updates on this issue?
I don't work here anymore.
I think the issue here is that we are creating a new lock on every get of a key. So, for instance, if I do something like this:
class SomeModel(ndb.Model):
    pass


@ndb.tasklet
def temp():
    yield (
        SomeModel.get_by_id_async('temp'),
        SomeModel.get_by_id_async('temp'),
    )


ndb.synctasklet(temp)()
The first get_by_id will register a todo entry with value ID + uuid_1. The second will try to register a todo entry with value ID + uuid_2 and will fail on the mismatch. I think the get should not append a uuid to the value it uses to acquire the lock. There is no real problem with the code above, since both gets will use a single future to fetch the value anyway. It makes sense to do this for writes, but not for reads, IMO.
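To illustrate the mismatch described above, here is a toy sketch (our own simplification, not the library's code) of a set batch that rejects a second, different value for the same key:

import uuid

class ToySetBatch:
    # Simplified stand-in for a global-cache set batch: one pending value per key.
    def __init__(self):
        self.todo = {}

    def add(self, key, value):
        if key in self.todo and self.todo[key] != value:
            raise RuntimeError("Key has already been set in this batch: {!r}".format(key))
        self.todo[key] = value

batch = ToySetBatch()
lock_1 = b"lock-" + uuid.uuid4().bytes  # first get_by_id_async locks the cache entry
lock_2 = b"lock-" + uuid.uuid4().bytes  # second get_by_id_async, same key, new uuid
batch.add(b"cache-key-for-temp", lock_1)
batch.add(b"cache-key-for-temp", lock_2)  # raises RuntimeError: values differ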
We are seeing loads of these errors too, after upgrading from version 1.9.0 to 2.2.2. The error seems to occur only in places where we rely heavily on tasklets. I really don't know how to do any debugging here.
@anoteboom What I ended up doing was to use two patches: the first disables this behavior for reads, because honestly I don't see any need for it on reads; the second disables it for writes but logs an error message so that the offending flow can be fixed. Once all the flows were fixed, that second patch was removed.
We have had these patches in production for 3 months and we haven't found any bugs caused by them. Using the error log, we were also able to fix all the cases where there were duplicate writes of the same entities. Following is the patch file I used:
from logging import error as log_error
from sys import modules

from google.cloud.datastore_v1.types.entity import Key
from google.cloud.ndb._cache import (
    _GlobalCacheSetBatch,
    global_compare_and_swap,
    global_set_if_not_exists,
    global_watch,
)
from google.cloud.ndb.tasklets import Future, Return, tasklet

GLOBAL_CACHE_KEY_PREFIX: bytes = modules["google.cloud.ndb._cache"]._PREFIX
LOCKED_FOR_READ: bytes = modules["google.cloud.ndb._cache"]._LOCKED_FOR_READ
LOCK_TIME: bytes = modules["google.cloud.ndb._cache"]._LOCK_TIME


@tasklet
def custom_global_lock_for_read(key: str, value: str):
    if value is not None:
        yield global_watch(key, value)
        lock_acquired = yield global_compare_and_swap(
            key, LOCKED_FOR_READ, expires=LOCK_TIME
        )
    else:
        lock_acquired = yield global_set_if_not_exists(
            key, LOCKED_FOR_READ, expires=LOCK_TIME
        )
    if lock_acquired:
        raise Return(LOCKED_FOR_READ)


##############################################################################
# This is a temporary patch to disable the hard check of not allowing
# multiple puts for the same key in a batch with different values. We are
# only logging the error here to be able to identify the areas where this
# issue exists. Once we have fixed all those cases, we will remove this patch.
##############################################################################
def custom_global_cache_set_batch_add(self: _GlobalCacheSetBatch, key: bytes, value):
    future: Future = self.futures.get(key)
    if future:
        if self.todo[key] != value:
            key_protobuf = Key()
            key_protobuf._pb.ParseFromString(key.removeprefix(GLOBAL_CACHE_KEY_PREFIX))
            deserialized_key = ", ".join(
                [
                    line.strip(", ")
                    for line in str(key_protobuf.path).strip("[]").splitlines()
                ]
            )
            log_error("Key has already been set in this batch: '%s'", deserialized_key)
        else:
            return future
    future = future or Future()
    future.info = self.future_info(key, value)
    self.todo[key] = value
    self.futures[key] = future
    return future


modules["google.cloud.ndb._cache"].global_lock_for_read = custom_global_lock_for_read
setattr(
    modules["google.cloud.ndb._cache"]._GlobalCacheSetBatch,
    "add",
    custom_global_cache_set_batch_add,
)
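If it helps, the patch installs itself at import time via the modules[...] assignments at the bottom, so it only needs to be imported once before any ndb work happens. A sketch, assuming the file above is saved as ndb_cache_patch.py (that module name is an assumption):

# main.py (sketch)
import ndb_cache_patch  # noqa: F401 -- importing it swaps in the patched functions

from google.cloud import ndb

client = ndb.Client()
# ... create contexts and use ndb as usual; reads no longer hit the hard check.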
@usmanmani1122 Thank you for that patch file. This prompts me to look into this again. I'll share my experience.
Confirmed! @usmanmani1122 your patch fixes this for me too. Thank you!
There is a bit of code that reads:
Ok, we're letting you know. We have Python 3.9 code running in App Engine and are seeing "Key has already been set in this batch" often enough that I felt compelled to check this issue. The expression in our code that triggers this is ndb.get_multi(event.guestlists), and the stack trace is below. I tried calling ndb.get_multi(), explicitly passing in duplicate keys, and it didn't trigger this issue.
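For context, a minimal sketch of that duplicate-key check (the model name and id here are made up for illustration):

from google.cloud import ndb

class GuestList(ndb.Model):
    name = ndb.StringProperty()

client = ndb.Client()
with client.context():
    key = ndb.Key("GuestList", "example-guestlist")
    # Passing the same key twice did not reproduce the error in this simple case,
    # which suggests the failure depends on global-cache state and tasklet timing.
    print(ndb.get_multi([key, key]))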