snarfed / bridgy-fed

🌉 A bridge between decentralized social network protocols
https://fed.brid.gy
Creative Commons Zero v1.0 Universal
576 stars 30 forks source link

AP => ATProto new user creation bottleneck on datastore #1376

Closed snarfed closed 1 hour ago

snarfed commented 2 hours ago

We got a bunch of new user signups from Flipboard this afternoon, ~9k during ~4:10-6:20p PT, ie roughly 1qps. Awesome!...and also a couple orders of magnitude higher than our usual incoming follow rate 😁, so it exposed a bottleneck in creating new ATProto users, specifically contention on the datastore transaction that creates the new repo and stores the initial commit and blocks. Example log below.

Task 39427228699028707821
AS2: {
  "actor": "https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099",
  "id": "https://flipboard.com/70c2adc1-2719-4662-9401-ddf9dab03a0a",
  "object": "https://bsky.brid.gy/bsky.brid.gy",
  "type": "Follow",
  "url": "https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099#followed-bsky.brid.gy"
}
Receiving activitypub follow https://flipboard.com/70c2adc1-2719-4662-9401-ddf9dab03a0a 
AS1: {
  "actor": "https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099",
  "id": "https://flipboard.com/70c2adc1-2719-4662-9401-ddf9dab03a0a",
  "url": "https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099#followed-bsky.brid.gy",
  "objectType": "activity",
  "verb": "follow",
  "object": "https://bsky.brid.gy/"
}
Follow of bot user, reloading https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099
Signing with fed.brid.gy 's key
requests.get https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099 {'data': None}
Received 200:
Received 200
Creating new did:plc for https://flipboard.com/magazines/3zb0UW_iTeWO8Gz0um91Ew:m:2520005099 woodworking-table-ideas-anikasdiylife.flipboard.com.ap.brid.gy https://atproto.brid.gy
...
adding GCP DNS TXT record for _atproto.woodworking-table-ideas-anikasdiylife.flipboard.com.ap.brid.gy. "did=did:plc:qynny5xtgq264h7q4nppzvoo"
Checking for existing record
done!
...
Storing ATProto app.bsky.actor.profile self
Storing ATProto chat declaration record

Traceback (most recent call last):
...
  File "oauth_dropins/webutil/flask_util.py", line 400, in decorator
    return fn(*args, **kwargs)
  File "/workspace/protocol.py", line 1679, in receive_task
    return PROTOCOLS[obj.source_protocol].receive(obj=obj, authed_as=authed_as,
  File "/workspace/protocol.py", line 995, in receive
    from_user.enable_protocol(proto)
  File "/workspace/models.py", line 555, in enable_protocol
    new_self = enable()
  File "google/cloud/ndb/_transaction.py", line 347, in transactional_inner_wrapper
    return transaction(
...
  File "/workspace/models.py", line 551, in enable
    to_proto.create_for(user)
  File "google/cloud/ndb/_transaction.py", line 347, in transactional_inner_wrapper
    return transaction(
...
  File "/workspace/atproto.py", line 428, in create_for
    repo = Repo.create(
  File "arroba/repo.py", line 216, in create
    return cls.create_from_commit(storage, commit_data, signing_key=signing_key,
  File "arroba/repo.py", line 178, in create_from_commit
    storage.apply_commit(commit_data)
  File "arroba/datastore_storage.py", line 416, in decorated
    ret = fn(self, *args, **kwargs)
  File "google/cloud/ndb/_transaction.py", line 347, in transactional_inner_wrapper
    return transaction(
...
  File "arroba/datastore_storage.py", line 564, in apply_commit
    repo = AtpRepo.get_by_id(commit['did'])
  File "google/cloud/ndb/_options.py", line 102, in wrapper
    return wrapped(*pass_args, **kwargs)
...
google.api_core.exceptions.RetryError: Maximum number of 3 retries exceeded while calling <function make_call.<locals>.rpc_call at 0x7f1219ed4220>, last exception: 409 too much contention on these datastore entities. please try again.
snarfed commented 2 hours ago

Ugh, ok, pretty obvious. We're unintentionally doing external network requests inside a datastore transaction here, specifically creating the new ATProto DID. OK, should at least be a straightforward fix.

snarfed commented 1 hour ago

^ Looks like this did the trick, datastore tx contention is down ~20x.

image
snarfed commented 1 hour ago

Total active user count over the last two days. Flipboard isn't messing around! Y axis is zoomed in so it's a bit misleading, but still 😆

image