snarfed / bridgy-fed

🌉 A bridge between decentralized social network protocols
https://fed.brid.gy
Creative Commons Zero v1.0 Universal

ATProto: scale getRepo XRPC calls #1151

snarfed closed this 2 months ago

snarfed commented 3 months ago

Right now, getRepo runs out of memory and crashes on a few big repos, because it ignores since and tries to load the whole repo into memory first. Should be straightforward to fix.
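For illustration, here's a rough sketch of what honoring since could look like: stream blocks lazily and skip anything at or before the requested revision, instead of materializing the whole repo first. The Block class and blocks_since function are hypothetical, not arroba's actual API, and the sketch assumes each stored block records the rev (a TID, which sorts lexicographically) of the commit that wrote it.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Block:
    """Hypothetical stored block; not arroba's real class."""
    cid: str
    rev: str    # TID of the commit that wrote this block (assumption)
    data: bytes


def blocks_since(blocks: Iterable[Block], since: Optional[str]) -> Iterator[Block]:
    """Lazily yields blocks written after `since`.

    Generator-based, so only one block is held at a time if `blocks` is
    itself a lazy iterator (eg a datastore query).
    """
    for block in blocks:
        # TIDs are fixed-width and sortable, so string comparison orders them
        if since is None or block.rev > since:
            yield block
```

A caller would feed this a lazy query iterator and write each yielded block straight into the CAR response, so memory use stays flat regardless of repo size.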

(For now, snarfed/arroba@8d043690c0315e59d4b301fdf3763b2ccd3c8268 stopped the bleeding.)

snarfed commented 3 months ago

I think this will be based on MST diff. I think we have that fully built out in arroba, and maybe tested, but it's definitely not actually used yet, much less mature. Hrm.
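To make the idea concrete, here's a minimal sketch of what a diff between two repo states produces. It flattens each MST to a plain dict of record path to CID, which is an assumption for illustration only; a real MST diff walks the two trees and skips identical subtrees rather than comparing every key. The Diff and diff_records names are hypothetical, not arroba's.

```python
from typing import Dict, NamedTuple, Set


class Diff(NamedTuple):
    created: Set[str]   # paths only in the new state
    updated: Set[str]   # paths in both, with a different CID
    deleted: Set[str]   # paths only in the old state


def diff_records(old: Dict[str, str], new: Dict[str, str]) -> Diff:
    """Compares two flattened path => CID maps."""
    created = set(new.keys() - old.keys())
    deleted = set(old.keys() - new.keys())
    updated = {path for path in new.keys() & old.keys()
               if new[path] != old[path]}
    return Diff(created, updated, deleted)
```

For getRepo with since, only the blocks behind the created and updated paths (plus the changed MST nodes) would need to go into the response.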

snarfed commented 3 months ago

I implemented this a simpler way, but it's not working. 😕 We're now serving most of these requests ok, but the relay doesn't seem to like the CARs we're serving. We're (theoretically) not including blocks that were originally generated in other repos before since (background in https://github.com/snarfed/bridgy-fed/issues/1016#issuecomment-2118374522 ). Maybe that's why? I suspect that's extremely rare, and I doubt we actually have many of those, if any, but I don't know for sure.

snarfed commented 3 months ago

https://atproto.tools/records isn't showing any recent records for these repos, which means it's the relay that's getting stuck on them, not the appview.

I'll try to grab someone from the Bluesky team and debug more with them.

snarfed commented 2 months ago

Not much luck there yet. A next idea here would be to try serving full getRepo responses for these repos specifically, maybe from the router service so they don't get deadlined. If that works, it'll be clear evidence that the since implementation is the problem.

snarfed commented 2 months ago

Going to try a different tack here and start tracking down invalid blocks and MST nodes I'm emitting. First up, reposts without subject, eg:

bafyreiaefu2zqyyj6zjpdy3wlnor7lujyugf4prirhegnevpxmk5f425sa
{
    "$type": "app.bsky.feed.repost",
    "createdAt": "2024-07-07T02:41:07.924Z",
    "subject": {
        "cid": "",
        "uri": ""
    }
}

snarfed commented 2 months ago

714c317dd473489b21ce38ea2117a960a791c4b7 is looking promising, haven't seen any more of those blank subject reposts since that was deployed. 🤞
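As an illustration of the kind of check that catches these, here's a minimal sketch that rejects a repost record whose subject is missing or has a blank cid or uri. This is not the contents of that commit; has_valid_subject is a hypothetical helper name.

```python
def has_valid_subject(record: dict) -> bool:
    """Returns True if the record has a subject with non-empty cid and uri."""
    subject = record.get('subject')
    if not isinstance(subject, dict):
        return False
    # empty strings, like in the bad repost above, are falsy
    return bool(subject.get('cid')) and bool(subject.get('uri'))
```

Running a check like this before writing an app.bsky.feed.repost record keeps the invalid block out of the repo in the first place, which is much cheaper than deleting it afterward.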

snarfed commented 2 months ago

Trying yet another new since implementation in snarfed/arroba@57210738a78b9bb08cf29f3f1a23607cc3168e1f. Looking good so far, serving somewhat faster and cheaper. Still hasn't unstuck any of the handful of stuck repos I'm watching here though.

snarfed commented 2 months ago

New since implementation looks good, not a huge win, but still a win, and more importantly it fixed a number of the smaller stuck repos. As for the small handful of bigger stuck repos, eg https://bsky.app/profile/breakingnews.newsmast.community.ap.brid.gy , I had to reset them manually, but they're up and running again now too.

snarfed commented 2 months ago

Here are my raw notes from the two ways I tried to fix the bigger stuck repos:

Recreate repos (worked)

# first, delete DNS record. then:

a = ActivityPub.get_by_id('https://newsmast.community/users/uspolitics')
a.enabled_protocols=[]; a.copies=[]; a.put(); a.obj.copies=[]; a.obj.put()
a.enable_protocol(ATProto)

import arroba.server
arroba.server.storage.tombstone_repo(arroba.server.storage.load_repo('did:plc:...'))

Delete bad repost records (didn't work)

from arroba.datastore_storage import AtpBlock, AtpRepo
from arroba.repo import Write
from arroba.storage import Action

did = 'did:plc:...'
repo = AtpRepo.get_by_id(did)

AtpBlock.query(AtpBlock.repo == repo.key, AtpBlock.ops.action == 'create').count()
blocks = AtpBlock.query(AtpBlock.repo == repo.key, AtpBlock.ops.action == 'create').fetch()

start = AtpBlock.query(AtpBlock.repo == repo.key, AtpBlock.ops.action == 'create',
                       AtpBlock.ops.path == 'app.bsky.feed.post/3kwi6acz2q7a2').get()
start.seq
AtpBlock.query(AtpBlock.repo == repo.key, AtpBlock.seq > start.seq).count()
blocks = AtpBlock.query(AtpBlock.repo == repo.key, AtpBlock.seq > start.seq).fetch()

rkeys = set()
for block in sorted(blocks, key=lambda b: b.seq):
  for op in block.ops:
    coll, rkey = op.path.split('/')
    if coll == 'app.bsky.feed.repost':
      if op.action == 'create':
        rkeys.add(rkey)
      elif op.action == 'delete':
        rkeys.discard(rkey)

import arroba.server
r = arroba.server.storage.load_repo(did)

# takes minutes or longer on big repos
contents = r.get_contents()
# or: r.mst.list(after='app.bsky.feed.repost', before='app.bsky.feed.reposu')

rkeys = [rkey for rkey, record in contents['app.bsky.feed.repost'].items()
         if not record['subject'].get('cid') or not record['subject'].get('uri')]

writes = [Write(action=Action.DELETE, collection='app.bsky.feed.repost', rkey=rkey)
          for rkey in rkeys]

# polyfill for itertools.batched, which was only added in Python 3.12
from itertools import islice
def batched(iterable, n):
    if n < 1:
        raise ValueError('n must be at least one')
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

# takes minutes or longer
for batch in batched(writes, 500):
  r.apply_writes(batch)

snarfed commented 2 months ago

I also reset:

I'd like to do these too, but they actually have followers, and I can't switch those on my own since they're in the followers' repos, which I can't write to. 😕