snarfed / arroba

Python implementation of Bluesky PDS and AT Protocol, including repo, MST, and sync XRPC methods
https://arroba.readthedocs.io
Creative Commons Zero v1.0 Universal

memory leak in `subscribeRepos` rollback window #39

Open snarfed opened 3 months ago

snarfed commented 3 months ago

atproto-hub hung itself just now. Evidently we made and emitted a ton of commits all of a sudden, >20qps sustained during 10:45-11:15a PT, so ~36k total. Sheesh.

snarfed commented 3 months ago

Restarting seems to have fixed it, but atproto-hub CPU is pegged at 100% working through the backlog right now, so we're not out of the woods just yet.

snarfed commented 3 months ago

So weird. I don't see a pattern in the usage spike yet. Mostly posts, from a range of users and AP instances and web sites. A few examples from 11:05-11:15a:

Doesn't look like we were backed up and then suddenly caught up either.

snarfed commented 3 months ago

Out of the woods, everything looks back to normal. Hrmph.

snarfed commented 2 months ago

Happened again just now due to the influx of Brazil users and usage.

snarfed commented 2 months ago

Seems like a memory leak. atproto-hub's memory footprint is constant when it's caught up, but increases linearly, quickly, when it's behind. ☹️ Not 100% sure if this is in our firehose server or our client. subscribeRepos clients are reconnecting often right now, every 1-10m, but we're consistently catching up from their cursor and then serving new commits in realtime, so I suspect the memory leak is in our client.
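
One way to narrow down where memory grows while the hub is behind is to diff two `tracemalloc` snapshots taken around the suspect code path. This is a minimal sketch using only the standard library; `measure` and `top_growth` are hypothetical helper names, not part of arroba, and tracing adds overhead, so it may only be practical on a test instance.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return up to `limit` source lines whose allocated size grew the most."""
    stats = after.compare_to(before, 'lineno')
    # compare_to sorts by absolute size change; keep only lines that grew.
    return [s for s in stats if s.size_diff > 0][:limit]


def measure(fn, limit=5):
    """Run fn() between two snapshots; return (result, growth statistics)."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = fn()
        after = tracemalloc.take_snapshot()
    finally:
        if started_here:
            tracemalloc.stop()
    return result, top_growth(before, after, limit)
```

Each returned entry is a `tracemalloc.StatisticDiff`; printing it shows the file, line, and size change, which should help tell server-side growth from client-side growth.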

snarfed commented 2 months ago

Related: snarfed/bridgy-fed#1295

snarfed commented 1 week ago

Haven't seen this since we optimized and switched from dag_cbor to libipld. Tentatively closing.

snarfed commented 1 day ago

Reopening, still happening. Only when we're behind serving events over our firehose, so it's hard to debug, but definitely happening right now. 😕

snarfed commented 1 day ago

Bumping hub memory up to 6G as a band-aid.

snarfed commented 1 day ago

Ugh, we're flapping.

snarfed commented 6 hours ago

I'm pretty confident this is in the rollback window part of subscribeRepos:

https://github.com/snarfed/arroba/blob/69846b5495a776647c36877968fa2942c80c2dce/arroba/xrpc_sync.py#L179-L189

https://github.com/snarfed/arroba/blob/69846b5495a776647c36877968fa2942c80c2dce/arroba/storage.py#L309-L325

https://github.com/snarfed/arroba/blob/69846b5495a776647c36877968fa2942c80c2dce/arroba/datastore_storage.py#L528-L554

Moving this issue to the arroba repo.

snarfed commented 6 hours ago

Recent example: two clients from the same IP connected to our subscribeRepos at the same time with a ~4h old cursor. We leaked memory while we were serving them events from the rollback window, and then reclaimed that memory as soon as we caught up and switched to live.
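
The two phases described here, catching up from the rollback window and then switching to live, can be sketched as a generator. This is an illustration only, not arroba's actual `subscribeRepos` code: `subscribe`, `stored_events`, and `live_events` are hypothetical names, and events are simplified to `(seq, payload)` pairs.

```python
def subscribe(stored_events, live_events, cursor):
    """Yield (seq, payload) pairs: the backlog after `cursor`, then live events.

    stored_events: iterable of (seq, payload) in ascending seq order, standing
      in for the rollback window read from storage (assumption).
    live_events: iterable of (seq, payload) for newly emitted commits.
    """
    last_seq = cursor

    # Phase 1: catch up from the rollback window, one event at a time.
    for seq, payload in stored_events:
        if seq > last_seq:
            yield seq, payload
            last_seq = seq

    # Phase 2: switch to live, skipping anything already served in phase 1.
    for seq, payload in live_events:
        if seq > last_seq:
            yield seq, payload
            last_seq = seq
```

Written this way, phase 1 holds only one event at a time. Any state that accumulates per event during phase 1 and is dropped when that loop ends would be consistent with memory growing while catching up and being reclaimed on the switch to live.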

snarfed commented 6 hours ago

I wonder if this is our tracking of seen CIDs in Storage.read_events_by_seq? Doesn't seem like that should be too big, just the CIDs of each emitted block in the rollback window, but that could still add up. Worth looking at.
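
To get a rough feel for how much a seen-CIDs set could cost, here is a small sketch. It is not the real `Storage.read_events_by_seq`; `read_events_dedup` and `seen_set_bytes` are hypothetical, and CIDs are assumed to be plain strings, whereas the real code may hold CID objects that are larger.

```python
import sys


def read_events_dedup(events):
    """Yield (seq, new_blocks), dropping blocks whose CID was already emitted.

    events: iterable of (seq, blocks), where blocks maps CID string -> bytes.
    """
    seen = set()  # grows by one entry per distinct block for the whole read
    for seq, blocks in events:
        new = {cid: data for cid, data in blocks.items() if cid not in seen}
        seen.update(new)
        yield seq, new


def seen_set_bytes(seen):
    """Shallow size of the set plus the shallow size of each entry, in bytes."""
    return sys.getsizeof(seen) + sum(sys.getsizeof(cid) for cid in seen)
```

For example, `seen_set_bytes({'bafyrei%052d' % i for i in range(36_000)})` measures a set of 36,000 fake 59-character CID strings. Short strings like these cost on the order of a hundred-odd bytes each on 64-bit CPython, so the set itself stays small unless the window holds very many blocks or each entry keeps a larger object alive.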