snarfed opened 3 months ago
Restarting seems to have fixed it, but atproto-hub CPU is pegged at 100% working through the backlog right now, so we're not out of the woods just yet.
So weird. I don't see a pattern in the usage spike yet. Mostly posts, from a range of users and AP instances and web sites. A few examples from 11:05-11:15a:
Doesn't look like we were backed up and then suddenly caught up either.
Out of the woods, everything looks back to normal. Hrmph.
Happened again just now due to the influx of Brazil users and usage.
Seems like a memory leak. atproto-hub's memory footprint is constant when it's caught up, but increases linearly, and quickly, when it's behind. ☹️ Not 100% sure whether this is in our firehose server or our client. `subscribeRepos` clients are reconnecting often right now, every 1-10m, but we're consistently catching up from their cursor and then serving new commits in realtime, so I suspect the memory leak is in our client.
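One way to narrow down which side is leaking is to diff allocation snapshots with Python's stdlib `tracemalloc` while the process is behind its cursor. A minimal sketch (the `work` callable is a stand-in for letting the client run for a while, not anything in our codebase):

```python
import tracemalloc

def top_growth(work, frames=25, limit=10):
    """Run work() and return the top allocation-growth sites, so a leak
    shows up as a specific file:line accumulating bytes."""
    tracemalloc.start(frames)  # keep stack frames for useful tracebacks
    baseline = tracemalloc.take_snapshot()
    work()  # e.g. let the subscribeRepos client catch up from its cursor
    current = tracemalloc.take_snapshot()
    stats = current.compare_to(baseline, 'lineno')  # sorted biggest first
    tracemalloc.stop()
    return stats[:limit]
```

If the top entries grow while behind and stop growing once caught up, that points at whatever state the catch-up path retains.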
Related: snarfed/bridgy-fed#1295
Haven't seen this since we optimized and switched from dag_cbor to libipld. Tentatively closing.
Reopening, still happening. Only when we're behind serving events over our firehose, so it's hard to debug, but definitely happening right now. 😕
Bumping hub memory up to 6G as a band-aid.
Ugh, we're flapping:
I'm pretty confident this is in the rollback window part of `subscribeRepos`:
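For context, the shape of that code path is roughly the sketch below: replay stored events from the client's cursor, then switch to tailing live commits. Any per-event state held during the catch-up phase lives until the switch, which is exactly where linear growth while behind would show up. `FakeStorage`, `Event`, and the queue here are illustrative stand-ins, not arroba's real API surface:

```python
import queue
from dataclasses import dataclass

@dataclass
class Event:
    seq: int

class FakeStorage:
    """Stand-in for event storage; returns events after a given cursor."""
    def __init__(self, events):
        self.events = events
    def read_events_by_seq(self, start):
        return [e for e in self.events if e.seq > start]

def serve_events(cursor, storage, live):
    # Phase 1: catch-up. State accumulated here (buffers, seen-CID sets)
    # is where a leak would grow linearly while we're behind.
    for event in storage.read_events_by_seq(start=cursor):
        cursor = event.seq
        yield event
    # Phase 2: live tailing; memory should be flat from here on. A real
    # server blocks for new commits; draining keeps this sketch runnable.
    while not live.empty():
        yield live.get_nowait()
```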
Moving this issue to the arroba repo.
Recent example: two clients from the same IP connected to our `subscribeRepos` at the same time with a ~4h-old cursor. We leaked memory while we were serving them events from the rollback window, then reclaimed that memory as soon as we caught up and switched to live.
I wonder if this is our tracking of seen CIDs in `Storage.read_events_by_seq`? Doesn't seem like that should be too big, just the CIDs of each emitted block in the rollback window, but that could still add up. Worth looking at.
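If that set turns out to be the culprit, one option is bounding it. A hypothetical sketch (this `BoundedSeen` class is not in arroba; it trades a small chance of re-emitting a very old block for flat memory):

```python
from collections import OrderedDict

class BoundedSeen:
    """Bounded replacement for an unbounded seen-CID set: remembers the
    most recent maxsize CIDs and evicts the oldest, LRU-style."""
    def __init__(self, maxsize=100_000):
        self.maxsize = maxsize
        self._seen = OrderedDict()

    def add(self, cid):
        """Return True if cid hasn't been seen (within the window)."""
        if cid in self._seen:
            self._seen.move_to_end(cid)  # refresh recency
            return False
        self._seen[cid] = None
        if len(self._seen) > self.maxsize:
            self._seen.popitem(last=False)  # evict oldest
        return True
```

With CIDs around 60 bytes each, even a few hundred thousand entries is tens of MB, so an unbounded set could plausibly account for the growth we're seeing during long rollback replays.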
atproto-hub hung itself just now. Evidently we made and emitted a ton of commits all of a sudden, >20qps sustained during 10:45-11:15a PT, so ~36k total. Sheesh.
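The volume estimate checks out, using round numbers from the window above:

```python
# >20 commits/sec sustained over 10:45-11:15a PT, ie 30 minutes
qps = 20
window_secs = 30 * 60   # 1,800 seconds
total = qps * window_secs
print(total)  # 36000, matching the ~36k estimate
```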