snarfed / bridgy-fed

🌉 A bridge between decentralized social network protocols
https://fed.brid.gy
Creative Commons Zero v1.0 Universal

Optimize costs #1149

Open snarfed opened 5 months ago

snarfed commented 5 months ago

Don't want to draw attention to this; I've been looking at it mostly behind the scenes, but I'd like to start tracking at least the investigation and work more publicly.

I expect there's plenty of low-hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.

snarfed commented 4 months ago

related: #1152 (but I expect outbox is a small fraction of overall cost)

snarfed commented 4 months ago

Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .

snarfed commented 4 months ago

Other leads:

snarfed commented 4 months ago

Got ~20h of datastore logs; digging into them in Log Analytics now: https://console.cloud.google.com/logs/analytics

snarfed commented 4 months ago

...maybe promising? E.g. this query breaks down calls by API method:

```sql
SELECT
  proto_payload.audit_log.method_name AS method_name,
  count(*) AS count
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
LIMIT 1000
```

and this one samples the actual contents of queries:

```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
LIMIT 200
```

I can't aggregate (GROUP BY) on fields inside proto_payload, though; I get `Grouping by expressions of type JSON is not allowed`. Next step: try BigQuery to see if it can get around that.

snarfed commented 4 months ago

It's in BigQuery, https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbridgy-federated!2slogs!3s_Default, let's see how that goes. 37M rows total.

snarfed commented 4 months ago

Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.

snarfed commented 3 months ago

Back to looking at this.

snarfed commented 3 months ago

Some useful datastore API usage analytics below, and they're already turning up useful results: queries are ~40% of datastore API calls, and the vast majority of those are looking up original Objects and users for a given copy.

Query:

```sql
SELECT
  proto_payload.audit_log.method_name AS method_name,
  count(*) AS count
FROM `bridgy-federated.logs._AllLogs`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
```

| method_name | count |
| --- | --- |
| Lookup | 4374961 |
| RunQuery | 3596350 |
| Commit | 548773 |
| BeginTx | 353552 |
| Rollback | 23279 |
| ... | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) AS kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) AS num_composite_filters,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
GROUP BY kind, num_composite_filters
ORDER BY count(*) DESC
```

| kind | num_composite_filters | count |
| --- | --- | --- |
| Object | | 767016 |
| ATProto | | 764103 |
| MagicKey | | 755585 |
| ActivityPub | | 754627 |
| UIProtocol | | 468614 |
| Follower | 2 | 49023 |
| AtpBlock | | 17479 |
| AtpRemoteBlob | | 6890 |
| AtpRepo | | 3566 |
| Follower | | 3147 |
| AtpRepo | 2 | 3108 |
| ATProto | 2 | 3107 |
| ... | | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) AS kind,
  FilterStr(proto_payload.audit_log.request.query.filter) AS filter,  -- FilterStr: custom helper UDF
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
  AND proto_payload.audit_log.request.query.filter.propertyFilter IS NOT NULL
GROUP BY kind, filter
ORDER BY count(*) DESC
```

| kind | filter | count |
| --- | --- | --- |
| ActivityPub | copies.uri EQUAL | 753344 |
| MagicKey | copies.uri EQUAL | 753290 |
| Object | copies.uri EQUAL | 753245 |
| ATProto | copies.uri EQUAL | 753243 |
| UIProtocol | copies.uri EQUAL | 468614 |
| AtpBlock | seq GREATER_THAN_OR_EQUAL | 17479 |
| Object | users EQUAL | 12535 |
| AtpRemoteBlob | cid EQUAL | 6890 |
| ATProto | handle EQUAL | 6516 |
| MagicKey | manual_opt_out EQUAL | 2295 |
| Follower | from EQUAL | 1575 |
| Follower | to EQUAL | 1572 |
| ATProto | enabled_protocols NOT_EQUAL | 1232 |
| ActivityPub | enabled_protocols NOT_EQUAL | 1231 |
| ... | | |
snarfed commented 3 months ago

^ should hopefully help with queries.

Here's the monitoring I'm watching: https://console.cloud.google.com/apis/api/datastore.googleapis.com/metrics?project=bridgy-federated&pageState=%28%22duration%22%3A%28%22groupValue%22%3A%22PT1H%22%2C%22customValue%22%3Anull%29%29

snarfed commented 3 months ago

Now looking at lookups, aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

Query:

```sql
SELECT
  array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) AS num_keys,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
GROUP BY num_keys
ORDER BY count(*) DESC
```

| num_keys | count |
| --- | --- |
| 1 | 4371422 |
| 2 | 999 |
| 3 | 519 |
| 4 | 365 |
| 5 | 229 |
| 6 | 171 |
| 7 | 101 |
| 8 | 83 |
| 9 | 74 |
| 100 | 68 |
| 12 | 67 |
| ... | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.keys[0].path[0].kind) AS kind,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
GROUP BY kind
ORDER BY count(*) DESC
```

| kind | count |
| --- | --- |
| Object | 4080965 |
| ATProto | 112141 |
| AtpBlock | 97120 |
| MagicKey | 38829 |
| ActivityPub | 29996 |
| AtpRepo | 7222 |
| AtpSequence | 3551 |
| AtpRemoteBlob | 1574 |
| Cursor | 24 |
Object lookups by id scheme:

Query:

```sql
SELECT
  split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] AS scheme,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
GROUP BY scheme
ORDER BY count(*) DESC
```

| scheme | count |
| --- | --- |
| did | 3434021 |
| at | 413000 |
| https | 235446 |
| http | 85 |
| ... | |
snarfed commented 3 months ago

I don't understand why the DID Object lookups aren't using memcache. The ndb contexts in all three services seem to have it configured right. Here are the top DIDs by number of lookups, in just a ~9h window; almost all of them should have been cached. Why weren't they?

Query:

```sql
SELECT
  JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) AS name,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  AND split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
GROUP BY name
ORDER BY count(*) DESC
LIMIT 100
```

| name | count |
| --- | --- |
| did:plc:ayutykgvyf4x7ev5ornltyzz | 176211 |
| did:plc:4mjwxpnhoeaknxqabwhf2n6i | 64569 |
| did:plc:p2ygpwluon3vrk5yecjq7wc5 | 56364 |
| did:plc:hm5cxb2g2q4ma4ucsks73tex | 53051 |
| did:plc:b5glolrcdfnaxwc4dbe4zbty | 43557 |
| did:plc:3rxcq5wacdop5thjoh3sny3p | 35103 |
| did:plc:6zfgmvghpidjvcn3cqtitnx5 | 28309 |
| did:plc:5syae7p7dr6zsfdoe4k6buky | 26882 |
| did:plc:dqgdfku26vpxdktkzne5x2xj | 26240 |
| did:plc:sese4fb6luojywziva3s7zjo | 24876 |
| ... | |
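
For reference, here's a minimal sketch of how I'd expect a memcached-backed ndb global cache to be wired up, using google-cloud-ndb and pymemcache. The host is a placeholder, not our actual config:

```python
# Minimal sketch (not Bridgy Fed's actual code): back ndb's global cache
# with memcached so that repeated lookups, eg DID doc Objects, are served
# from cache instead of the datastore.
from google.cloud import ndb
from google.cloud.ndb.global_cache import MemcacheCache
from pymemcache.client.base import PooledClient

memcache = PooledClient('10.0.0.1:11211')  # placeholder memcached host
client = ndb.Client()

with client.context(global_cache=MemcacheCache(memcache)):
    # After the first fetch, this lookup should hit memcached, not datastore.
    obj = ndb.Key('Object', 'did:plc:ayutykgvyf4x7ev5ornltyzz').get()
```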
snarfed commented 3 months ago

Progress here! Managed to cut datastore lookups and queries both way down, 5-10x each.

snarfed commented 3 months ago

Now looking at optimizing log storage. We were doing 250G/mo a while back; we're now down to ~150G/mo or so.

The main cost here is initial ingest: we get 50G/mo free, then $0.50/G after that, so ~150G/mo runs about (150 - 50) * $0.50 ≈ $50/mo. Ingest pricing includes 30d of retention, and our retention period is set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing

Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging

https://console.cloud.google.com/monitoring/dashboards/builder/24c22d42-91d8-4feb-aa6b-99dbb84c6417;duration=PT8H?project=bridgy-federated

snarfed commented 3 months ago

The next cost to look at is CPU. router is currently on four cores, atproto-hub on one; we should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs

snarfed commented 3 months ago

Next step for datastore reads: https://github.com/snarfed/arroba/issues/30

snarfed commented 3 months ago

Results from the datastore reads optimization here have been disappointing: I cut reads by ~3x, but spend on datastore read API calls only went down by maybe 1/4-1/3, from $10-12/day to $7-8/day. Hrmph.

[daily cost breakdown charts]

(Datastore reads are the top blue part of the cost bars above.)

snarfed commented 3 months ago

Another opportunity here could be reducing router's and atproto-hub's allocated CPU. Router CPU is down to mostly under two cores, atproto-hub to maybe half a core. We could either drop router from 4 cores to 2, or leave it at 4 and merge atproto-hub into it.

snarfed commented 3 months ago

^ The difficulty with merging router and atproto-hub is that we currently run four WSGI workers for router, and we'd want to run the atproto-hub threads on just one of them. I had trouble running all the threads in a single router worker a while back, but that may have been a red herring; it may actually work ok after all, and I doubt we're compute-bound in a way that would make the GIL matter. Worth trying.
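
One hedged way to pick that single worker (a sketch, not what we run today): elect a worker with an exclusive file lock at startup, and only start the atproto-hub threads in the worker that wins:

```python
# Sketch only: let exactly one WSGI worker start the atproto-hub threads,
# using a non-blocking exclusive file lock as the election mechanism.
import fcntl
import threading

def run_hub():
    ...  # placeholder for the atproto-hub subscription/event loops

lock_file = open('/tmp/atproto-hub.lock', 'w')
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    pass  # another worker won the election; this one serves web traffic only
else:
    threading.Thread(target=run_hub, daemon=True).start()
```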

snarfed commented 3 months ago

Another idea here: stop storing and fetching transient activities (create, update, delete, undo, accept/reject, etc.) in the datastore entirely, and just use memcache for them. A rough sketch below.
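
What that could look like (hypothetical names; the TTL and serialization are assumptions, not our actual code):

```python
# Sketch: don't persist transient activities at all; cache them in
# memcached with a TTL instead. All names here are illustrative.
import json
from pymemcache.client.base import PooledClient

TRANSIENT_TYPES = {'create', 'update', 'delete', 'undo', 'accept', 'reject'}
TRANSIENT_EXPIRE_S = 60 * 60 * 24  # assumed: one day is plenty for transients

memcache = PooledClient('10.0.0.1:11211')  # placeholder host

def store_object(activity):
    ...  # placeholder for the existing datastore write path

def store_activity(activity):
    """Caches transient activities in memcache; persists everything else."""
    if activity.get('verb') in TRANSIENT_TYPES:
        memcache.set(activity['id'], json.dumps(activity),
                     expire=TRANSIENT_EXPIRE_S)
    else:
        store_object(activity)
```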

snarfed commented 3 months ago

...but looking at our incoming activity types, that might not make a big difference. The bulk of incoming activities are likes, follows, and reposts, in that order.

Measure first, then optimize.

[chart: incoming activity types]

(I should double check to make sure that count isn't after various filtering though!)

snarfed commented 3 months ago

Interesting, temporarily disabling AP outboxes for #1248 cut both our datastore read and frontend instance costs in half:

[cost chart]
snarfed commented 3 months ago

The drop in datastore read cost due to outboxes was mostly queries, as we'd expect.

qazmlp commented 3 months ago

Out of curiosity, do you have logs on how much of that was due to repeat/failed fetching? Failing to get a user could probably cause an associated post to not be persisted either.

snarfed commented 3 months ago

This specific cost is our datastore, ie database. Not related to external fetches. (Which we don't really do during outbox rendering anyway.)

qazmlp commented 3 months ago

> This specific cost is our datastore, ie database. Not related to external fetches. (Which we don't really do during outbox rendering anyway.)

Sorry, that was unclear.

I meant incoming fetches done by other instances, i.e. when they discover objects referenced in the outbox.

snarfed commented 3 months ago

Ah! I haven't looked specifically or tried to correlate, but all the drop here was in queries, and object fetches only do lookups.

snarfed commented 2 months ago

Starting to look at this again...

snarfed commented 2 months ago

Re-running analytics from https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2259178748 :

Datastore API calls by method, lookups by kind, queries by kind, and queries by filter:

[four screenshots of query results]

...somehow we're back to overdoing copy queries again?!

snarfed commented 2 months ago

Object lookups by type:

[screenshot]
snarfed commented 2 months ago

Lookups by key. We're still not caching DID docs anywhere near enough, like last time in https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2261383697 .

[screenshots: lookups by key]
snarfed commented 2 months ago

Improving copies query caching got us a decent cut in RunQuery load, at 9/19 ~9pm on this graph. Next step there may be to memcache it (a sketch follows below).

[RunQuery rate graph]
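
Sketch of that memcache step (hypothetical helper; Bridgy Fed's real models and cache wiring will differ):

```python
# Sketch: cache the copies.uri -> original entity key mapping in memcached
# so repeated copy lookups skip the datastore query. Illustrative only.
from google.cloud import ndb
from pymemcache.client.base import PooledClient

memcache = PooledClient('10.0.0.1:11211')  # placeholder host

def get_original(model_class, copy_uri, expire_s=3600):
    cache_key = f'copy-{model_class.__name__}-{copy_uri}'
    cached = memcache.get(cache_key)
    if cached:
        return ndb.Key(urlsafe=cached).get()

    orig = model_class.query(model_class.copies.uri == copy_uri).get()
    if orig:
        memcache.set(cache_key, orig.key.urlsafe(), expire=expire_s)
    return orig
```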
snarfed commented 2 months ago

Still confused why lookups for DID docs, at:// objects, etc. clearly aren't getting cached enough, or at all: https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2362264375

snarfed commented 2 months ago

#1329 would help here too if it pans out.

snarfed commented 2 months ago

Big win from https://github.com/snarfed/bridgy-fed/issues/1329#issuecomment-2364716225 here!

snarfed commented 1 month ago

#1354 reduced datastore load some:

[datastore load chart]

Next step: stop storing non-object activities like CRUDs.

...and maybe also switch send tasks to pass their data in the task HTTP payload? Not sure yet.
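
If we do try passing data in the task payload, the Cloud Tasks side might look roughly like this (sketch; the queue name, region, and handler path are made up):

```python
# Sketch: enqueue a send task with the activity JSON in the HTTP body so
# the handler doesn't have to re-fetch it from the datastore.
import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path('bridgy-federated', 'us-central1', 'send')  # assumed queue

def enqueue_send(activity, target):
    task = {
        'app_engine_http_request': {
            'http_method': tasks_v2.HttpMethod.POST,
            'relative_uri': '/queue/send',  # hypothetical handler
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'activity': activity,
                                'target': target}).encode(),
        },
    }
    client.create_task(parent=parent, task=task)
```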

snarfed commented 2 days ago

Another idea here: inline send into receive when we have few enough targets, maybe <5 or so.
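
Roughly (a sketch with hypothetical names; the threshold is a guess):

```python
# Sketch of the idea above: deliver inline when the target list is small,
# otherwise fan out to send tasks as usual.
INLINE_SEND_MAX_TARGETS = 5  # assumed cutoff

def send(activity, target):
    ...  # placeholder: actual delivery to one target

def enqueue_send(activity, target):
    ...  # placeholder: existing task-queue path

def receive(activity, targets):
    if len(targets) < INLINE_SEND_MAX_TARGETS:
        for target in targets:
            send(activity, target)  # inline: skips a task + extra datastore reads
    else:
        for target in targets:
            enqueue_send(activity, target)
```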