snarfed / bridgy-fed

🌉 A bridge between decentralized social network protocols
https://fed.brid.gy
Creative Commons Zero v1.0 Universal

Optimize costs #1149

Open snarfed opened 5 months ago

snarfed commented 5 months ago

Don't want to draw attention to this; I've been looking at it mostly behind the scenes, but I'd like to start tracking at least the investigation and work more publicly.

I expect there's plenty of low-hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.

snarfed commented 4 months ago

related: #1152 (but I expect outbox is a small fraction of overall cost)

snarfed commented 4 months ago

Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .

snarfed commented 4 months ago

Other leads:

snarfed commented 4 months ago

Got ~20h of datastore logs; digging into them in Log Analytics now: https://console.cloud.google.com/logs/analytics

snarfed commented 4 months ago

...maybe promising? E.g. this query breaks down calls by API method:

```sql
SELECT
  proto_payload.audit_log.method_name AS method_name,
  count(*) AS count
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
LIMIT 1000
```

and this one samples the actual contents of queries:

```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
LIMIT 200
```

I can't aggregate (GROUP BY) on fields inside proto_payload, though; I get `Grouping by expressions of type JSON is not allowed`. Next step: try BigQuery to see if it can get around that.

snarfed commented 4 months ago

It's in BigQuery, https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbridgy-federated!2slogs!3s_Default, let's see how that goes. 37M rows total.

snarfed commented 4 months ago

Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.

snarfed commented 3 months ago

Back to looking at this.

snarfed commented 3 months ago

Some useful datastore API usage analytics below, and they're already turning up useful results: queries are ~40% of datastore API calls, and the vast majority of those are looking up original Objects and users for a given copy.

Query:

```sql
SELECT
  proto_payload.audit_log.method_name AS method_name,
  count(*) AS count
FROM `bridgy-federated.logs._AllLogs`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
```

| method_name | count |
| --- | --- |
| Lookup | 4374961 |
| RunQuery | 3596350 |
| Commit | 548773 |
| BeginTx | 353552 |
| Rollback | 23279 |
| ... | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) AS kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) AS num_composite_filters,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
GROUP BY kind, num_composite_filters
ORDER BY count(*) DESC
```

| kind | num_composite_filters | count |
| --- | --- | --- |
| Object | | 767016 |
| ATProto | | 764103 |
| MagicKey | | 755585 |
| ActivityPub | | 754627 |
| UIProtocol | | 468614 |
| Follower | 2 | 49023 |
| AtpBlock | | 17479 |
| AtpRemoteBlob | | 6890 |
| AtpRepo | | 3566 |
| Follower | | 3147 |
| AtpRepo | 2 | 3108 |
| ATProto | 2 | 3107 |
| ... | | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) AS kind,
  FilterStr(proto_payload.audit_log.request.query.filter) AS filter,  -- FilterStr: custom helper UDF
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
  AND proto_payload.audit_log.request.query.filter.propertyFilter IS NOT NULL
GROUP BY kind, filter
ORDER BY count(*) DESC
```

| kind | filter | count |
| --- | --- | --- |
| ActivityPub | copies.uri EQUAL | 753344 |
| MagicKey | copies.uri EQUAL | 753290 |
| Object | copies.uri EQUAL | 753245 |
| ATProto | copies.uri EQUAL | 753243 |
| UIProtocol | copies.uri EQUAL | 468614 |
| AtpBlock | seq GREATER_THAN_OR_EQUAL | 17479 |
| Object | users EQUAL | 12535 |
| AtpRemoteBlob | cid EQUAL | 6890 |
| ATProto | handle EQUAL | 6516 |
| MagicKey | manual_opt_out EQUAL | 2295 |
| Follower | from EQUAL | 1575 |
| Follower | to EQUAL | 1572 |
| ATProto | enabled_protocols NOT_EQUAL | 1232 |
| ActivityPub | enabled_protocols NOT_EQUAL | 1231 |
| ... | | |
snarfed commented 3 months ago

^ should hopefully help with queries.

Here's the monitoring I'm watching: https://console.cloud.google.com/apis/api/datastore.googleapis.com/metrics?project=bridgy-federated&pageState=%28%22duration%22%3A%28%22groupValue%22%3A%22PT1H%22%2C%22customValue%22%3Anull%29%29

snarfed commented 3 months ago

Now looking at lookups, aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

Query:

```sql
SELECT
  array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) AS num_keys,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
GROUP BY num_keys
ORDER BY count(*) DESC
```

| num_keys | count |
| --- | --- |
| 1 | 4371422 |
| 2 | 999 |
| 3 | 519 |
| 4 | 365 |
| 5 | 229 |
| 6 | 171 |
| 7 | 101 |
| 8 | 83 |
| 9 | 74 |
| 100 | 68 |
| 12 | 67 |
| ... | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.keys[0].path[0].kind) AS kind,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
GROUP BY kind
ORDER BY count(*) DESC
```

| kind | count |
| --- | --- |
| Object | 4080965 |
| ATProto | 112141 |
| AtpBlock | 97120 |
| MagicKey | 38829 |
| ActivityPub | 29996 |
| AtpRepo | 7222 |
| AtpSequence | 3551 |
| AtpRemoteBlob | 1574 |
| Cursor | 24 |
Object lookups by id scheme:

Query:

```sql
SELECT
  split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] AS scheme,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
GROUP BY scheme
ORDER BY count(*) DESC
```

| scheme | count |
| --- | --- |
| did | 3434021 |
| at | 413000 |
| https | 235446 |
| http | 85 |
| ... | |
snarfed commented 3 months ago

I don't understand why the DID Object lookups aren't using memcache. The ndb contexts in all three services seem to have it configured right. Here are the top DIDs by number of lookups, in just a ~9h window; almost all of them should have been cached. Why weren't they?

Query:

```sql
SELECT
  JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) AS name,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  AND split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
GROUP BY name
ORDER BY count(*) DESC
LIMIT 100
```

| name | count |
| --- | --- |
| did:plc:ayutykgvyf4x7ev5ornltyzz | 176211 |
| did:plc:4mjwxpnhoeaknxqabwhf2n6i | 64569 |
| did:plc:p2ygpwluon3vrk5yecjq7wc5 | 56364 |
| did:plc:hm5cxb2g2q4ma4ucsks73tex | 53051 |
| did:plc:b5glolrcdfnaxwc4dbe4zbty | 43557 |
| did:plc:3rxcq5wacdop5thjoh3sny3p | 35103 |
| did:plc:6zfgmvghpidjvcn3cqtitnx5 | 28309 |
| did:plc:5syae7p7dr6zsfdoe4k6buky | 26882 |
| did:plc:dqgdfku26vpxdktkzne5x2xj | 26240 |
| did:plc:sese4fb6luojywziva3s7zjo | 24876 |
| ... | |
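
For reference, here's a minimal sketch of how I'd expect a memcached-backed ndb global cache to be wired up, using google-cloud-ndb and pymemcache. The host is a placeholder, not our actual config:

```python
# Minimal sketch (not Bridgy Fed's actual code): back ndb's global cache
# with memcached so that repeated lookups, eg DID doc Objects, are served
# from cache instead of the datastore.
from google.cloud import ndb
from google.cloud.ndb.global_cache import MemcacheCache
from pymemcache.client.base import PooledClient

memcache = PooledClient('10.0.0.1:11211')  # placeholder memcached host
client = ndb.Client()

with client.context(global_cache=MemcacheCache(memcache)):
    # After the first fetch, this lookup should hit memcached, not datastore.
    obj = ndb.Key('Object', 'did:plc:ayutykgvyf4x7ev5ornltyzz').get()
```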
snarfed commented 3 months ago

Progress here! Managed to cut datastore lookups and queries both way down, 5-10x each.

snarfed commented 3 months ago

Now looking at optimizing log storage. We were doing 250G/mo a while back; we're now down to ~150G/mo or so.

The main cost here is initial ingest: we get 50G/mo free, then $0.50/G after that, so ~150G/mo runs about (150 - 50) * $0.50 ≈ $50/mo. Ingest pricing includes 30d of retention, and our retention period is set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing

Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging

https://console.cloud.google.com/monitoring/dashboards/builder/24c22d42-91d8-4feb-aa6b-99dbb84c6417;duration=PT8H?project=bridgy-federated

snarfed commented 3 months ago

The next cost to look at is CPU. router is currently on four cores, atproto-hub on one; we should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs

snarfed commented 3 months ago

Next step for datastore reads: https://github.com/snarfed/arroba/issues/30

snarfed commented 3 months ago

Results from the datastore reads optimization here have been disappointing: I cut reads by ~3x, but spend on datastore read API calls only went down by maybe 1/4-1/3, from $10-12/day to $7-8/day. Hrmph.

[daily cost breakdown charts]

(Datastore reads are the top blue part of the cost bars above.)

snarfed commented 3 months ago

Another opportunity here could be reducing router's and atproto-hub's allocated CPU. Router CPU is down to mostly under two cores, atproto-hub to maybe half a core. We could either drop router from 4 cores to 2, or leave it at 4 and merge atproto-hub into it.

snarfed commented 3 months ago

^ The difficulty with merging router and atproto-hub is that we currently run four WSGI workers for router, and we'd want to run the atproto-hub threads on just one of them. I had trouble running all the threads in a single router worker a while back, but that may have been a red herring; it may actually work ok after all, and I doubt we're compute-bound in a way that would make the GIL matter. Worth trying.
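
One hedged way to pick that single worker (a sketch, not what we run today): elect a worker with an exclusive file lock at startup, and only start the atproto-hub threads in the worker that wins:

```python
# Sketch only: let exactly one WSGI worker start the atproto-hub threads,
# using a non-blocking exclusive file lock as the election mechanism.
import fcntl
import threading

def run_hub():
    ...  # placeholder for the atproto-hub subscription/event loops

lock_file = open('/tmp/atproto-hub.lock', 'w')
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    pass  # another worker won the election; this one serves web traffic only
else:
    threading.Thread(target=run_hub, daemon=True).start()
```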

snarfed commented 3 months ago

Another idea here: stop storing and fetching transient activities (create, update, delete, undo, accept/reject, etc.) in the datastore entirely, and just use memcache for them. A rough sketch below.
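
What that could look like (hypothetical names; the TTL and serialization are assumptions, not our actual code):

```python
# Sketch: don't persist transient activities at all; cache them in
# memcached with a TTL instead. All names here are illustrative.
import json
from pymemcache.client.base import PooledClient

TRANSIENT_TYPES = {'create', 'update', 'delete', 'undo', 'accept', 'reject'}
TRANSIENT_EXPIRE_S = 60 * 60 * 24  # assumed: one day is plenty for transients

memcache = PooledClient('10.0.0.1:11211')  # placeholder host

def store_object(activity):
    ...  # placeholder for the existing datastore write path

def store_activity(activity):
    """Caches transient activities in memcache; persists everything else."""
    if activity.get('verb') in TRANSIENT_TYPES:
        memcache.set(activity['id'], json.dumps(activity),
                     expire=TRANSIENT_EXPIRE_S)
    else:
        store_object(activity)
```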

snarfed commented 3 months ago

...but looking at our incoming activity types, that might not make a big difference. The bulk of incoming activities are likes, follows, and reposts, in that order.

Measure first, then optimize.

[chart: incoming activity types]

(I should double check to make sure that count isn't after various filtering though!)

snarfed commented 3 months ago

Interesting, temporarily disabling AP outboxes for #1248 cut both our datastore read and frontend instance costs in half:

[cost chart]
snarfed commented 3 months ago

The drop in datastore read cost due to outboxes was mostly queries, as we'd expect.

qazmlp commented 3 months ago

Out of curiosity, do you have logs on how much of that was due to repeat/failed fetching? Failing to get a user could probably cause an associated post to not be persisted either.

snarfed commented 3 months ago

This specific cost is our datastore, ie database. Not related to external fetches. (Which we don't really do during outbox rendering anyway.)

qazmlp commented 3 months ago

> This specific cost is our datastore, ie database. Not related to external fetches. (Which we don't really do during outbox rendering anyway.)

Sorry, that was unclear.

I meant incoming fetches done by other instances, i.e. when they discover objects referenced in the outbox.

snarfed commented 3 months ago

Ah! I haven't looked specifically or tried to correlate, but all the drop here was in queries, and object fetches only do lookups.

snarfed commented 2 months ago

Starting to look at this again...

snarfed commented 2 months ago

Re-running analytics from https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2259178748 :

Datastore API calls by method, lookups by kind, queries by kind, and queries by filter:

[four screenshots of query results]

...somehow we're back to overdoing copy queries again?!

snarfed commented 2 months ago

Object lookups by type:

[screenshot]
snarfed commented 2 months ago

Lookups by key. We're still not caching DID docs anywhere near enough, like last time in https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2261383697 .

[screenshots: lookups by key]
snarfed commented 2 months ago

Improving copies query caching got us a decent cut in RunQuery load, at 9/19 ~9pm on this graph. Next step there may be to memcache it (a sketch follows below).

[RunQuery rate graph]
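
Sketch of that memcache step (hypothetical helper; Bridgy Fed's real models and cache wiring will differ):

```python
# Sketch: cache the copies.uri -> original entity key mapping in memcached
# so repeated copy lookups skip the datastore query. Illustrative only.
from google.cloud import ndb
from pymemcache.client.base import PooledClient

memcache = PooledClient('10.0.0.1:11211')  # placeholder host

def get_original(model_class, copy_uri, expire_s=3600):
    cache_key = f'copy-{model_class.__name__}-{copy_uri}'
    cached = memcache.get(cache_key)
    if cached:
        return ndb.Key(urlsafe=cached).get()

    orig = model_class.query(model_class.copies.uri == copy_uri).get()
    if orig:
        memcache.set(cache_key, orig.key.urlsafe(), expire=expire_s)
    return orig
```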
snarfed commented 2 months ago

Still confused why lookups for DID docs, at:// objects, etc. clearly aren't getting cached enough, or at all: https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2362264375

snarfed commented 2 months ago

#1329 would help here too if it pans out.

snarfed commented 2 months ago

Big win from https://github.com/snarfed/bridgy-fed/issues/1329#issuecomment-2364716225 here!

snarfed commented 1 month ago

#1354 reduced datastore load some:

[datastore load chart]

Next step: stop storing non-object activities like CRUDs.

...and maybe also switch send tasks to pass their data in the task HTTP payload? Not sure yet.
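
If we do try passing data in the task payload, the Cloud Tasks side might look roughly like this (sketch; the queue name, region, and handler path are made up):

```python
# Sketch: enqueue a send task with the activity JSON in the HTTP body so
# the handler doesn't have to re-fetch it from the datastore.
import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path('bridgy-federated', 'us-central1', 'send')  # assumed queue

def enqueue_send(activity, target):
    task = {
        'app_engine_http_request': {
            'http_method': tasks_v2.HttpMethod.POST,
            'relative_uri': '/queue/send',  # hypothetical handler
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'activity': activity,
                                'target': target}).encode(),
        },
    }
    client.create_task(parent=parent, task=task)
```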

snarfed commented 2 days ago

Another idea here: inline send into receive when we have few enough targets, maybe <5 or so.
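
Roughly (a sketch with hypothetical names; the threshold is a guess):

```python
# Sketch of the idea above: deliver inline when the target list is small,
# otherwise fan out to send tasks as usual.
INLINE_SEND_MAX_TARGETS = 5  # assumed cutoff

def send(activity, target):
    ...  # placeholder: actual delivery to one target

def enqueue_send(activity, target):
    ...  # placeholder: existing task-queue path

def receive(activity, targets):
    if len(targets) < INLINE_SEND_MAX_TARGETS:
        for target in targets:
            send(activity, target)  # inline: skips a task + extra datastore reads
    else:
        for target in targets:
            enqueue_send(activity, target)
```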