snarfed opened 5 months ago
related: #1152 (but I expect outbox is a small fraction of overall cost)
Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .
Other leads:
Got ~20h of datastore logs, digging into them in log analytics now. https://console.cloud.google.com/logs/analytics
...maybe promising? Eg this query breaks down by API method:
```sql
SELECT DISTINCT
  proto_payload.audit_log.method_name as method_name,
  count(*) as count
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
LIMIT 1000
```
and this one samples the actual contents of queries:
```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
LIMIT 200
```
I can't aggregate (group by) fields inside proto_payload though; I get `Grouping by expressions of type JSON is not allowed`. Next step: try BigQuery to see if it can get around that.
It's in BigQuery, https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbridgy-federated!2slogs!3s_Default, let's see how that goes. 37M rows total.
Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.
Back to looking at this.
Some useful datastore API usage analytics below, and I'm already finding useful results. Queries are ~40% of datastore API calls, and the vast majority of those are looking up original `Object`s and users for a given copy.
```sql
SELECT DISTINCT proto_payload.audit_log.method_name as method_name, count(*) as count
FROM `bridgy-federated.logs._AllLogs`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
```
method_name | count |
---|---|
Lookup | 4374961 |
RunQuery | 3596350 |
Commit | 548773 |
BeginTx | 353552 |
Rollback | 23279 |
... |
```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) as num_composite_filters,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
GROUP BY kind, num_composite_filters
ORDER BY count(*) DESC
```
kind | num_composite_filters | count |
---|---|---|
Object | | 767016 |
ATProto | | 764103 |
MagicKey | | 755585 |
ActivityPub | | 754627 |
UIProtocol | | 468614 |
Follower | 2 | 49023 |
AtpBlock | | 17479 |
AtpRemoteBlob | | 6890 |
AtpRepo | | 3566 |
Follower | | 3147 |
AtpRepo | 2 | 3108 |
ATProto | 2 | 3107 |
... | | |
```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  FilterStr(proto_payload.audit_log.request.query.filter) as filter,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
  AND proto_payload.audit_log.request.query.filter.propertyFilter is not null
GROUP BY kind, filter
ORDER BY count(*) DESC
```
kind | filter | count |
---|---|---|
ActivityPub | copies.uri EQUAL | 753344 |
MagicKey | copies.uri EQUAL | 753290 |
Object | copies.uri EQUAL | 753245 |
ATProto | copies.uri EQUAL | 753243 |
UIProtocol | copies.uri EQUAL | 468614 |
AtpBlock | seq GREATER_THAN_OR_EQUAL | 17479 |
Object | users EQUAL | 12535 |
AtpRemoteBlob | cid EQUAL | 6890 |
ATProto | handle EQUAL | 6516 |
MagicKey | manual_opt_out EQ | 2295 |
Follower | from EQUAL | 1575 |
Follower | to EQUAL | 1572 |
ATProto | enabled_protocols NOT_EQUAL | 1232 |
ActivityPub | enabled_protocols NOT_EQUAL | 1231 |
... |
^ should hopefully help with queries.
Here's the monitoring I'm watching: https://console.cloud.google.com/apis/api/datastore.googleapis.com/metrics?project=bridgy-federated&pageState=%28%22duration%22%3A%28%22groupValue%22%3A%22PT1H%22%2C%22customValue%22%3Anull%29%29
Now looking at lookups aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.
```sql
SELECT
  array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) as num_keys,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
GROUP BY num_keys
ORDER BY count(*) DESC
```
num_keys | count |
---|---|
1 | 4371422 |
2 | 999 |
3 | 519 |
4 | 365 |
5 | 229 |
6 | 171 |
7 | 101 |
8 | 83 |
9 | 74 |
100 | 68 |
12 | 67 |
... |
```sql
SELECT
  string(proto_payload.audit_log.request.keys[0].path[0].kind) as kind,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
GROUP BY kind
ORDER BY count(*) DESC
```
kind | count |
---|---|
Object | 4080965 |
ATProto | 112141 |
AtpBlock | 97120 |
MagicKey | 38829 |
ActivityPub | 29996 |
AtpRepo | 7222 |
AtpSequence | 3551 |
AtpRemoteBlob | 1574 |
Cursor | 24 |
Object lookups by id scheme:
```sql
SELECT
  split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] as scheme,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
GROUP BY scheme
ORDER BY count(*) DESC
```
scheme | count |
---|---|
did | 3434021 |
at | 413000 |
https | 235446 |
http | 85 |
... |
I don't understand why the DID `Object` lookups aren't using memcache. The ndb contexts in all three services seem to have it configured right. Here are the top DIDs by number of lookups, in just a ~9h window. Almost all of them should have been cached. Why weren't they?
```sql
SELECT
  JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) as name,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  AND split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
GROUP BY name
ORDER BY count(*) DESC
LIMIT 100
```
name | count |
---|---|
did:plc:ayutykgvyf4x7ev5ornltyzz | 176211 |
did:plc:4mjwxpnhoeaknxqabwhf2n6i | 64569 |
did:plc:p2ygpwluon3vrk5yecjq7wc5 | 56364 |
did:plc:hm5cxb2g2q4ma4ucsks73tex | 53051 |
did:plc:b5glolrcdfnaxwc4dbe4zbty | 43557 |
did:plc:3rxcq5wacdop5thjoh3sny3p | 35103 |
did:plc:6zfgmvghpidjvcn3cqtitnx5 | 28309 |
did:plc:5syae7p7dr6zsfdoe4k6buky | 26882 |
did:plc:dqgdfku26vpxdktkzne5x2xj | 26240 |
did:plc:sese4fb6luojywziva3s7zjo | 24876 |
... |
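For reference, here's roughly how the ndb global cache is supposed to be wired up, as a minimal sketch assuming google-cloud-ndb plus pymemcache (the memcache address and helper name are placeholders, not our actual code). If any code path runs gets outside a context like this, or a cache policy opts the kind out, those lookups fall straight through to the datastore:

```python
from google.cloud import ndb
from google.cloud.ndb import global_cache
from pymemcache.client.base import PooledClient

# Placeholder address; in production this is the memcached instance's IP.
memcache = PooledClient(('10.0.0.1', 11211), timeout=10, connect_timeout=10)
cache = global_cache.MemcacheCache(memcache)

ndb_client = ndb.Client()

def ndb_context(fn):
    """Run fn inside an ndb context with the memcache global cache attached."""
    with ndb_client.context(
            global_cache=cache,
            # None means the default policies, which cache all kinds
            global_cache_policy=None,
            global_cache_timeout_policy=None):
        return fn()
```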
Progress here! Managed to cut datastore lookups and queries both way down, 5-10x each.
Now looking at optimizing log storage. We were doing 250 GB/mo a little while ago; we're now down to ~150 GB/mo.
The main cost here is initial ingest: we get 50 GB/mo for free, then $0.50/GB after that. That includes 30d of retention, and our retention period is currently set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing
Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging
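One app-side lever worth considering, assuming a chunk of the volume is our own application logs rather than request logs (the module names below are examples, not measured): drop low-severity records before they ever reach the Cloud Logging handler, so they're never ingested or billed.

```python
import logging

class DropNoisyDebug(logging.Filter):
    """Drop DEBUG records from especially chatty libraries."""
    NOISY = ('urllib3', 'requests', 'google.api_core')  # example prefixes

    def filter(self, record):
        return not (record.levelno <= logging.DEBUG
                    and record.name.startswith(self.NOISY))

# Filters have to go on the handler (not the root logger) to catch records
# propagated up from child loggers; attach to whichever handler ships to
# Cloud Logging.
for handler in logging.getLogger().handlers:
    handler.addFilter(DropNoisyDebug())
```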
The next cost to look at is CPU. router is currently on four cores, atproto-hub on one. We should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs
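For reference, the Python profiler agent is only a few lines to turn on; a sketch using the documented google-cloud-profiler API (the service names are ours, the rest is boilerplate):

```python
import googlecloudprofiler

try:
    # Samples CPU and wall time in the background and uploads profiles.
    googlecloudprofiler.start(
        service='router',       # or 'atproto-hub'
        service_version='1',
        verbose=1,              # 0 = errors only ... 3 = debug
    )
except (ValueError, NotImplementedError) as e:
    # Best effort; don't block startup if the profiler can't initialize.
    print(f'Cloud Profiler not started: {e}')
```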
Next step for datastore reads: https://github.com/snarfed/arroba/issues/30
Results from the datastore reads optimization here have been disappointing. I cut them by ~3x, but spend on datastore read API calls only went down maybe 1/4-1/3, from $10-12/d to $7-8/d. Hrmph.
(Datastore reads are the top blue part of the cost bars above.)
Another opportunity here could be reducing router and atproto-hub allocated CPU. Router CPU is down to mostly under two cores, atproto-hub to maybe half a core. Could either drop router from 4 to 2, or leave it at 4 and merge atproto-hub into it.
^ The difficulty with merging router and atproto-hub is that we currently run four WSGI workers for router, and we want to run the atproto-hub threads on just one of them. I had trouble running all the threads in a single router worker a while back, but that may have been a red herring; it may actually work OK after all, and I doubt we're compute-bound in a way where the GIL would matter. Worth trying.
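One possible way to do that, sketched with made-up names rather than existing code: guard the atproto-hub startup with an advisory file lock, so only the worker that wins the lock starts the background threads.

```python
import fcntl

_LOCK_PATH = '/tmp/atproto-hub.lock'  # arbitrary; just needs to be shared by the workers
_lock_file = None  # keep a reference so the lock lives as long as the worker

def maybe_start_hub(start_threads):
    """Call from each WSGI worker at startup; only one worker wins."""
    global _lock_file
    _lock_file = open(_LOCK_PATH, 'w')
    try:
        # Non-blocking exclusive lock: exactly one worker process gets it, and
        # the OS releases it automatically if that worker dies.
        fcntl.flock(_lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False  # another worker already owns the hub threads

    start_threads()  # e.g. the firehose subscriber/serving threads
    return True
```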
Another idea here: stop storing and fetching transient activities (create, update, delete, undo, accept/reject, etc) in the datastore entirely. Just use memcache for them.
...but looking at our incoming activity types, that might not make a big difference. The bulk of incoming activities are likes, follows, and reposts, in that order.
Measure first, then optimize.
(I should double check to make sure that count isn't after various filtering though!)
Interesting, temporarily disabling AP outboxes for #1248 cut both our datastore read and frontend instance costs in half:
The drop in datastore read cost due to outboxes was mostly queries, as we'd expect.
Out of curiosity, do you have logs on how much of that was due to repeat/failed fetching? Failing to get a user could probably cause an associated post to not be persisted either.
This specific cost is our datastore, ie database. Not related to external fetches. (Which we don't really do during outbox rendering anyway.)
Sorry, that was unclear.
I meant incoming fetches done by other instances, i.e. when they discover objects referenced in the outbox.
Ah! I haven't looked specifically or tried to correlate, but all the drop here was in queries, and object fetches only do lookups.
Starting to look at this again...
Re-running analytics from https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2259178748 :
Datastore API calls by method:
Lookups by kind:
Queries by kind:
Queries by filter:
...somehow we're back to overdoing copy queries again?!
Object lookups by type:
Lookups by key. We're still not caching DID docs anywhere near enough, like last time in https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2261383697 .
Improving copies query caching got us a decent cut in RunQuery load, 9/19 ~9p on this graph. Next step there may be to memcache it.
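A rough sketch of what memcaching that lookup could look like; the function, key format, and TTL below are hypothetical, and it assumes a pymemcache client like the one the ndb global cache uses:

```python
import json

COPIES_TTL = 2 * 60 * 60  # 2h; arbitrary

def original_for_copy(kind, copy_uri, memcache, run_query):
    """Look up the original entity for a copy URI, memcaching the result.

    memcache is a pymemcache client; run_query runs the existing
    copies.uri EQUAL datastore query for this kind and returns the
    original's key id, or None.
    """
    cache_key = f'copy-{kind}-{copy_uri}'
    cached = memcache.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # b'null' round-trips to None, so misses are cached too

    original = run_query(kind, copy_uri)
    memcache.set(cache_key, json.dumps(original), expire=COPIES_TTL)
    return original
```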
Still confused why lookups for DID docs, at:// objects, etc. clearly aren't getting cached, either enough or at all: https://github.com/snarfed/bridgy-fed/issues/1149#issuecomment-2362264375
Big win from https://github.com/snarfed/bridgy-fed/issues/1329#issuecomment-2364716225 here!
Next step, stop storing non-object activities like CRUDs.
...and maybe switch send tasks to pass data in task HTTP payload too? Not sure yet.
Another idea here: inline send into receive when we have few enough targets, maybe <5 or so.
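Roughly, with illustrative names rather than the actual receive/send code:

```python
MAX_INLINE_TARGETS = 5  # threshold floated above

def deliver(activity, targets, send, enqueue_send_task):
    """Send synchronously when fan-out is small enough; otherwise enqueue send tasks."""
    if len(targets) <= MAX_INLINE_TARGETS:
        for target in targets:
            send(activity, target)           # skips task-queue overhead entirely
    else:
        for target in targets:
            enqueue_send_task(activity, target)
```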
Don't want to draw attention to this; I've been looking at it mostly behind the scenes, but I'd like to start tracking at least the investigation and work more publicly.
I expect there's plenty of low-hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker right now is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.