Open dstadulis opened 5 months ago
To be answered: are the elements that are being missed actually being cached?
From offline discussion, I think the cache was just too small. Worst-case size is 2x the number of asset IDs / groups, since there are both issuance and transfer roots.
This was also slow to populate for me locally, which caused a timeout on the CLI command `universe roots`.
Running 0.4.0-rc3 with the old "full sync" mode (I think?), connected to the testnet LL universe server:
```
2024-07-10 13:29:40.121 [INF] TADB: populating root cache...
2024-07-10 13:35:45.524 [INF] TADB: populating root cache...
2024-07-10 13:37:20.565 [DBG] TADB: Populating 512 root nodes into cache, took=1m35.040313762s
```
I think the cache gets dumped repeatedly during sync because it's too small?
Even after a successful `universe roots` call, subsequent calls claim to repopulate the root cache and take the same amount of time.
Universe stats:

```json
{
    "num_total_assets": "1186",
    "num_total_groups": "167",
    "num_total_syncs": "5130",
    "num_total_proofs": "103832"
}
```
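For scale, applying the worst-case sizing above to these stats (a back-of-envelope figure, since the exact mix of issuance vs. transfer roots may differ): 2 × (1186 + 167) ≈ 2,700 roots for this testnet server.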
Doubling the page size `universe.MaxPageSize` helped a lot; I'm not sure what fraction of the returned data is coming out of the cache now, though.
Update: that change caused sync to the testnet server to fail; maybe the handling of the limit parameter is subtly wrong there?
```
2024-07-10 14:11:39.575 [ERR] RPCS: [/universerpc.Universe/SyncUniverse]: unable to sync universe: rpc error: code = Unknown desc = invalid request limit
```
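For context, here's a hedged guess at the kind of server-side check that would produce this error (the function name and logic are illustrative, not the actual taproot-assets code): if the server validates the requested limit against its own compiled-in `MaxPageSize`, a client built with a doubled constant would trip it even though its requests are otherwise well-formed.

```go
package main

import "fmt"

// MaxPageSize mirrors the server's hard-coded limit (see the link to
// universe/interface.go below).
const MaxPageSize = 512

// validatePageQuery is a guess at the kind of check that would produce
// "invalid request limit": any limit above the server's own compiled-in
// maximum is rejected, so a client built with a doubled MaxPageSize
// trips it.
func validatePageQuery(limit int32) error {
	if limit <= 0 || limit > MaxPageSize {
		return fmt.Errorf("invalid request limit: %d", limit)
	}
	return nil
}

func main() {
	// A client compiled with a doubled page size asks for 1024 roots.
	if err := validatePageQuery(1024); err != nil {
		fmt.Println(err) // invalid request limit: 1024
	}
}
```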
The hard-coded max page size of 512: https://github.com/lightninglabs/taproot-assets/blob/d70bccd2714a3f808e070a080c510cf396a11284/universe/interface.go#L38-L42
The "populating root cache" log you see comes from this section: https://github.com/lightninglabs/taproot-assets/blob/d70bccd2714a3f808e070a080c510cf396a11284/tapdb/multiverse.go#L713-L732
Are those the remote Universe server's logs, or your own?
In that case, it tries to read the cache based on the query, but then falls back to reading from the DB to populate it. Def agree we can increase the max page size here though, as with 100k+ assets, 512 roots at a time isn't much (needed for the full-full sync).
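As a rough illustration of that read-then-fall-back flow (a minimal sketch with hypothetical names; `pageCache`, `fetchRootPage`, and the plain map standing in for the real LRU are not the actual tapdb types):

```go
package cachesketch

import "context"

// Root is a stand-in for a universe root node.
type Root struct{}

// rootPageQuery keys the cache per page, matching the description above.
type rootPageQuery struct {
	Offset, Limit int32
}

// rootStore abstracts the DB read used on a cache miss.
type rootStore interface {
	fetchRootPage(ctx context.Context, q rootPageQuery) ([]Root, error)
}

type pageCache struct {
	lru map[rootPageQuery][]Root // stand-in for the real LRU cache
	db  rootStore
}

// rootNodes tries the cache first, then falls back to the DB and
// populates the cache on a miss.
func (c *pageCache) rootNodes(ctx context.Context,
	q rootPageQuery) ([]Root, error) {

	if page, ok := c.lru[q]; ok {
		// Cache hit: serve straight from memory.
		return page, nil
	}

	// Cache miss: read the page from the DB...
	page, err := c.db.fetchRootPage(ctx, q)
	if err != nil {
		return nil, err
	}

	// ...and populate the cache for subsequent identical queries.
	c.lru[q] = page
	return page, nil
}
```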
After digging a bit more, I think we need to rethink the root cache design. The `allRoots` field stores an LRU cache that maps a `rootPageQuery` to the response.
The mainnet universe has around 170k assets and we have a page size of 512, so that's ~335 queries to grab all the roots. The inner cache stores a mapping from `rootPageQuery` to the slice of roots, and it has a max size of 50k. Therefore, during the course of a single request, we'll purge most of the cache, then end up with only the last 50k items in the cache. The next client comes along, and then we do the same thing all over again.
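To make that purge behavior concrete, here's a small back-of-envelope simulation (assuming, as above, ~170k roots, pages of 512, and a cache capacity counted in individual roots; the exact accounting in tapdb may differ):

```go
package main

import "fmt"

func main() {
	const (
		numRoots = 170_000 // approximate mainnet root count
		pageSize = 512     // universe.MaxPageSize
		cacheCap = 50_000  // max items held in the inner cache
	)

	numPages := (numRoots + pageSize - 1) / pageSize // ~333 pages/scan
	pagesThatFit := cacheCap / pageSize              // ~97 pages

	fmt.Printf("pages per full scan:      %d\n", numPages)
	fmt.Printf("pages the cache can hold: %d\n", pagesThatFit)

	// One full scan inserts ~333 pages into a cache that only holds
	// ~97, so the earliest pages are evicted before the scan finishes.
	// The next client starts again from page 0, which was just
	// evicted: every request misses, and the cache churns (the
	// "cache eviction cycling" mentioned later in this thread).
	fmt.Printf("pages evicted per scan:   %d\n", numPages-pagesThatFit)
}
```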
Based on the above analysis, I think we should just redesign this portion of the cache entirely. If we just cached the entire set of roots, then it's ~26MB ((32 + 8 + 32 + 32 + 32 + 1) bytes × 172,000 roots), which is pretty manageable. We can then maintain the entire set of roots in memory, and use the `RootNodesQuery` struct to slice into that, storing the amounts by ID somewhere else (cheaper to read without them, as they add extra joins). For reverse queries, we offset the index from the end.
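A minimal sketch of that design, assuming hypothetical `Root` and `RootNodesQuery` shapes (field names are illustrative, not the actual tapdb API): keep one flat slice of all roots and serve each page by slicing into it, offsetting from the end for reverse queries.

```go
package rootsketch

// Root is an illustrative stand-in for a universe root; the per-root
// size estimate above ((32+8+32+32+32+1) bytes) corresponds to fields
// of roughly this shape.
type Root struct {
	ID   [32]byte
	Hash [32]byte
	Sum  uint64
}

// RootNodesQuery mirrors the paging parameters described above.
type RootNodesQuery struct {
	Offset  int
	Limit   int
	Reverse bool
}

// rootCache keeps the entire set of roots in memory (~26MB at mainnet
// scale per the estimate above) and serves pages by slicing into it,
// instead of issuing one DB query per page.
type rootCache struct {
	allRoots []Root
}

// fetch serves one page straight out of the in-memory slice.
func (c *rootCache) fetch(q RootNodesQuery) []Root {
	n := len(c.allRoots)

	// Compute the window [start, end) for this page; reverse queries
	// page backwards from the end of the slice.
	var start, end int
	if q.Reverse {
		end = n - q.Offset
		start = end - q.Limit
		if start < 0 {
			start = 0
		}
	} else {
		start = q.Offset
		end = start + q.Limit
		if end > n {
			end = n
		}
	}
	if end <= 0 || start >= end {
		return nil
	}

	page := c.allRoots[start:end]
	if !q.Reverse {
		return page
	}

	// Reverse pages are returned newest-first; copy so the cache's
	// backing slice is never mutated.
	out := make([]Root, len(page))
	for i, r := range page {
		out[len(page)-1-i] = r
	}
	return out
}
```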
The final thing we need to resolve with this design sketch is how we handle invalidation. For non-grouped assets, the root is effectively immutable. For grouped assets, each time we insert a new issuance proof, we may need to update it. We also need to factor in the transfer roots, which can change for all asset types with each new transfer.
> The "populating root cache" log you see comes from this section:
>
> Are those the remote Universe server's logs, or your own?
All the log snippets I posted were from my client.
I agree that just having all roots in memory makes sense. For invalidation, I think the DB queries that insert new data should already provide the ID that is also being used in the cache? That is, since we're using the same ID / integer for the DB and for the query from the client, we should be able to query the DB for the entry representing the asset group, fetch the ID, and update only that entry.
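A hedged sketch of that targeted invalidation, building on the in-memory design above (the `byID` index and `upsertRoot` are illustrative names, not existing tapdb code; `Root` is the type from the previous sketch):

```go
package rootsketch

import "sync"

// indexedRootCache extends the flat in-memory cache above with an index
// from a root's universe ID to its position in allRoots, so a single
// entry can be refreshed without flushing the whole cache.
type indexedRootCache struct {
	mu       sync.Mutex
	allRoots []Root
	byID     map[[32]byte]int // universe ID -> index into allRoots
}

// upsertRoot would run after a new issuance or transfer proof is
// committed: an existing (grouped-asset or transfer) root is updated in
// place, while a brand-new issuance root is appended.
func (c *indexedRootCache) upsertRoot(r Root) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if i, ok := c.byID[r.ID]; ok {
		c.allRoots[i] = r
		return
	}

	c.byID[r.ID] = len(c.allRoots)
	c.allRoots = append(c.allRoots, r)
}
```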
Decoupling the page / request sizes for the RPC calls from those of the actual DB queries would also make sense IMO, but that becomes less relevant if we're just keeping all roots in memory.
The logging added in the original PR was useful for seeing all the 'touch points' where the cache is actually populated, accessed, or invalidated:
During a recent deployment, the team observed the behavior where the universe would fill the cache again on each request ("cache eviction cycling").
Background
This issue relates to the cache performance of `universe_roots` with respect to interacting with the DB cache. Based on the logs from Labs' universe, the hit-to-miss ratio for this cache shows underperformance. Remediating the issue will ensure the performance and efficiency of universe functionality.

Steps to reproduce [edit at will]
Let the `universe_roots` cache be populated with roots.

Expected result:
The hit-to-miss ratio should be higher than [?80%].
Actual result:
The cache hit-to-miss ratio is insufficient, at ~10%. Ramification question: do the cache misses cause additional, superfluous requests?
Possible solutions:
Please provide feedback or suggestions for further improvement after reviewing this issue.