lightninglabs / taproot-assets

A layer 1 daemon, for the Taproot Assets Protocol specification, written in Go (golang)
MIT License

[bug]: `universe_roots` cache-hit ratio, gross underperformance #880

Open dstadulis opened 5 months ago

dstadulis commented 5 months ago

Background

This issue relates to the performance of the `universe_roots` DB cache. Based on the logs from Labs' universe, the hit-to-miss ratio for this cache shows significant underperformance. Remediating the issue will improve the performance and efficiency of universe functionality.

│ 2024-04-01 23:32:09.632 [INF] TADB: db cache universe_roots: 260271 hits, 2225329 misses, 10.47% hit ratio
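That figure follows directly from hits / (hits + misses): 260271 / (260271 + 2225329) ≈ 0.1047, i.e. the 10.47% reported above.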

Steps to reproduce [edit at will]

  1. Initialize a universe
  2. Let the universe_roots cache be populated with roots
  3. Observe the cache usage over a period of time (e.g., minutes)
  4. Check the cache hit-to-misses ratio in the logs or monitoring dashboard

Expected result:

The hit-to-miss ratio should be higher than [?80%]

Actual result:

The cache hit-to-miss ratio is insufficient, at ~10%. Ramification question: do the cache misses cause additional, superfluous DB requests?

Possible solutions:

  1. Analyze the current cache configuration to identify any optimization opportunities.
  2. Investigate whether there are inefficiencies in the caching logic that need to be fixed.

Please provide feedback or suggestions for further improvement after reviewing this issue.

dstadulis commented 5 months ago

To be answered: are the elements that are being missed actually being cached?

jharveyb commented 5 months ago

From offline discussion, I think the cache was just too small. The worst-case size is 2x the number of asset IDs / groups, since there are both issuance and transfer roots.
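For example, with purely hypothetical numbers: 1,000 asset IDs plus 150 groups would give a worst case of roughly 2 × (1,000 + 150) = 2,300 cached roots, one issuance and one transfer root each.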

jharveyb commented 2 months ago

This was also slow to populate for me locally, which caused a timeout on the CLI command `universe roots`.

Running 0.4.0 rc3, with the old "full sync" mode (I think?) connected to the testnet LL universe server:

2024-07-10 13:29:40.121 [INF] TADB: populating root cache...
2024-07-10 13:35:45.524 [INF] TADB: populating root cache...
2024-07-10 13:37:20.565 [DBG] TADB: Populating 512 root nodes into cache, took=1m35.040313762s

I think the cache gets dumped repeatedly during sync because it's too small?

Even after a successful `universe roots` call, future calls claim to be repopulating the root cache and take the same amount of time.

Universe stats:

{
    "num_total_assets": "1186",
    "num_total_groups": "167",
    "num_total_syncs": "5130",
    "num_total_proofs": "103832"
}
jharveyb commented 2 months ago

Doubling the page size `universe.MaxPageSize` helped a lot; not sure what % of the data returned is coming out of the cache now, though.

Update: that caused sync to the testnet server to fail; maybe the handling of the limit parameter is subtly wrong there?

2024-07-10 14:11:39.575 [ERR] RPCS: [/universerpc.Universe/SyncUniverse]: unable to sync universe: rpc error: code = Unknown desc = invalid request limit
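One hedged guess at what goes wrong, illustrated with hypothetical code (this is not the actual tapd validation logic): if the server validates the requested limit against its own hard-coded maximum, a client that doubles `universe.MaxPageSize` sends page requests the server refuses.

// Hypothetical illustration (not actual tapd code) of why a client
// page size larger than the server's hard-coded maximum is rejected:
// the server checks the requested limit against its own cap.
package main

import (
	"errors"
	"fmt"
)

// serverMaxPageSize stands in for the server's universe.MaxPageSize.
const serverMaxPageSize = 512

func validateLimit(limit int32) error {
	// Anything above the server's cap is refused outright.
	if limit < 0 || limit > serverMaxPageSize {
		return errors.New("invalid request limit")
	}
	return nil
}

func main() {
	fmt.Println(validateLimit(512))  // <nil>: accepted by the server
	fmt.Println(validateLimit(1024)) // error: invalid request limit
}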

Roasbeef commented 2 months ago

The hard-coded max page size of 512: https://github.com/lightninglabs/taproot-assets/blob/d70bccd2714a3f808e070a080c510cf396a11284/universe/interface.go#L38-L42

Roasbeef commented 2 months ago

The "populating root cache" you see comes from this section: https://github.com/lightninglabs/taproot-assets/blob/d70bccd2714a3f808e070a080c510cf396a11284/tapdb/multiverse.go#L713-L732

Are those the remote Universe server's logs, or your own?

In that case, it tries to read the cache based on the query, but then falls back to reading from the DB to populate it. Def agree we can increase the max page size here though, as with 100k+ assets, 512 roots at a time isn't much (needed for the full-full sync).

Roasbeef commented 2 months ago

After digging a bit more, I think we need to rethink the root cache design. The `allRoots` field stores an LRU cache that maps a `rootPageQuery` to the response.

The mainnet universe has around 170k assets, and we have a page size of 512, so that's ~335 queries to grab all the roots. The inner cache stores a mapping from `rootPageQuery` to the slice of roots, and has a max size of 50k. Therefore, during the course of a single request, we'll purge most of the cache and end up with only the last 50k items. The next client comes along, and we do the same thing all over again.
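As a rough check on the query count: 170,000 roots / 512 roots per page ≈ 332 pages, in line with the ~335 queries mentioned above.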

Based on the above analysis, I think we should just re-design this portion of the cache entirely. If we just cached the entire set of roots, then it's ~26MB ((32+8+32+32+32+1) * 172000), which is pretty manageable. We can then maintain the entire set of roots in memory, and use the `RootNodesQuery` struct to slice into it, storing the amounts by ID somewhere else (it's cheaper to read without them, as they add extra joins). For reverse queries, we offset the index.
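A minimal sketch of that direction, purely for illustration: the type and field names below are assumptions rather than the actual tapdb structs, and the reverse handling is only one possible reading of "we offset the index".

// Minimal sketch of the "keep all roots in memory" idea; names and
// fields are illustrative assumptions, not actual tapdb types.
package rootcache

import "sync"

// Root is a trimmed stand-in for a universe root entry.
type Root struct {
	ID       [32]byte
	RootHash [32]byte
	RootSum  uint64
}

// RootNodesQuery carries the paging parameters: offset, limit, and a
// direction flag for reverse iteration.
type RootNodesQuery struct {
	Offset  int
	Limit   int
	Reverse bool
}

type rootsCache struct {
	sync.RWMutex

	// allRoots holds the full set of roots (~26MB at ~170k assets), so
	// paged queries become in-memory slicing instead of DB reads.
	allRoots []Root
}

// FetchPage answers a paged query purely from memory. For reverse
// queries the offset is counted from the end of the set; a real
// implementation might also flip the order of the returned page.
func (c *rootsCache) FetchPage(q RootNodesQuery) []Root {
	c.RLock()
	defer c.RUnlock()

	n := len(c.allRoots)
	start := q.Offset
	if q.Reverse {
		start = n - q.Offset - q.Limit
		if start < 0 {
			start = 0
		}
	}
	if start >= n || q.Limit <= 0 {
		return nil
	}

	end := start + q.Limit
	if end > n {
		end = n
	}

	page := make([]Root, end-start)
	copy(page, c.allRoots[start:end])
	return page
}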

The final thing we need to resolve with this design sketch is how we handle invalidation. For non-grouped assets, the root is effectively immutable. For grouped assets, each time we insert a new issuance proof, we may need to update it. Then we also need to factor in the transfer roots, which can change for all asset types with each new transfer.
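A standalone sketch of those invalidation cases, with a hypothetical map layout and function name (nothing here is actual tapdb code):

// Sketch of per-entry invalidation on proof insert; the map layout and
// function name are assumptions, not actual tapdb code.
package rootcache

type assetKey = [32]byte

// invalidateOnInsert drops only the affected cache entries when a new
// proof is inserted, rather than clearing the whole root cache.
func invalidateOnInsert(issuance, transfer map[assetKey][32]byte,
	id assetKey, grouped, isTransfer bool) {

	switch {
	case isTransfer:
		// Transfer roots can change for all asset types with each
		// new transfer.
		delete(transfer, id)
	case grouped:
		// Grouped issuance roots may change with each new issuance
		// proof for the group.
		delete(issuance, id)
	default:
		// Non-grouped issuance roots are effectively immutable, so
		// there is nothing to invalidate after the initial insert.
	}
}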

jharveyb commented 2 months ago

The "populating root cache" you see comes from this section:

https://github.com/lightninglabs/taproot-assets/blob/d70bccd2714a3f808e070a080c510cf396a11284/tapdb/multiverse.go#L713-L732

Are those the remote Universe server's logs, or your own?

All the log snippets I posted were from my client.

I agree that just having all roots in memory makes sense; I think for invalidation, the DB queries to insert new data should already provide the ID that is also being used in the cache?

Rather, since we're using the same ID / integer for the DB and the query from the client, we should be able to query the DB for the entry representing the asset group, fetch the ID, and update only that entry.

Decoupling the page / request sizes for the RPC calls and the actual queries would also make sense IMO, but that would be less relevant if we're just keeping all roots in memory.

jharveyb commented 2 months ago

The logging in the original PR was useful for seeing all the 'touch points' where the cache is actually populated, accessed, or invalidated:

https://github.com/lightninglabs/taproot-assets/pull/626/

dstadulis commented 2 months ago

During a recent deployment, the team observed the behavior where the universe would fill the cache again on each request ("cache eviction cycling").