DHT: improve data hosting metric

shyba commented 2 years ago

This issue is considered complete when the dashboard has an automatically updating metric for how many TB of data are available for download.

Today, we listen for queries under a shard (node id prefix) and calculate the data availability from what got announced vs the amount that got claimed. This is efficient but inaccurate, because:

background downloader does not announce
not every announcement is reachable
claim set size changes (minor)

This issue proposes a new approach, using the script from #3625, along these lines:

pick 2 random bytes
query hub for all streams starting with those 2 random bytes (should be 270-350 claims)
actively search them (slowly to avoid flood)
check results (slowly)
calculate reachable / total sample size as % downloadable
calculate total results / total sample size as % theoretical maximum
run this every 2h, which should keep the impact low while still probing 12 times a day
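
The steps above can be sketched roughly as follows. This is only an illustration, not the script from #3625: `random_prefix`, `measure`, `ProbeResult` and `Availability` are hypothetical names, the hub query and the DHT search are left to the caller (`sd_hashes` and `probe`), and "total results" is assumed to mean the number of sampled streams for which the search returned at least one peer.

```python
import os
import time
from typing import Callable, Iterable, NamedTuple


class ProbeResult(NamedTuple):
    peers_found: int   # peers returned by the DHT search for this sd_hash
    reachable: bool    # True if at least one of those peers served the blob


class Availability(NamedTuple):
    sample_size: int
    downloadable: float      # reachable / sample size
    theoretical_max: float   # streams with any search result / sample size


def random_prefix(n_bytes: int = 2) -> str:
    """Return n_bytes random bytes as lowercase hex (4 chars for 2 bytes)."""
    return os.urandom(n_bytes).hex()


def measure(sd_hashes: Iterable[str],
            probe: Callable[[str], ProbeResult],
            delay: float = 0.0) -> Availability:
    """Probe each sampled sd_hash, pausing `delay` seconds between probes."""
    total = found = reachable = 0
    for sd_hash in sd_hashes:
        result = probe(sd_hash)          # hypothetical: search, then check peers
        total += 1
        found += result.peers_found > 0  # counts toward the theoretical maximum
        reachable += result.reachable    # counts toward the downloadable share
        time.sleep(delay)                # throttle to avoid flooding the DHT
    if total == 0:
        return Availability(0, 0.0, 0.0)
    return Availability(total, reachable / total, found / total)
```

With a 2-byte prefix there are 65,536 possible buckets, so each run only touches a small slice of the claim set, and one run every 2 hours gives the 12 probes a day mentioned above.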

Idea not in this PR:

compare the result sets from iterative find with those from the script. This should show how well everything is working end-to-end.

fixes #3633

moodyjon commented 1 year ago

"pick 2 random bytes; query hub for all streams starting with those 2 random bytes (should be 270-350 claims)"

Are you talking about searching by stream name or stream ID?

Claim names are human-meaningful, and the distribution of characters will not be uniform. The claim IDs would be uniformly random (IIUC) hex characters.

I worry that searching by name would produce widely varying numbers of claims (or claims that are correlated in some way).

shyba commented 1 year ago

Hello there,

The LBRY DHT is based on Kademlia with SHA-384 hashing. Items are only searchable by content hash (the sd_hash in a claim). This step searches the hub for a sample of sd_hashes. See https://github.com/lbryio/lbry-sdk/blob/cc6cdc07f5067aa3a8e40b5421e0fd50fffbe0e7/scripts/sd_hash_sampler.py
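
To make the Kademlia part concrete, here is a minimal sketch of content-hash keys and XOR distance. It is not code from lbry-sdk: `content_hash`, `xor_distance` and `closest` are hypothetical helpers, and it assumes the key is simply the SHA-384 digest of the blob bytes. Because digest bits are effectively uniform, the first 2 bytes (4 hex characters) of an sd_hash split the keyspace evenly, which is why sampling by prefix should not be skewed the way name prefixes would be.

```python
import hashlib
from typing import List


def content_hash(blob: bytes) -> str:
    """SHA-384 digest as 96 hex characters (48 bytes)."""
    return hashlib.sha384(blob).hexdigest()


def xor_distance(a_hex: str, b_hex: str) -> int:
    """Kademlia distance: the two IDs XORed, read as an unsigned integer."""
    return int(a_hex, 16) ^ int(b_hex, 16)


def closest(target_hex: str, node_ids: List[str], k: int = 3) -> List[str]:
    """The k node IDs nearest to the target under the XOR metric."""
    return sorted(node_ids, key=lambda n: xor_distance(target_hex, n))[:k]


# Illustrative only: node IDs here are just hashes of dummy bytes.
nodes = [content_hash(bytes([i])) for i in range(10)]
target = content_hash(b"example stream descriptor")
print(target[:4], closest(target, nodes))
```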
