probe-lab / network-measurements


Track number of client nodes in the IPFS DHT Network #30

Closed yiannisbot closed 1 year ago

yiannisbot commented 1 year ago

Summarising several approaches from out-of-band discussions here to have them documented.

Approach 1: kubo README file - idea initially circulated by @BigLep

Description: The kubo README file is stored and advertised by every node in the network (https://github.com/ipfs/kubo/pull/9590#issuecomment-1419192459), regardless of whether the node starts out as a client or a server. The provider records for this README become stale after a while, either because peers are categorised as clients (and are therefore unreachable), or because they leave the network (churn). But the records remain until they expire. We could count the number of providers across the network for the kubo README CID and approximate the network-wide client vs server ratio. Downside: This approach would only count kubo nodes (which is a good start and likely the vast majority of clients).
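
A minimal sketch of the counting step, assuming a local kubo daemon exposing the standard RPC API on 127.0.0.1:5001 and the routing query-event encoding where event type 4 denotes a Provider event; the CID constant is a placeholder for the CID actually advertised by kubo nodes:

```go
// Sketch: count unique provider PeerIDs for the kubo README CID by
// streaming DHT query events from a local kubo daemon's RPC API.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Placeholder: substitute the CID under which kubo nodes advertise their README.
const readmeCID = "<kubo-readme-cid>"

// queryEvent mirrors only the fields we need from kubo's routing query
// events; event type 4 corresponds to a Provider event.
type queryEvent struct {
	Type      int
	Responses []struct{ ID string }
}

func main() {
	resp, err := http.Post(
		"http://127.0.0.1:5001/api/v0/routing/findprovs?arg="+readmeCID, "", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	providers := map[string]struct{}{}
	dec := json.NewDecoder(resp.Body) // events arrive as a newline-delimited JSON stream
	for {
		var ev queryEvent
		if err := dec.Decode(&ev); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		if ev.Type == 4 { // Provider event
			for _, p := range ev.Responses {
				providers[p.ID] = struct{}{}
			}
		}
	}
	fmt.Printf("unique provider records for the README CID: %d\n", len(providers))
}
```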

Approach 2: Honeypot - idea circulated by @dennis-tra

Description: We have:

Maybe we can estimate what share of queries should come across the honeypot and then, based on the number of unique clients the honeypot sees, estimate the total number of clients in the network. This would be a low-overhead setup, and more honeypots may allow better estimates. Downside: The approach incurs maintenance and infrastructure cost for the honeypot(s).
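
A back-of-the-envelope sketch of that extrapolation, assuming we can derive for each honeypot the probability that a random client's queries reach it; all names and inputs here are hypothetical:

```go
// estimateTotalClients extrapolates the network-wide client count from
// honeypot observations. uniqueSeen[i] is the number of unique client
// PeerIDs honeypot i observed over the measurement window; reachProb[i]
// is an estimate of the probability that a random client's queries
// traverse honeypot i (derived, e.g., from its position in the DHT
// keyspace). Per-honeypot estimates are averaged to smooth out noise.
func estimateTotalClients(uniqueSeen []int, reachProb []float64) float64 {
	var sum float64
	for i := range uniqueSeen {
		sum += float64(uniqueSeen[i]) / reachProb[i]
	}
	return sum / float64(len(uniqueSeen))
}
```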

Approach 3: Baby-Hydras - idea circulated by @guillaumemichel

Description: Another approximation we could get is by running multiple DHT servers (think of a few baby hydras). Each DHT server would log all PeerIDs sending DHT requests, and we would get the % of clients vs servers by correlating the logs with crawl results. This gives the % of clients and servers observed; we average the results across all DHT servers and extrapolate to get the total number of clients, given that we know the total number of servers. Downside: The approach incurs maintenance and infrastructure cost for the DHT servers/baby-hydras.
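
A sketch of that extrapolation, assuming per-server logs already split into client and server PeerID counts by matching against crawl results; function and parameter names are hypothetical:

```go
// estimateClientTotal averages the client/server ratio observed by each
// logging DHT server (clients and servers disambiguated by correlating
// the logs with crawl results) and scales it by the known total number
// of DHT servers to approximate the total number of clients.
func estimateClientTotal(clientsSeen, serversSeen []int, totalServers int) float64 {
	var ratioSum float64
	for i := range clientsSeen {
		ratioSum += float64(clientsSeen[i]) / float64(serversSeen[i])
	}
	avgRatio := ratioSum / float64(len(clientsSeen))
	return avgRatio * float64(totalServers)
}
```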

Approach 4: Bootstrapper + Nebula - info gathered by @yiannisbot

Description: We capture the total number of unique PeerIDs through the bootstrappers. What this gives us is the "total number of nodes that joined the network as either clients or servers". Given that we have the total number of DHT server nodes from the Nebula crawler, we can get a pretty good estimate of the number of clients that join the network. The calculation would simply be: total number of unique PeerIDs (seen by bootstrappers) - DHT server PeerIDs (found by Nebula). In this case, clients will include other non-kubo clients (whether based on the Go IPFS codebase, Iroh, etc.) as well as js-ipfs-based ones (Node.js, and maybe browser, although browser nodes shouldn't be talking to the bootstrappers anyway). Downside: We rely on data from a central point: the bootstrappers.
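
A minimal sketch of that calculation, with both input lists as hypothetical placeholders (unique PeerIDs logged by the bootstrappers, and DHT-server PeerIDs from a Nebula crawl):

```go
// estimateClients returns the Approach 4 estimate: the number of unique
// PeerIDs seen by the bootstrappers that do not appear among the DHT
// server PeerIDs found by a Nebula crawl.
func estimateClients(bootstrapPeerIDs, nebulaServerIDs []string) int {
	servers := make(map[string]struct{}, len(nebulaServerIDs))
	for _, id := range nebulaServerIDs {
		servers[id] = struct{}{}
	}
	seen := make(map[string]struct{}, len(bootstrapPeerIDs))
	clients := 0
	for _, id := range bootstrapPeerIDs {
		if _, dup := seen[id]; dup {
			continue // count each PeerID once
		}
		seen[id] = struct{}{}
		if _, isServer := servers[id]; !isServer {
			clients++
		}
	}
	return clients
}
```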


Approach 4 seems like the easiest way to get quick results. The rest would be good to have in order to compare results and gain extra data points.

Any other views, or suggested approaches?

guillaumemichel commented 1 year ago

Concerning approach 1: Is there a limit on the number of Provider Records that a DHT Server can (1) store or (2) return in a DHT lookup response?

Concerning approach 2: How would the nodes get to contact the honeypot?

lidel commented 1 year ago

In case you want to explore a more generalized version of Approach 1: the majority of active client nodes will likely fetch, and subsequently reprovide, empty objects at some point:

This comes with the nice side-effect of catching real nodes that announce their block caches, and is less likely to catch random PeerIDs from CI runs etc.

BigLep commented 1 year ago

Noting that approach 4 is what is being followed for "Number of Client vs Server Nodes in the DHT" in https://www.notion.so/pl-strflt/IPFS-KPIs-f331f51033cc45979d5ccf50f591ee01?pvs=4#ce43d82d30b94de0848c71a9fad414ab

yiannisbot commented 1 year ago

Closing this issue as, for now, we're following approach 4 above.

If we end up using a different approach in the future (e.g., when nodes persist their routing tables upon restart and bootstrappers end up capturing only new nodes joining), or want to get a more holistic view of clients in the IPFS network (e.g., as per: https://github.com/protocol/network-measurements/issues/45), we'll re-open the issue, if needed.