RFM Proposal: Number of Client nodes across various networks and implementations

yiannisbot commented 1 year ago

We are currently capturing the number of clients observed in the public IPFS DHT network, and we report this as part of our weekly reports (currently in this repo; see the example for Week 17) as well as at probelab.io: https://probelab.io/ipfsdht/#client-vs-server-node-estimate.

As per this discussion thread in Slack, this is great but captures only part of the story: it focuses on the public IPFS DHT only, which in turn means that it mostly covers Kubo. However, IPFS is more than the Kubo implementation and more than the public IPFS DHT. A request from @BigLep is to be able to "show the number of peer ids observed across various 'networks' and break out by implementation".

To do this, we'd need to identify data sources (i.e., how to collect the data) for different: i) IPFS implementations (e.g., Kubo, Helia, Iroh), and ii) networks that run IPFS nodes (e.g., the IPFS DHT, the Lotus DHT, cid.contact/IPNI, etc.). We should also ideally deduplicate the PeerIDs to avoid double-counting a peer that participates in more than one network (?).

I'm starting this issue to capture first what we want to target and then come up with data collection ideas (e.g., through measurement tools, logs etc.).

cc: @BigLep @dennis-tra

dennis-tra commented 1 year ago

I think this is a great initiative and would be super insightful!

> avoid double-counting a peer that participates in more than one network

I think the more common scenario would be that we might double-count peers across different data sources as opposed to a single peer participating in multiple networks.

BigLep commented 1 year ago

Thanks for creating this @yiannisbot. I'm pasting in some of the relevant info from FIL slack, in case someone can't easily access it:

Concerning Number of Client vs Server Nodes in the DHT

Pros
- Accuracy / comprehensiveness has gotten to a good state
- Can be generated automatically
Cons (or things it's not covering)
- Insight into implementation prevalence
- Just focused on the DHT-using nodes

I think addressing the cons is pretty important given the themes of the last year:

- IPFS is intended to be more than Kubo. It would be great to show how that is actually going across various networks.
  - (For example, when I looked at the bootstrapper breakdown at the beginning of April, there was far higher prevalence of js-ipfs than I expected - screenshot below).
- IPFS usage is beyond the public DHT

Our current network size KPIs aren't helping drive home the message of the diversity of the IPFS project.

BigLep commented 1 year ago

Here is a mock of what I'm thinking: https://docs.google.com/spreadsheets/d/1SHHPBZEsZvZ95skg8MgRNHoSaog6tJ-DpvuZePZZlJ4/edit#gid=0

Specifically, I think we need to think about our metric collection from "network probes". If implementations don't identify themselves, they get bucketed as unknown/other.

For example:

- Banana DHT clients and servers: PL-run bootstrappers
- Banana DHT servers: nebula crawler
- cid.contact IPNI: server access logs (and nodes should ideally share some form of peerid, presumably obfuscated for privacy)
- Filecoin DHT: nebula crawler or Max's Kademlia explorer
- For DHTs, we identify implementations from the Identify protocol (or whatever we're doing now)
- For HTTP endpoints, we identify implementations by the User-Agent HTTP header.
- Lassie, Kubo, etc. should be identifiable by both of these means.
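
The bucketing rule above (use the agent string reported via Identify or the HTTP User-Agent header, and fall back to unknown/other) could be sketched roughly as follows. This is only an illustration: the prefix table, the function name `classify_agent`, and the choice to fold `go-ipfs` into the Kubo bucket are assumptions, not something any of the tools listed above is known to do.

```python
# Hypothetical prefix table (an assumption for illustration); extend it as
# new implementations are observed in the collected agent strings.
KNOWN_PREFIXES = {
    "kubo": "Kubo",
    "go-ipfs": "Kubo",  # Kubo's former name, folded into one bucket here
    "js-ipfs": "js-ipfs",
    "helia": "Helia",
    "iroh": "Iroh",
    "lassie": "Lassie",
}

UNKNOWN = "unknown/other"


def classify_agent(agent):
    """Map an agent string to an implementation bucket.

    `agent` is either the agent version learned via Identify or the
    value of an HTTP User-Agent header. Missing or unrecognised values
    are bucketed as unknown/other.
    """
    if not agent:
        return UNKNOWN
    # Take the part before the first "/" as the implementation name.
    name = agent.strip().split("/", 1)[0].lower()
    return KNOWN_PREFIXES.get(name, UNKNOWN)
```

With this table, `classify_agent("kubo/0.20.0/")` lands in the "Kubo" bucket, while an empty or unrecognised agent string lands in "unknown/other".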

The graph above is for a single month. I could imagine showing that collection of bars grouped together for each month and then displaying multiple months along the x-axis.

BigLep commented 1 year ago

> We should also ideally deduplicate the PeerIDs to avoid double-counting a peer that participates in more than one network (?)

I don't think this needs to be a priority currently. We can make it a caveat that nodes (peerids) will participate in multiple "networks" and that as a result, it is not accurate to say "the total number of unique IPFS peerids is the sum of all the bars". For example, I think it's fine for a Kubo peerId to count towards "Banana DHT server", "Banana DHT client", and "cid.contact IPNI".

I do think we should deduplicate peerIds within a given "network". For example, a Kubo node that participates as a "Banana DHT client" every day for a month should only increase the count for that month by 1 (not 30).
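
A minimal sketch of that counting rule, assuming observations arrive as `(date, network, implementation, peer_id)` tuples with ISO `YYYY-MM-DD` dates. The tuple layout and the function name `monthly_unique_peers` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def monthly_unique_peers(observations):
    """Count unique peer IDs per (month, network, implementation).

    `observations` is an iterable of (date, network, implementation,
    peer_id) tuples, where `date` is an ISO "YYYY-MM-DD" string
    (an assumed input shape).
    """
    seen = defaultdict(set)
    for day, network, implementation, peer_id in observations:
        month = day[:7]  # "YYYY-MM"
        # A set keeps each peer ID once per bucket, however many days
        # (or data sources) reported it during the month.
        seen[(month, network, implementation)].add(peer_id)
    return {key: len(peers) for key, peers in seen.items()}
```

Because the network is part of the key, the same peer ID still counts once in each network it appears in, which matches the caveat above. Because the data source is not part of the key, a peer reported by two sources for the same network is also counted once. One limitation: a peer that changes implementation mid-month would show up in two implementation buckets.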

probe-lab / network-measurements

RFM Proposal: Number of Client nodes across various networks and implementations #45