Checklist

- [X] My issue is specific & actionable.
- [X] I am not suggesting a protocol enhancement.
- [X] I have searched on the issue tracker for my issue.

Description
This is the corresponding issue for a client side change to mitigate the impact of large numbers of unreachable providers.
Context
The context is explained in #9982.
Proposal
We identified two ways forward to address the impact of unreachable providers:

1. Prioritization logic for provider records on the server side: the peers that serve provider records sort them in such a way that, e.g., the first record in the list likely points to a peer that is actually reachable.
2. Delayed provider record publication: e.g., only announce blocks if a peer has been online for some time. The assumption is that this will filter out rather short-lived peers.

This GH issue is for proposal 2). The discussion around 1) happens in #9982.
1. Delay Provider Record Publication
The idea here is that we change the default configuration to only reprovide blocks after a Kubo node has had a consecutive uptime of X minutes/hours/days. The assumption is that nodes which have been online for a long stretch will likely stay online and are stable. There are some nuances to consider here (copied from https://github.com/protocol/network-measurements/issues/49#issuecomment-1553665143):
Today, block reproviding is a global flag in Kubo (IPFS Desktop, Brave): we do not distinguish between blocks fetched while browsing websites (temporarily stored in the cache) and blocks imported by the user adding their own data to the local node (either pinned, in MFS, or just in cache). Both types of data are stored and reprovided by the same code paths, and we can't rely on pinning and MFS to identify user data, because `ipfs block put` and `ipfs dag put` do not pin by default.
That is to say, disabling reproviding only for third-party content is not trivial: to stop reproviding only third-party website data, we would have to introduce separate datastores with different reproviding settings for first-party and third-party blocks in Kubo. Content explicitly imported by the user (`ipfs add`, `ipfs dag put`, `ipfs block put`, `ipfs dag import`), or pinned by the user, would be added/moved to the first-party datastore.
A different, somewhat simpler approach would be to keep a single datastore, but instead introduce a new default "auto" `Reprovider.Strategy` that:

- always announces pinned content (+ implicitly pinned MFS) → ensures user content is always reachable asap
- announces the remaining blocks in cache (incl. ones that come from browsed websites) ONLY if a node was online for some time (we would add an optionalDuration `Reprovider.UnpinnedDelay` to allow users to adjust the implicit default)

TBD how we solve `ipfs dag put` and `ipfs block put` or other user content that is not pinned, but expected to "work instantly":

- (A) we could flip `--pin` in them to `true` → breaking change (may surprise users who expect these to not keep garbage around, may lead to services running out of disk space)
- (B) we could say that the ability for users to set `Reprovider.Strategy` to `all` and/or adjust `Reprovider.UnpinnedDelay` is enough here; `ipfs routing provide` exists, and we could add `--all` to allow apps/users to manually trigger provide before `Reprovider.UnpinnedDelay` hits. (Feels safer than A: no DoS; worst case a delay in announce on a cold boot.)
A personal remark: It would be great if the user content that is expected to "work instantly" could make use of the fast provide operation. I think these commands are not blocking right now, correct? Using optimistic provide could justify making them blocking. But again, the provide strategy is a global switch. It would be great if the application layer could have more control over the publication process based on its specific needs.
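For concreteness, the resulting Kubo configuration could look like the following sketch. `Reprovider.Strategy` and `Reprovider.Interval` are existing settings; the `"auto"` value and `UnpinnedDelay` are the proposed additions and do not exist yet, and the durations shown are placeholders:

```json
{
  "Reprovider": {
    "Interval": "22h",
    "Strategy": "auto",
    "UnpinnedDelay": "48h"
  }
}
```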
2. Decreased Provider Record TTL
The idea here is to keep everything as is and just transmit the desired provider record TTL. The TTL would be calculated based on the node's uptime and would only become a high number if the node has been up for X minutes/hours/days.
At first glance this looks like a breaking protocol change, but protobuf allows adding new fields without breaking old implementations (see comment from @guillaumemichel: https://github.com/protocol/network-measurements/issues/49#issuecomment-1599099089). This means we could add the new field, and nodes that understand it could adhere to the TTL the provider wants to set. Everyone else would just continue as before.
Some things to consider:
- @aschmahmann remarked that we should be careful not to open a DoS vector.
- The TTL that the client wants to set should have an upper bound that the server enforces. This value should be set to the current TTL.
- This strategy increases load on DHT servers because reprovides will happen more frequently. On the other hand, the number of provider records a server holds could decrease because they are garbage collected more frequently.
Measurements
TBD: How can we substantiate the proposal with numbers? Some ideas.

Meta comment: the changes proposed here won't improve things at release time; it will take 6+ months to see improvement.

- we have not made server-side changes so far, so there are a lot of unknown unknowns
- if we improve things on the client side, it should preferably be at the time of looking for providers, not publishing providers; that would improve things at release time (lower risk, faster feedback loop, easier to run simulations or back away)
- Kubo maintainers would prefer tackling the lookup side (could be tackled in parallel, but realistically, if we have to choose, this would be our bet)

So proposals that make provider lookup smarter should take precedence over the ones described here (feel free to file a new issue and cc this one for discoverability).