Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap

aschmahmann commented 2 years ago

We have had a delay inside of go-bitswap where we wait for Bitswap broadcasts before we do a content routing lookup.

This has existed for a long time, and DHT performance has gotten much better in the meanwhile. It seems likely that we can drop this number without causing much more load on major use cases, although some analysis might be needed.

If we find analysis is a pain here, we could at least expose this as a Bitswap.Internal config.

The option is: https://github.com/ipfs/go-bitswap/blob/b6f0cc7c83aaa27a39cc7e1b16ee34bba2d8b5b8/bitswap.go#L73

guseggert commented 2 years ago

Consider also determining the delay dynamically so this doesn't have to be manually tweaked.

Also, ensure the metrics necessary to determine this value are exposed, and the process for determining the value is documented, so it can be tweaked over time.

proto-sid commented 2 years ago

@guseggert In the interest of being iterative can we just push the change through (drop this delay) and add the dynamic value support later?

cc: @iand

yiannisbot commented 2 years ago

I'm definitely 100% in favour of what @aschmahmann suggests!

We've done some extensive measurement studies and found that the DHT can, in many cases (mostly depending on geographic location of the client/requestor), complete the lookup in less than 1s - see attached fig. AFAIK, the bitswap delay is set to 1s (notice the shift by 1s between the leftmost and the middle fig), which means that it adds more than 100% of delay before the lookup is complete, in case the bitswap broadcast is unsuccessful.

Screenshot 2022-04-06 at 09 22 55

This obviously begs the question of how successful is the Bitswap discovery - my guess is that:

for normal clients discovery success rate not high, if not negligible,
for clustered peers discovery success rate might be higher and generally worth keeping.
If true, this ^ indicates that clustered peer operators would have to tweak that manually. We could have a set of recommendations for those cases.

Could we have a hybrid of starting the bitswap discovery and the DHT lookup in parallel, until we figure out more details? I believe the config is this one: https://github.com/ipfs/go-bitswap/blob/dbfc6a1d986e18402d8301c7535a0c39b2461cb7/internal/defaults/defaults.go#L10 which needs to be set to 0?

The study we plan to carry out in order to make a justified decision is described here: https://github.com/protocol/network-measurements/blob/master/RFMs.md#rfm-16--effectiveness-of-bitswap-discovery-process - any input more than welcome.

iand commented 2 years ago

I will take a look at starting bitswap discovery and the DHT lookup in parallel

mrd0ll4r commented 2 years ago

I also think doing them in parallel would be good. I've measured Bitswap response times (from a well-connected server), and found them to be generally below one second:

(Take that figure with a grain of salt)

yiannisbot commented 2 years ago

Thanks @mrd0ll4r ! Do you happen to have (even rough) estimates on the success rate of the bitswap discovery process? I.e., out of how many bitswap requests do the successful responses you've measured in the graph above come from?

mrd0ll4r commented 2 years ago

Unfortunately not directly, I don't have that dataset anymore. But I do still have a few graphs: We did this with 14k peers and 850 CIDs, so we sent roughly 12 million requests in total. Out of the 850 CIDs, ~650 were available at least once.

This is above graph, but with counts instead of densitites: I actually have to read the numbers from the graph (as the dataset is no more), but I'd guess that less than... 20k(?) responses were positive.

We also counted the availability by CID and by peer, both of which are skewed:

Now, while 20k out of 12 million doesn't sound great, we've also found that, by now, most peers understand and respect SEND_DONT_HAVE, which means that we don't have to rely solely on timeouts.

Anyway. Take all this with a grain of salt. I don't have this running regularly, and some of the CIDs were hand-picked, so it's all... not the most clear.

guseggert commented 2 years ago

@guseggert In the interest of being iterative can we just push the change through (drop this delay) and add the dynamic value support later?

cc: @iand

Definitely, most important thing is to have the instrumentation in place to observe the impact over time.

iand commented 2 years ago

We're running an experiment in Thunderdome for various values for the provider delay (0/20/50/100/200/500/1000 ms). I wrote up some early findings on Notion

The TTFB was markedly worse for delay values of 0ms, 20m and 50ms. Possibly we have better TTFB around 500 but I'd want to repeat the experiment with more values around that mark to get more confidence in that. See the Notion doc for more graphs.

Snapshot_2022-09-07_15-29-58

yiannisbot commented 1 year ago

Thanks looking into this @iand ! Difficult to explain what's going on from these results :-D Do you have any explanation?

Some questions regarding the experiment setup that might help us get closer to what's happening:

These experiments are executed from the gateways, right? Are all experiments from the same gateway? Would it make sense to request the same set of CIDs from different gateways simultaneously?
Do you request the same CIDs for all experiments/latencies, or different random ones? Where do you find these CIDs? And how many CIDs do you request per experiment?
If you're using the same set of CIDs it would make sense to drop the connections where you've found the requested CID before the next experiment - otherwise, Btiswap will find the CID directly (and have low latency in the graphs), but without this being a valid result.
- Not dropping the connections could explain the first experiment (0ms) taking a long time, but all subsequent ones would be fast, which is not the case.
Do you leave any time between the experiments? Or are they executed directly back to back? I'm assuming the sequence is the one mentioned above, i.e., 0ms, 20ms, 50ms, 100ms, etc., right?
- If you're not dropping the newly-found connections after each experiment, could it be that it takes the gateway nodes some time (in the order of ~500ms) to set up the new connection and therefore, after that response is rapid (i.e., at the bitswap discovery level)?
Do you have stats regarding the percentage of CIDs resolved through Bitswap vs the DHT?

We've done some experiments too and indeed a value around 200ms or a bit more would be recommended: if content is found through Bitswap, the node should have an answer after this period of time (roughly). We couldn't draw conclusions though as it turned out that the group of CIDs we were requesting were NFTs and therefore resolved to the nft.storage cluster, so no DHT was ever involved :-D Need to repeat the experiments with a different set of CIDs.

iand commented 1 year ago

Thanks looking into this @iand ! Difficult to explain what's going on from these results :-D Do you have any explanation?

This experiment is still running so I will have some additional data to add which I will do when I've analysed it a little.

Some questions regarding the experiment setup that might help us get closer to what's happening:
* These experiments are executed from the gateways, right? Are all experiments from the same gateway? Would it make sense to request the same set of CIDs from different gateways simultaneously?

Thunderdome starts new instances of Kubo for each experiment. In this case it's 8 new instances each with a different config setting for provider delay. During the experiment they all receive the same requests simultaneously.

* Do you request the same CIDs for all experiments/latencies, or different random ones? Where do you find these CIDs? And how many CIDs do you request per experiment?

All instances in an experiment get exactly the same traffic. It is a feed from the live gateways that we throttle to the desired rate. In the experiment I mentioned above we limited to 30rps. However that is excessive for these configurations and I re-ran the experiment over the weekend at 20rps.

* If you're using the same set of CIDs it would make sense to drop the connections where you've found the requested CID before the next experiment - otherwise, Btiswap will find the CID directly (and have low latency in the graphs), but without this being a valid result.

We don't control the CIDs - they are whatever the live gateways receive. We filter to just path requests though. The goal is to simulate the real load a gateway is subjected to. That includes 404s, repeated requests, bad cids, range requests etc.

  * Not dropping the connections could explain the first experiment (0ms) taking a long time, but all subsequent ones would be fast, which is not the case.

* Do you leave any time between the experiments? Or are they executed directly back to back? I'm assuming the sequence is the one mentioned above, i.e., 0ms, 20ms, 50ms, 100ms, etc., right?

They are run in parallel, continuously until we stop the experiment.

  * If you're not dropping the newly-found connections after each experiment, could it be that it takes the gateway nodes some time (in the order of ~500ms) to set up the new connection and therefore, after that response is rapid (i.e., at the bitswap discovery level)?

* Do you have stats regarding the percentage of CIDs resolved through Bitswap vs the DHT?

No, I would like this too so I'll be looking at the best way to get this data.

yiannisbot commented 1 year ago

Interesting setup 👌🏽 But I still can't explain where this performance difference comes from. Are these kubo nodes connected to the gateways? I guess they're not.

dennis-tra commented 1 year ago

@iand also a couple of questions from my side

Am I right that the Bitswap resolution is not cancelled when the ProviderSearchDelay is reached?
Is this the only option that you adjust for the experiments or do you have made any other code changes? If so, could you link the branch/fork/repo?

Disclaimer: I'm not super familiar with the Bitswap internals.

I just had a look through the Bitswap code and think that the provSearchDelay propagates to the initialSearchDelay of a Bitswap client session here.

This initialSearchDelay is then used for a timer that then calls this method which in turn seems broadcast WANT_HAVE's to connected clients here. However, this could happen very frequently for low provSearchDelay values as the resetIdleTick will reset the timer to the initialSearchDelay as long as the client hasn't received a block.

I'm probably missing something, so would appreciate it, if you could shed some light on that. I think I just want to draw attention to unintended side-effects that could arise from decreasing that delay.

lidel commented 1 year ago

Was this closed by mistake @Jorropo @BigLep?

Reopening, as (iiuc) https://github.com/ipfs/kubo/pull/9053 only added an internal configuration option (Internal.Bitswap.ProviderSearchDelay), and did not remove the delay.

It is still set to 1s (see docs from #9526) which means:

bitswap won't ask DHT and HTTP router (parallel querying added in https://github.com/ipfs/kubo/pull/9475) until 1s passed
this creates a precarious setup where we are sabotaging initial load of websites by adding artificial 1s delay to most of first loads. With IPNI (HTTP router), we should be able to cut entire 1s from initial requests, no?

Perhaps after Kubo 0.18 ships with Routing.Type=auto we should run benchmarks again to see if the addition of IPNI put a finger on the scale? Or does IPNI (cid.contact) changes things enough for us to flip default to 0s before final Kubo 0.18 release?

guillaumemichel commented 1 year ago

We got more insights on this 1s timeout from the Effectiveness of Bitswap Discovery Process study. I am fully in favor of setting the default value of ProviderSearchDelay from 1s to 0. The associated tradeoffs and overheads associated are discussed here.

Moreover, reducing the number of directly connected peers (lowering System.ConnsInbound) https://github.com/ipfs/kubo/pull/9483 will certainly result in a sharp drop of the Bitswap discovery success rate. It isn't a very big deal, as the DHT is much faster than it was and we now have IPNI.

The number of CIDs that Bitswap isn't able to discover is expected to increase significantly it will broadcast the request to ~8x less peers. These unsuccessful requests will wait for ProviderSearchDelay=1s before contacting the DHT/IPNI, which isn't desirable.

The number of additional messages generated by starting a DHT walk concurrently to Bitswap is expected to be minimal (~3). I didn't study IPNI requests, but intuitively the number of additional sent message should be minimal too (1?). However, I don't know whether the Indexers have the capacity to serve ALL requests, or they target only the content that isn't found by Bitswap (in this case, the ProviderSearchDelay can still be reduced from 1s) @willscott

willscott commented 1 year ago

Do we have a sense of what multipler of traffic to DHT, IPNI will be associated with removing the search delay?

guillaumemichel commented 1 year ago

We measured that currently 93.65% are successful within the first second after starting Bitswap, implying that only 6.35% of the request make it to the DHT. Starting the DHT concurrently to Bitswap is expected to increase the number of DHT FindContent requests by 16x ( $\frac{1}{6.35\%}$).

Source: Effectiveness of Bitswap Discovery Process report

willscott commented 1 year ago

cc @masih

I believe IPNI can handle a 10x increase on reads.
Do we have a sense of if there are intermediate solutions that result in less read amplification? e.g. try the bitswap peer you most recently got a block from (or maybe the set of connected bitswap peers you've gotten a block from w/i the last second), and if that opportunistic query doesn't work go to IPNI/DHT immediately?

guillaumemichel commented 1 year ago

An easy intermediate solution would be setting the ProviderSearchDelay to 200 ms for instance. In $74.74\%$ of Bitswap requests, the content is fetched in 200 ms (for our experiment setting, latency may vary according to node connectivity/location of the content), starting a DHT walk after 200 ms instead of 1s would increase the DHT traffic ~4x. However, this value (200 ms here) should be adaptive and correspond to the ~75th percentile of the content discover+fetch latency for each individual node.

e.g. try the bitswap peer you most recently got a block from

That would be a solution to reduce Bitswap broadcast while keeping a high discovery success rate, not to reduce the Content Routers (DHT + IPNI) load. Unless we wait for the response from these peers before asking the Content Routers, in which case the situation is similar as the above.

More suggestions to limit Bitswap broadcast are described here

lidel commented 1 year ago

Thanks!

In effort to move forward, I've opened https://github.com/ipfs/kubo/pull/9530 to set ProviderSearchDelay to 0s.

Rationale on the PR, tldr is that default settings in Kubo 0.18 maintains fewer connections (iiuc, this decreases number of bitswap lookup spam messages by ~700), so adding ~4 DHT+IPNI lookups sounds like an acceptable trade-off.

Would appreciate review – we would like to resolve this before final Kubo 0.18 release (either0s or something like 400ms)

BigLep commented 1 year ago

Great stuff here @guillaumemichel ! I really enjoyed reading Effectiveness of Bitswap Discovery Process.

Per https://github.com/ipfs/kubo/pull/9530#pullrequestreview-1238422144, I'm in agreement to drop the delay to 0s. Kubo nodes with default configuration are going to experience dramatically less network traffic given https://github.com/ipfs/kubo/pull/9483 even with the extra DHT / IPNI requests. cid.contact is the one taking the extra read load, but given they can handle the extra traffic, I think we're good to roll this out.

Also, can someone confirm that we have a metric in Kubo that will make it clear what proportion of CIDs are satisfied with Bitswap discovery vs. DHT vs. IPNI. That will be the key metric we'll want to look at when this gets deployed to the ipfs.io gateway.

@guillaumemichel : do you think it's reworth running some number with smaller connected peer amounts since that is what Kubo is moving towards in 0.18?

guillaumemichel commented 1 year ago

Also, can someone confirm that we have a metric in Kubo that will make it clear what proportion of CIDs are satisfied with Bitswap discovery vs. DHT vs. IPNI. That will be the key metric we'll want to look at when this gets deployed to the ipfs.io gateway.

This would be awesome to have!

@guillaumemichel : do you think it's reworth running some number with smaller connected peer amounts since that is what Kubo is moving towards in 0.18?

It would be useful to benchmark the discovery+fetch latency (ipfs get for a single block) to compare kubo 0.17 (850 peers, 1s ProviderSearchDelay) and kubo 0.18 (100 peers, 0s ProviderSearchDelay). The discovery+fetch latency is expected to increase slightly, we need to make sure that it is still acceptable.

lidel commented 1 year ago

Also, can someone confirm that we have a metric in Kubo that will make it clear what proportion of CIDs are satisfied with Bitswap discovery vs. DHT vs. IPNI. That will be the key metric we'll want to look at when this gets deployed to the ipfs.io gateway.

I don't believe we have that atm (only saw prometheus stats for bandwidth/block count for bitswap itself, no routing metrics) – cc @guseggert / @ajnavarro to confirm.

Notes from today's Kubo standup:

We would like to improve things, but removal of the delay feels too risky: we don't know what will be the impact on IPNI, DHT servers, and node CPU/Bandwidth.
- cid.contact asked us to park https://github.com/ipfs/kubo/pull/9530 for now.
There is more to the story than latency measured as TTFB.
- CPU and Bandwidth are often more important, especially Bandwidth spent on routing (DHT/bitswap), as it translates to operational costs.
- Gateway at ipfs.io/dweb.link does run with default settings. For example, they have custom connection manager limits of 3000/5000.
- We want to understand what the default settings in Kubo should be. Fine also to test custom gateway setup, but we need to also test with the current defaults.
- Given the CID churn, we either need fresh bitswap sniff, OR use TOP 1000 requested content paths at ipfs.io/dweb.link in the week or two.
- We suggest using gateway paths instead of disjoint CIDs because that simulates real world use better: resolving a content path produces more than a single lookup – getting c from /ipfs/cid/a/b/c on empty data store could produce 4 remote lookups.

Proposal: run tests again and look at TTFB next to CPU and Bandwidth utilization.

Test ProviderSearchDelay=1s, 500ms, [200ms, 100ms, 50ms, 20ms], 0s with:
- kubo 0.17: Routing.Type=dht (default ConnMgr at 600/900, ~800 peers)
- kubo 0.18-rc2: Routing.Type=auto (DHT+IPNI at cid.contact) (default ConnMgr at 32/96, ~100peers)
- kubo 0.18-rc2: Routing.Type=dht (no IPNI) (default ConnMgr at 32/96)
- kubo 0.18-rc2 simulating ipfs.io/dweb.link config (custom ConnMgr at 3000/5000)

This should give us data to see if there are any unexpected CPU/Bandwidth spikes/savings.

@guillaumemichel thoughts? I don't know how flexible id your setup / how much work this involved.

Is this something probelab will be able to test before we ship 0.18 (next two-three weeks)? Or should we park this and invest time into building tests and adding mentioned metrics to also understand which CIDs are satisfied with Bitswap discovery vs. DHT vs. IPNI?

If we can't benchmark this any time soon, we had a mild consensus during Kubo standup about to lower ProviderSearchDelay to conservative 500ms for Kubo 0.18 – imo worth doing, still better than 1s, but lower risk given https://github.com/ipfs/kubo/issues/8807#issuecomment-1240426897 and lack of better data.

guillaumemichel commented 1 year ago

Proposal: run tests again and look at TTFB next to CPU and Bandwidth utilization.

Test ProviderSearchDelay=1s, 500ms, [200ms, 100ms, 50ms, 20ms], 0s with:

kubo 0.17: Routing.Type=dht (default ConnMgr at 600/900, ~800 peers)

kubo 0.18-rc2: Routing.Type=auto (DHT+IPNI at cid.contact) (default ConnMgr at 32/96, ~100peers)

kubo 0.18-rc2: Routing.Type=dht (no IPNI) (default ConnMgr at 32/96)

kubo 0.18-rc2 simulating ipfs.io/dweb.link config (custom ConnMgr at 3000/5000)

That is a perfect usecase for Thunderdome! @iand can we easily measure CPU and Bandwidth in addition to the TTFB?

We won't be able to start the measurements this week, but we can certainly have them done before two-three weeks.

If we can't benchmark this any time soon, we had a mild consensus during Kubo standup about to lower ProviderSearchDelay to conservative 500ms for Kubo 0.18 – imo worth doing, still better than 1s, but lower risk given https://github.com/ipfs/kubo/issues/8807#issuecomment-1240426897 and lack of better data.

Agree with that if we don't have better results by then.

guillaumemichel commented 1 year ago

@lidel @ajnavarro Would it be easy to set a different Provider Delay for the DHT and for the Indexers? This would allow us to set the Provider Delay to 0 for the DHT, while keeping it to 1s (or 500ms) for the Indexers, limiting the Indexers' load.

ajnavarro commented 1 year ago

@guillaumemichel yeah, you can set your own routing config and use the ExecuteAfter parameter for the parallel router to execute delegated routing a bit after the DHT if needed: https://github.com/ipfs/kubo/blob/master/docs/config.md#routingrouters-parameters

iand commented 1 year ago

Proposal: run tests again and look at TTFB next to CPU and Bandwidth utilization.

Test ProviderSearchDelay=1s, 500ms, [200ms, 100ms, 50ms, 20ms], 0s with:

kubo 0.17: Routing.Type=dht (default ConnMgr at 600/900, ~800 peers)

kubo 0.18-rc2: Routing.Type=auto (DHT+IPNI at cid.contact) (default ConnMgr at 32/96, ~100peers)

kubo 0.18-rc2: Routing.Type=dht (no IPNI) (default ConnMgr at 32/96)

kubo 0.18-rc2 simulating ipfs.io/dweb.link config (custom ConnMgr at 3000/5000)

That is a perfect usecase for Thunderdome! @iand can we easily measure CPU and Bandwidth in addition to the TTFB?

We won't be able to start the measurements this week, but we can certainly have them done before two-three weeks.

If we can't benchmark this any time soon, we had a mild consensus during Kubo standup about to lower ProviderSearchDelay to conservative 500ms for Kubo 0.18 – imo worth doing, still better than 1s, but lower risk given #8807 (comment) and lack of better data.

Agree with that if we don't have better results by then.

Yes we measure CPU already. Bandwidth can be added, though I might need to grab those metrics from AWS separately

willscott commented 1 year ago

cross posting from #9530 since i missed it here. https://github.com/ipfs/kubo/pull/9530#issuecomment-1385777419

no need to not increase indexer usage at the same time as DHTs.

Presumably any increase in queries is an equal concern for DHTs as for indexers, so we are okay with the rollout happening together. we'd like to be involved in understanding when those steps happen so we have the same visibility as the stewards.

lidel commented 1 year ago

Thunderdome Experiment Summary: provdelayrouting confirmed that removing delay does more harm than good:

P99 time to first byte is significantly worse in v0.18 using dht routing when the provider delay is reduced. Comparing v0.18 and v0.17 at 500ms delay v0.18 has a P99 of 22.9s, compared to 16.9s for v0.17 (+35%) and at 0ms delay v0.18 has a P99 of 21.5s compared to 9.2s for v0.17 (+133%). The median values show a similar decrease in performance.
We've discussed it during 2023-02-06 Content Routing WG #5 today and decided to close https://github.com/ipfs/kubo/pull/9530 for now
- Additional work around the way routing works need to happen, around volume of routing broadcasts and the order (bitswap vs DHT vs IPNI).

guillaumemichel commented 1 year ago

The Thunderdome Experiment Summary: provdelayrouting measurements and report were performed by @iand 😉

BigLep commented 1 year ago

Note that this work is blocked on "go-libp2p-kad-dht refactor work" (is there a tracking issue for this @guillaumemichel )? Once that is completed, this can be attempted again.

ipfs / kubo

Optimizing (or remove) the provider delay interval for DHT lookups used by Bitswap #8807