libp2p / rust-libp2p

The Rust Implementation of the libp2p networking stack.
https://libp2p.io
MIT License
4.38k stars 912 forks source link

kad: get_providers() sometimes resolves with FinishedWithNoAdditionalRecord and no closest peers #5418

Open nazar-pc opened 2 months ago

nazar-pc commented 2 months ago

I have an app that does a lot of get_providers() calls via Kademlia.

Most of the time they finish successfully, but occasionally the first and only query result is GetProvidersOk::FinishedWithNoAdditionalRecord with empty list of closest peers, which seems wrong both because the key is actually present on the network and due to closest peers list being empty.

It would be reasonable if the key was not found and I got a list of closest peers, but the fact that list of closest peers is empty while Kademlia k-buckets are not empty tells me something is off here, but I'm having a hard time tracing what, where and why.

Sometimes I first get GetProvidersOk::FoundProviders with empty list followed by GetProvidersOk::FinishedWithNoAdditionalRecord with a single peer ID, again for a key that definitely exists on the network.

Any suggestions for debugging steps here?

libp2p-kad 0.45.3 (latest at the moment of writing), disjoint_query_paths: true

guillaumemichel commented 2 months ago

I am not very familiar with the disjoint paths query. It seems to be because the ClosestPeersIters are empty, but I am not sure why?

Does the same issue occur if you use another query routing technique (closest or fixed)?

nazar-pc commented 2 months ago

Does the same issue occur if you use another query routing technique (closest or fixed)?

I'm just calling https://docs.rs/libp2p-kad/0.45.3/libp2p_kad/struct.Behaviour.html#method.get_providers in this case. I guess I can try to modify the code to try it out, but it is slower to test that way.

Is your node connected with the rest of the network when the query fails?

I'm quite sure that I am. There are many concurrent queries happening and only a few end up like this and not even every time, sometimes I have 1000 successful in a row.

Is it possible that the querying node is storing the provider record itself, and that all entries are expired?

Not possible, the node making requests in thsi particular situation is such that physically and by design has no records.

Do you enable Caching? If yes, was the key requested already by the node?

I did not and wasn't even aware of such cache :sweat_smile:

guillaumemichel commented 2 months ago

Do you enable Caching? If yes, was the key requested already by the node?

I did not and wasn't even aware of such cache 😅

By default you should get 1 peer in the FinishedWithNoAdditionalRecord.

Does the same issue occur if you use another query routing technique (closest or fixed)?

I'm just calling https://docs.rs/libp2p-kad/0.45.3/libp2p_kad/struct.Behaviour.html#method.get_providers in this case. I guess I can try to modify the code to try it out, but it is slower to test that way.

You can simply try to set disjoint_query_paths: false, and see if the problem still occurs.

Is the code you are working on public?

nazar-pc commented 2 months ago

You can simply try to set disjoint_query_paths: false, and see if the problem still occurs.

I will.

Is the code you are working on public?

Yes, all of it, but it is a big project and might be a bit tricky as a demo of this issue. Let me do some experiments and post results here.

nazar-pc commented 2 months ago

It does happen with and without disjoint query paths.

A smaller reproduction using https://github.com/subspace/subspace/tree/debug-piece-not-found-libp2p-wip is this:

cargo run --example benchmark -- --bootstrap-nodes=/dns/bootstrap-0.gemini-3h.subspace.network/tcp/30533/p2p/12D3KooWK7NuL4S6aEdy5gELnvhCGo6EyrWVARnBy7W4AJTVkaF1 --bootstrap-nodes=/dns/bootstrap-1.gemini-3h.subspace.network/tcp/30533/p2p/12D3KooWQK33n2raSXzjH8JyqbFtczBmbwZiK9Tpicdw3rveJesj --protocol-version=0c121c75f4ef450f40619e1fca9d1e8e7fbabc42c895bc4790801e85d5a91c34 --out-peers=100 --pending-out-peers=100 parallel --parallelism-level=100 --max-pieces=256 --start-with=256 --retries=0

You'll see some Piece not found and for some of them (not all, keys are printed in logs) you'll see the only records matched by key are these:

2024-05-30T14:37:33.057585Z ERROR subspace_networking::node_runner: Get providers query yielded no results key="90b2ce0508c101000000000000" closest_peers=0 closest_peers=[]
2024-05-30T14:37:33.057618Z  WARN benchmark: Piece not found. piece_index=449 key="90b2ce0508c101000000000000" pure_duration=9.319213208s full_duration=180.36635061s

While expectation is to see something like this (another one that was not found):

2024-05-30T14:37:27.970135Z ERROR subspace_networking::node_runner: Get providers query yielded 0 results key="90b2ce05089201000000000000"
2024-05-30T14:37:30.060845Z ERROR subspace_networking::node_runner: Get providers query yielded no results key="90b2ce05089201000000000000" closest_peers=1 closest_peers=[PeerId("12D3KooWLjVBE9kzygAVTL1qHfEF8WpupjkUqgqnpUYZTRPvjL3y")]
2024-05-30T14:37:50.274517Z  WARN benchmark: Piece not found. piece_index=402 key="90b2ce05089201000000000000" pure_duration=30.749968564s full_duration=197.583262458s

Error logs are not errors, I just made them errors so I have different colors :upside_down_face:

subspace-networking is the crate that encapsulates our libp2p-based networking stack used for these examples.

guillaumemichel commented 2 months ago

What is the size of the network?

Given the high duration of the lookups (I am not sure which duration should be considered), it seems like the node was able to establish some connections and perform the lookup even though it didn't find the record. So it should have some peers to return in the GetProvidersOk::FinishedWithNoAdditionalRecord.

nazar-pc commented 2 months ago

What is the size of the network?

We have over 8000 consensus nodes (clients) and at least the same number (likely higher) farmers that are serving data to clients.

So it should have some peers to return in the GetProvidersOk::FinishedWithNoAdditionalRecord.

That is my understanding as well.