probe-lab / go-kademlia

Generic Go Kademlia implementation

Implement Provider Store Module #3

Closed. iand closed this issue 11 months ago.

iand commented 1 year ago

ETA: 2023-07-31

There are currently no implementations of the Provider Store module.

Note that the same kind of mechanism can be used to store both IPNS records and Provider Records.

In a DHT Server node, the role of the Provider Store is to store Provider Records/IPNS records that have been assigned to self, and to serve them to whoever requests them. It is expected that not all provider records fit into memory. Provider records are usually republished every 22 hours, and some are accessed much more often than others (some are never requested).

The store should only keep the data it can parse, not the full Protobuf record.

When storing data on disk, we must be careful not to block on IO. IO should happen on a separate goroutine that exits once the IO completes, at which point the single worker resumes the operation.

Ideally the provider store would contain a (single/multi-layer?) cache. For instance, it could be an LRU cache with the L1 layer stored in memory so we don't have to do IO for popular Provider Records/IPNS keys.
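
For illustration, a minimal sketch of such a layered store, assuming a hypothetical `CachedStore` and `Backend`; only `github.com/hashicorp/golang-lru/v2` is a real package. Reads are served from the in-memory L1 when possible, and writes hit disk on a fresh goroutine so the single worker never blocks on IO:

```go
package providerstore

import (
    lru "github.com/hashicorp/golang-lru/v2"
)

// Backend is a stand-in for whatever on-disk datastore ends up being used.
type Backend interface {
    Get(key string) ([]byte, error)
    Put(key string, value []byte) error
}

// CachedStore keeps popular records in an in-memory LRU (L1) and falls
// back to the disk-backed Backend on a miss.
type CachedStore struct {
    l1      *lru.Cache[string, []byte]
    backend Backend
}

func NewCachedStore(size int, backend Backend) (*CachedStore, error) {
    c, err := lru.New[string, []byte](size)
    if err != nil {
        return nil, err
    }
    return &CachedStore{l1: c, backend: backend}, nil
}

// Get serves popular Provider Records/IPNS keys without touching disk.
func (s *CachedStore) Get(key string) ([]byte, error) {
    if v, ok := s.l1.Get(key); ok {
        return v, nil
    }
    v, err := s.backend.Get(key)
    if err != nil {
        return nil, err
    }
    s.l1.Add(key, v)
    return v, nil
}

// Put updates L1 synchronously and performs the disk IO on a fresh
// goroutine that exits once the write completes; the single worker
// resumes the operation when it receives the result on the channel.
func (s *CachedStore) Put(key string, value []byte) <-chan error {
    s.l1.Add(key, value)
    done := make(chan error, 1)
    go func() {
        done <- s.backend.Put(key, value)
    }()
    return done
}
```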

iand commented 1 year ago

go-kademlia should define the interfaces it needs to support record storage and allow applications to pass an appropriate implementation.
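
A minimal sketch of the shape such an interface could take (`RecordStore` and `Record` are placeholder names, not a concrete API proposal):

```go
package kademlia

import (
    "context"
    "time"
)

// Record is an already-parsed payload; per the issue description, the
// store keeps parsed data rather than the full Protobuf record.
type Record struct {
    Key        []byte
    Value      []byte
    ReceivedAt time.Time
}

// RecordStore is implemented by the application (e.g. go-libp2p-kad-dht)
// and passed to go-kademlia, which stays record-type agnostic.
type RecordStore interface {
    Put(ctx context.Context, rec Record) error
    Get(ctx context.Context, key []byte) (Record, error)
}
```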

iand commented 1 year ago

go-kademlia needs to provide record put/get functionality, but the specific record types (provider/ipns) belong in go-libp2p-kad-dht, as per the discussion in https://github.com/plprobelab/go-kademlia/issues/34#issuecomment-1614516744

dennis-tra commented 1 year ago

Some observations:

iand commented 1 year ago

How do we handle the dependency on the peer store? That's a libp2p package.

Let's create a peer interface in go-kademlia and work in terms of that. KadDHT can then supply a libp2p peer.
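
For illustration, the interface could be as small as this (names hypothetical); go-libp2p-kad-dht would implement it on top of a libp2p peer.ID and peerstore:

```go
package kademlia

// Peer identifies a DHT node without depending on libp2p types.
type Peer interface {
    // ID returns a stable byte representation of the node identity,
    // usable for XOR-distance calculations.
    ID() []byte
}
```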

dennis-tra commented 1 year ago

I'm still trying to figure out which datastore abstractions make sense for our current goals. I think go-kademlia should be record-type agnostic: it shouldn't be concerned with whether it's a provider/peer/IPNS record, it should just verify that it's a valid record.

To understand this better, I went through the different PUT/GET operations of the current DHT implementation and noted which messages carry which record. I also want to know which messages we want to support in the go-kademlia vs. go-libp2p-kad-dht vs. boxo repositories.

go-libp2p-kad-dht defines the message types PUT_VALUE, GET_VALUE, ADD_PROVIDER, GET_PROVIDERS, FIND_NODE, and PING.

I briefly summarized what's happening for each of the messages to be able to compare it with https://github.com/libp2p/specs/tree/c733210b3a6c042d01f6b39f23c0c9a3a20d3e88/kad-dht. Feel free to skip this.

PUT_VALUE

  1. Extract record -> yields a go-libp2p-record
  2. Check if PUT key equals record key
  3. Sanitise record to avoid storing arbitrary data
  4. Validate record
  5. Derive Datastore key
  6. Acquire striped lock
  7. Lookup existing record from datastore
  8. Let the validator decide if existing record is "better"
  9. Assign "received at" timestamp
  10. Marshal record to proto
  11. Store record in datastore
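
To make the flow concrete, here is a rough sketch of these steps in Go. Every name below is a stand-in invented for illustration, not the actual go-libp2p-kad-dht code, and a plain mutex stands in for the striped lock of step 6:

```go
package dhtsketch

import (
    "bytes"
    "errors"
    "sync"
    "time"
)

// Stand-in types invented for this sketch.
type Record struct {
    Key        []byte
    Value      []byte
    ReceivedAt time.Time
}

type Validator interface {
    Validate(rec Record) error // steps 3-4: sanitise and validate
    Better(a, b Record) bool   // step 8: is a preferable to b?
}

type Datastore interface {
    Get(key string) (Record, bool)
    Put(key string, rec Record)
    Delete(key string)
}

var mu sync.Mutex // simplification of the striped lock in step 6

// handlePutValue assumes step 1 (extracting the record from the
// protobuf message) has already happened.
func handlePutValue(ds Datastore, v Validator, msgKey []byte, rec Record) error {
    // Step 2: the key in the message must match the key in the record.
    if !bytes.Equal(msgKey, rec.Key) {
        return errors.New("record key mismatch")
    }
    // Steps 3-4: refuse to store arbitrary or invalid data.
    if err := v.Validate(rec); err != nil {
        return err
    }
    // Step 5: derive the datastore key (the scheme is illustrative).
    dsKey := "/records/" + string(rec.Key)
    // Step 6: serialise concurrent writes to the same key.
    mu.Lock()
    defer mu.Unlock()
    // Steps 7-8: keep the existing record if the validator prefers it.
    if existing, ok := ds.Get(dsKey); ok && v.Better(existing, rec) {
        return nil
    }
    // Steps 9-11: timestamp and store (proto marshalling elided).
    rec.ReceivedAt = time.Now()
    ds.Put(dsKey, rec)
    return nil
}
```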

GET_VALUE

  1. Verify if there's a key to lookup the record for
  2. Derive datastore key
  3. Load data from datastore
  4. Unmarshal record from datastore
  5. Verify record age
  6. If record is too old -> delete record from datastore
  7. Check for better peers to query in routing table
  8. Filter out self+requester from closer peers
  9. Check peerstore for addresses of closer peers
  10. Craft response message (assign record + closer peers)
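
And a matching sketch of the read path, reusing the stand-in types from the PUT_VALUE sketch above. The 36-hour maximum record age is illustrative, and steps 7-10 are omitted since they concern the response message rather than the store:

```go
// maxRecordAge is illustrative; in practice it is a config option.
const maxRecordAge = 36 * time.Hour

func handleGetValue(ds Datastore, key []byte) (Record, bool) {
    // Steps 1-3: derive the datastore key and load the record.
    dsKey := "/records/" + string(key)
    rec, ok := ds.Get(dsKey)
    if !ok {
        return Record{}, false
    }
    // Steps 5-6: records past their maximum age are deleted on read.
    if time.Since(rec.ReceivedAt) > maxRecordAge {
        ds.Delete(dsKey)
        return Record{}, false
    }
    return rec, true
}
```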

ADD_PROVIDER

This is actually a PUT_VALUE operation.

  1. Verify key in message
  2. Extract "Provider Peers" from message
  3. For every provider peer:
    1. Add addresses to peer store
    2. Check LRU cache and replace entry if it exists (refresh TTL)
    3. Derive provider key
    4. Store record in datastore

FIND_NODE

This is actually a GET_VALUE operation:

  1. Prepare response message with "cluster level" from received message (?)
  2. Check for better peers to query in routing table
  3. Filter out self+requester from closer peers
  4. Add peers to response that we have addresses for

GET_PROVIDERS

  1. Verify key in message
  2. Lookup providers in provider store
  3. Check LRU cache for provider set
  4. Derive datastore key
  5. Query data store for data
  6. Verify record age
  7. If record is too old -> delete record from datastore
  8. Extract peerID from record key
  9. Construct peer set
  10. Enrich peer set with peer addresses
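
Steps 8 and 10 hint at how the provider store works: the provider's peer ID is encoded in the datastore key itself, so the stored value stays tiny. A hypothetical key scheme illustrating the idea (the actual key layout in go-libp2p-kad-dht may differ):

```go
package dhtsketch

import "strings"

// providerKey nests the provider's peer ID under the content key, so a
// prefix query over "/providers/<key>/" yields the full provider set.
func providerKey(contentKey, peerID string) string {
    return "/providers/" + contentKey + "/" + peerID
}

// peerIDFromKey recovers the peer ID from the final path segment,
// matching step 8 above.
func peerIDFromKey(dsKey string) string {
    return dsKey[strings.LastIndexByte(dsKey, '/')+1:]
}
```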

The original Kademlia paper defines four RPCs: PING, STORE, FIND_NODE, and FIND_VALUE.

As I see it, this translates to: PUT_VALUE and ADD_PROVIDER map to STORE, GET_VALUE and GET_PROVIDERS map to FIND_VALUE, and FIND_NODE maps directly to FIND_NODE.

Now the question: should we define any message types in go-kademlia at all, or solely operate on interfaces? The messages we should provide some logic for are, IMO, the ones that the original Kademlia paper defines. This includes PING. PING is deprecated in the libp2p Kademlia spec, but I think it should be part of go-kademlia because we plan to make it independent of libp2p, and PING is part of the original spec.

If we don't want to force users of go-kademlia into a message format (e.g., the protobuf format that we're currently using), which is what I would opt for, then go-kademlia would need to hand control to the user on every message to, e.g., verify its validity. The user then decides if it's a valid message, and so on.

Alternatively, we could implement a way to register message formats for STORE, FIND_NODE, FIND_VALUE operations + a way to determine the message type if we receive a request. go-kademlia could require at least one message format to be registered for any of the above types.

Different message formats would need to implement different things depending on their type (sketched as Go interfaces after the lists below):

STORE:

  1. Generate a datastore key
  2. Validate an incoming message
  3. Extract data to store from the message

FIND_NODE:

  1. Expose the target key (to calculate closer peers)
  2. Validate an incoming message
  3. Expose a way to set closer peers on the message format

FIND_VALUE:

  1. Expose the target key (to calculate closer peers)
  2. Generate a datastore key
  3. Validate an incoming message
  4. Expose a way to set closer peers on the message format
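
Expressed as Go interfaces, the three contracts above could look roughly like this (all names are hypothetical, and Peer stands for whatever node abstraction go-kademlia ends up with):

```go
package kademlia

// Peer is a placeholder for go-kademlia's node abstraction.
type Peer interface{ ID() []byte }

// StoreMessage covers PUT_VALUE/ADD_PROVIDER-style requests.
type StoreMessage interface {
    Validate() error      // 2. is the incoming message well-formed?
    DatastoreKey() []byte // 1. where the payload is stored
    Payload() []byte      // 3. the data to store
}

// FindNodeMessage covers FIND_NODE-style requests.
type FindNodeMessage interface {
    Validate() error
    Target() []byte              // key to compute closer peers for
    SetCloserPeers(peers []Peer) // fill in the response
}

// FindValueMessage covers GET_VALUE/GET_PROVIDERS-style requests.
type FindValueMessage interface {
    Validate() error
    Target() []byte
    DatastoreKey() []byte
    SetCloserPeers(peers []Peer)
}
```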

For example, go-libp2p-kad-dht would register GET_VALUE and GET_PROVIDERS for the FIND_VALUE operation. If a request comes in, go-kademlia would determine which type of message it is (based on a function or an interface that all messages must implement) and would then 1) verify the message, 2) generate the datastore key, 3) look up the data in the local datastore, 4) determine closer peers, and 5) craft a response message.

Each message type could optionally bring its own datastore implementation. If it doesn't bring one, a default one is used. This means the GET_PROVIDERS datastore logic could be slightly different from the GET_VALUE logic.
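
In Go, the optional datastore could fall out of an optional interface assertion, reusing the Datastore stand-in from the earlier sketch:

```go
// hasDatastore is an optional interface that a message format can
// implement to bring its own datastore; otherwise the default is used.
type hasDatastore interface {
    Datastore() Datastore
}

func datastoreFor(format any, def Datastore) Datastore {
    if f, ok := format.(hasDatastore); ok {
        return f.Datastore()
    }
    return def
}
```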


Now that I think about it, message formats are actually strongly coupled to record types. A provider record is associated with ADD_PROVIDER/GET_PROVIDERS; an IPNS/peer record is associated with PUT_VALUE/GET_VALUE. So users could register record types with go-kademlia, and these record types could then define the message formats that are used to store and retrieve them.
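
Building on the message interfaces sketched above, the registration idea could take a shape like this (again, entirely hypothetical):

```go
// RecordType couples a record kind to the message formats used to store
// and retrieve it, e.g. provider records -> ADD_PROVIDER/GET_PROVIDERS.
type RecordType interface {
    Name() string
    ParseStore(raw []byte) (StoreMessage, error)
    ParseRetrieve(raw []byte) (FindValueMessage, error)
}

// Registry is what go-kademlia would consult when a request arrives.
type Registry struct {
    types map[string]RecordType
}

func (r *Registry) Register(rt RecordType) {
    if r.types == nil {
        r.types = make(map[string]RecordType)
    }
    r.types[rt.Name()] = rt
}
```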

iand commented 1 year ago

I haven't fully formed my thoughts yet but wanted to respond with a few notes before the weekend.

There is already the concept of a Message interface in https://github.com/plprobelab/go-kademlia/tree/main/network/message. I feel the protobuf-specific messages will move to go-libp2p-kad-dht.

The messages are associated with a protocol ID: https://github.com/plprobelab/go-kademlia/blob/main/network/address/address.go#L24

The Endpoint interface is the part that ties these together (https://github.com/plprobelab/go-kademlia/blob/main/network/endpoint/endpoint.go#L22) and I think the only essential method there is SendRequestHandleResponse.

A query sends a Message to an Endpoint using a Protocol, so I think the abstractions are there. But I think the specific implementations of the IPFS DHT messages, endpoints, and protocols belong in go-libp2p-kad-dht. go-kademlia concerns itself with routing the message to the expected destination(s) and handling responses according to the protocol and the query's handler (I am hand-waving here; I don't know if it does this at the moment).

yiannisbot commented 1 year ago

One thing to clarify here is whether we need the PING functionality. From my perspective it would be useful to have, but I don't have context on why it was deprecated before. Was it to avoid overload, or abuse of some sort? Or was it just not needed? Maybe @aschmahmann can shed some light?

dennis-tra commented 1 year ago

Can't recall the exact reasons but here are some previous discussions @yiannisbot:

guillaumemichel commented 1 year ago

Great summary @dennis-tra !

I think that the datastore and the PUT_VALUE, GET_VALUE, ADD_PROVIDER, and GET_PROVIDERS RPCs could be implemented in go-libp2p-kad-dht directly (and not go-kademlia), because these RPCs are specific to IPFS/libp2p. I see the go-kademlia (to be renamed) repo more as a tool for building DHTs that define their own RPCs; go-kademlia wouldn't contain RPCs that aren't required for the purpose of routing (the only RPC that seems necessary is FIND_NODE).

Each message type could optionally bring its own datastore implementation. If it doesn't bring one, a default one is used.

I really like this principle. As long as the DHT implementation follows the specs of the network it is participating in, it can use any datastore implementation. It means that in the go-libp2p-kad-dht implementation we could have the default datastore but make it possible to replace it with another one. And it would be possible to have different datastores for Provider and IPNS records.

There is already the concept of a Message interface in https://github.com/plprobelab/go-kademlia/tree/main/network/message. I feel the protobuf-specific messages will move to go-libp2p-kad-dht.

:+1:

The messages are associated with a protocol ID: https://github.com/plprobelab/go-kademlia/blob/main/network/address/address.go#L24

The Protocol ID is only required by libp2p. We can probably remove it from go-kademlia (or make it a parameter of the libp2p module, if we keep it).

go-kademlia concerns itself with routing the message to the expected destination(s) and handling responses according to the protocol and the query's handler (I am hand-waving here; I don't know if it does this at the moment).

I agree that go-kademlia concerns itself with routing the messages to the expected destinations. However, handling responses should be done by the DHT implementation (e.g. go-libp2p-kad-dht), because this behavior depends on the protocol. go-kademlia could have a generic server interface, but the specific server implementations (such as basicserver) should move elsewhere.

One thing to clarify here is whether we need the PING functionality.

I am not sure what a specific DHT PING RPC would achieve. If we need to test connectivity to a host, we can simply use libp2p or ICMP ping. And if we want to check whether a node can actually respond to a DHT request, we can simply send a FIND_NODE request to this node. But maybe there was another use for this RPC that I missed?

dennis-tra commented 1 year ago

I think that the datastore and the PUT_VALUE, GET_VALUE, ADD_PROVIDER, and GET_PROVIDERS RPCs could be implemented in go-libp2p-kad-dht directly (and not go-kademlia), because these RPCs are specific to IPFS/libp2p. I see the go-kademlia (to be renamed) repo more as a tool for building DHTs that define their own RPCs; go-kademlia wouldn't contain RPCs that aren't required for the purpose of routing (the only RPC that seems necessary is FIND_NODE).

πŸ‘ πŸ‘ πŸ‘

The Protocol ID is only required by libp2p. We can probably remove it from go-kademlia (or make it a parameter of the libp2p module, if we keep it).

πŸ‘

I agree that go-kademlia concerns itself with routing the messages to the expected destinations. However, handling responses should be done by the DHT implementation (e.g. go-libp2p-kad-dht), because this behavior depends on the protocol. go-kademlia could have a generic server interface, but the specific server implementations (such as basicserver) should move elsewhere.

This is something I came across last week: I was wondering why go-kademlia would need to define a server interface at all. Do we want to enqueue requests to the event loop as well? Looking at the handler functions in go-libp2p-kad-dht, this seems like an unnecessary indirection (as long as the routing table and datastore are thread-safe).

I am not sure what a specific DHT PING RPC would achieve. If we need to test connectivity to a host, we can simply use libp2p or ICMP ping. And if we want to check whether a node can actually respond to a DHT request, we can simply send a FIND_NODE request to this node. But maybe there was another use for this RPC that I missed?

ICMP is used to test whether a host is reachable, not necessarily whether the process you're looking for is online. The libp2p ping wouldn't count against any resource-manager kad/dht stream limits (which could be a good thing?). Generally, I think a FIND_NODE request is a valid way to test whether a remote peer speaks the protocol. A FIND_NODE request uses a few more CPU cycles than a plain PING request (the remote peer needs to determine the closest peers it knows to a specific key) 🤷

guillaumemichel commented 1 year ago

ICMP is used to test whether a host is reachable, not necessarily whether the process you're looking for is online. The libp2p ping wouldn't count against any resource-manager kad/dht stream limits (which could be a good thing?). Generally, I think a FIND_NODE request is a valid way to test whether a remote peer speaks the protocol. A FIND_NODE request uses a few more CPU cycles than a plain PING request (the remote peer needs to determine the closest peers it knows to a specific key) 🤷

Agreed that a dedicated DHT PING RPC would make sense, because it would use fewer CPU cycles, could integrate with a resource manager, and could be used for back pressure: e.g., if a node is very busy, answering the PING isn't its first priority, so the PING response will be delayed, and the remote peer that sent the PING request may drop the node from its routing table.

The only benefit of using a FIND_NODE RPC to perform a PING is that it prevents misconfigured nodes from being added to other peers' routing tables. If a node correctly implements the PING response but not the FIND_NODE response, it will propagate into other peers' routing tables even though it is useless to the network.

Overall, I think we should try to optimize for performance (CPU cycles), so we could reintroduce the DHT PING request.

PS: reading through the commit histories, I couldn't figure out a strong reason why PING was deprecated.