ipfs / boxo

A set of reference libraries for building IPFS applications and implementations in Go.
https://github.com/ipfs/boxo#readme
Other
184 stars 83 forks source link

blockstore: switch from Cid to Multihash #361

Open kevina opened 5 years ago

kevina commented 5 years ago

We should change the Blockstore interface once it is updated to store blocks my Multihash and not the full Cid. One way to go is to remove the notation of Cids from a block, however the Cid is very useful in that context and doing so will likely break things. Instead I propose we change the types of two method (1) AllKeysChan will return multhashes instead of Cid, this is basically required because we no longer have the necessary information to reconstruct the Cid, (2) DeleteBlock will take a multihash, this is not necessary required but it is more honest, because if we delete a block from a Cid we are not just deleting that Cid, but all Cids with that multihash.

The proposed interface is thus: Note: See https://github.com/ipfs/boxo/issues/361 for the updated interface

type Blockstore interface {
    DeleteBlock(mh.Multihash) error
    Has(*cid.Cid) (bool, error)
    Get(*cid.Cid) (blocks.Block, error)

    // Put puts a given block to the underlying datastore
    Put(blocks.Block) error

    // PutMany puts a slice of blocks at the same time using batching
    // capabilities of the underlying datastore whenever possible.
    PutMany([]blocks.Block) error

    // AllKeysChan returns a channel from which
    // the CIDs in the Blockstore can be read. It should respect
    // the given context, closing the channel if it becomes Done.
    AllKeysChan(ctx context.Context) (<-chan mh.Multihash, error)

    // HashOnRead specifies if every read block should be
    // rehashed to make sure it matches its CID.
    HashOnRead(enabled bool)
}

Note that since a Cid is part of a block the Get needs the full Cid to reconstruct the block. I have miked feeling about the Has method. I was thinking that a "smart" blockstore could use the extra information, but I can't think of a use-case (note that the idstore works at the multihash level so it doesn't need to full Cid). One argument is it should take the full Cid so it has the same parameter as the Get method.

One question: Why is DeleteBlock named that way? Is there a reason it is not just Delete? If this is purely for historical reasons, I propose we rename this to just Delete since we are changing the API anyway.

As it may sometime be useful to retrieve (or even add) a block at the multihash level, I also propose we introduce a new Codec to mean "unknown" to the Cid standard, for situations when all we have is the multihash but we need a full Cid. Trying to do anything with the data of a block with an "unknown" codec would result in an error. The "raw" codec could be used but it has a slightly different meaning, in particular it means that the data does not have any structure, rather than the structure being unknown. One use for this codec is for the output of ipfs refs local. We could just display the multihash, but that can easily be confused with CidV0. If instead we convert them to CidV1 with a "unknown" codec there will be no confusion and this will also allow us to display in different bases.

Thoughts?

Stebalien commented 5 years ago

Why is DeleteBlock named that way? Is there a reason it is not just Delete? If this is purely for historical reasons, I propose we rename this to just Delete since we are changing the API anyway.

Doesn't appear to be any good reason in the git log.

As it may sometime be useful to retrieve (or even add) a block at the multihash level, I also propose we introduce a new Codec to mean "unknown" to the Cid standard, for situations when all we have is the multihash but we need a full Cid.

I really would just use the raw multicodec. From a type-theory standpoint, it's effectively the "top" type. Regardless of what we do, we'll break userspace here.

kevina commented 5 years ago

I really would just use the raw multicodec.

My changing the Cid to a Multicodec in a block will will be losing information. This will require a lot of careful thought so everything still works. I am against doing it at this stage.

kevina commented 5 years ago

After thinking about it a while here is what I think the interface should look like now:

type Blockstore interface {
    Delete(mh.Multihash) error
    Has(mh.Multihash) (bool, error)
    Get(*cid.Cid) (blocks.Block, error)

    // GetSize returns the Multihash's mapped BlockSize
    GetSize(mh.Multihash) (int, error)

    // Put puts a given block to the underlying datastore
    Put(blocks.Block) error

    // PutMany puts a slice of blocks at the same time using batching
    // capabilities of the underlying datastore whenever possible.
    PutMany([]blocks.Block) error

    // AllKeysChan returns a channel from which
    // the CIDs in the Blockstore can be read. It should respect
    // the given context, closing the channel if it becomes Done.
    AllKeysChan(ctx context.Context) (<-chan mh.Multihash, error)

    // HashOnRead specifies if every read block should be
    // rehashed to make sure it matches its CID.
    HashOnRead(enabled bool)
}

I am not necessary against change Get() to accept a Multihash, but that will involve changing the Cid's in Blocks to Multihashs. This seams like more of a breaking change that can come later but I am not against attempting to push it though.

We will have a presentation problem when presenting raw multihashes to the user as they stand now they are indistinguishable from CIDv0. I see two solutions to this problem.

(1) Introduce a new Codec to mean "unknown" to the Cid standard as outlined in the description. When we need to display a raw Multihash it will be converted to a Cid with the "unknown" condec.

(2) Since we decided that CidV0 will not get a multibase prefix. Define a multihash with a multibase prefix as a raw multihash and never a CidV0. All multihashes will then be displayed with a multibase prefix when converted to a string.

Personally I like (1) better as it also solves the problem of how to handle the case when we have a raw multihash but need a full CID. We can eliminate the need as much as possible, but I still think it will come up from time to time based on the pervasiveness of CIDs.

kevina commented 5 years ago

@whyrusleeping @Stebalien the go-blockservice README says (https://github.com/ipfs/go-blockservice):

The interfaces here really would like to be merged with the blockstore interfaces. The 'dagservice' constructor currently takes a blockservice, but it would be really nice if it could just take a blockstore, and have this package implement a blockstore. What is the background on that and will the proposed interface above interfere with that goal?

Note the if I understand correctly we want to continue to make network requests using CIDs, thus the blockservice will continue to need to work with CIDs and can not switch to using Multihashes. Thus, my proposed changes to the blockstore might interfear with the interface merger. Or am I missing something?

Stebalien commented 5 years ago

I really would just use the raw multicodec.

My changing the Cid to a Multicodec (multihahs?) in a block will will be losing information. This will require a lot of careful thought so everything still works. I am against doing it at this stage.

My point was, instead of introducing an unknown multicodec, just use raw. The issue here is the ipfs refs command (and friends). Those need to return CIDs, unfortunately.

Note the if I understand correctly we want to continue to make network requests using CIDs, thus the blockservice will continue to need to work with CIDs and can not switch to using Multihashes. Thus, my proposed changes to the blockstore might interfear with the interface merger. Or am I missing something?

That's something @whyrusleeping would like but can wait. Basically, he just wants to get rid of the blockservice.

kevina commented 5 years ago

My point was, instead of introducing an unknown multicodec, just use raw. The issue here is the ipfs refs command (and friends). Those need to return CIDs, unfortunately.

Yes we can do that but as I stated above I don't see that as correct. By using a new Cid type we are making it clear that we don't know the Cid type. The only command this will effect is ipfs refs local. As long as we are following a dag will should have the full Cid.

That's something @whyrusleeping would like but can wait. Basically, he just wants to get rid of the blockservice.

Okay then. I won't consider it a blocker.

Stebalien commented 5 years ago

Yes we can do that but as I stated above I don't see that as correct.

It is correct. Every piece of data is a valid raw block (it's the Any, interface{}, Object, what have you, type).

kevina commented 5 years ago

It is correct. Every piece of data is a valid raw block (it's the Any, interface{}, Object, what have you, type).

Maybe it is correct, but I still think Raw and Unknown convey different pieces of information. Thus, I still think an Unknown codec will be useful but it is not important for this P.R. and is not a blocker on anything so we can revisit the issue later.

Stebalien commented 5 years ago

Ideally, we'll eventually deprecate the entire API in favor of something that works with raw multihashes.

Stebalien commented 5 years ago

Note: The multicodec on a CID is not a type. It tells you how to interpret a binary blob as structured data. From the perspective of IPLD, "unknown" is pretty much useless.

You are right, we could have a sentinel "invalid" type but I'd rather just produce useful CIDs.

kevina commented 5 years ago

Note: The multicodec on a CID is not a type. It tells you how to interpret a binary blob as structured data. From the perspective of IPLD, "unknown" is pretty much useless.

That is exactly the idea. It is suppose to be useless as we don't know the original codec as it was created from a raw multihash. Doing anything with it other than passing it around will result in an error. It is meant only for output of command like ipfs refs local.

lidel commented 2 years ago

I believe this can be closed: https://github.com/ipfs/go-ipfs-blockstore/pull/38 shipped and depending on context, we switched to multihash (datastore keys) or CIDv1 with raw codec (refs local output)

aschmahmann commented 2 years ago

Reopening for now for tracking, but we'll likely want to create a new issue. Related to ipfs/boxo#362