multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License

Add likecoin-iscn codec #200

Open Aludirk opened 3 years ago

Aludirk commented 3 years ago

We are going to implement the International Standard Content Number Specification, a global identifier for digital content, using IPLD as our linked data structure.

The IPLD plugin for this implementation is here.

rvagg commented 3 years ago

Thanks @Aludirk, could you help me with understanding this just a little bit more please? What is the intended use-case for the codes? I see you've used a new tag in the column, so is it right to assume these are not "codecs" in the IPLD sense? Or are these referencing binary blob formats that get encoded and decoded in some unique way (each one having unique rules)?

Aludirk commented 3 years ago

Thanks for the reply, @rvagg. We are working on a blockchain project called LikeCoin-chain for content monetization, attribution and distribution. The International Standard Content Number Specification (ISCN) is a general digital content registration schema with the following data model:

The ISCN is a kind of ISBN for digital content, and it suits us as the schema for digital content metadata. We also want the public to be able to access the metadata inside the blockchain easily (getting data from the chain directly requires a technical background, so it is not user friendly). Using IPFS as the distribution method makes it easy for users to find digital content metadata by typing a single ipfs command. Therefore, we translate our chain data to IPLD and pin it every time digital content metadata is added or updated, so that users can access it.

We are using eight codecs because we want to mirror the ISCN data model mentioned above. Each codec corresponds to a specific ISCN schema with its own properties and validation rules.

vmx commented 3 years ago

From a quick look at the source code, it looks like you're encoding your data as DAG-CBOR (please note that I have almost no Go knowledge). Is the idea that you want the type you've encoded as part of the CID?

Aludirk commented 3 years ago

Yes, every schema is encoded with the canonical CBOR provided by IPFS, and the appropriate decoder is chosen based on the codec extracted from the CID.
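For readers unfamiliar with how a codec is carried in a CID, here is a minimal sketch of reading it from a binary CIDv1, which is laid out as varint version, varint codec, then a multihash. The helper names (`encode_varint`, `decode_varint`, `make_cidv1`, `cid_codec`) are made up for this illustration and are not part of any IPFS library; only the codes 0x71 (dag-cbor) and 0x12 (sha2-256) come from the multicodec table.

```python
import hashlib
from typing import Tuple

# Codes from the multicodec table.
DAG_CBOR = 0x71
SHA2_256 = 0x12


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def make_cidv1(codec: int, payload: bytes) -> bytes:
    """Build a binary CIDv1: version, codec, then a sha2-256 multihash."""
    digest = hashlib.sha256(payload).digest()
    multihash = encode_varint(SHA2_256) + encode_varint(len(digest)) + digest
    return encode_varint(1) + encode_varint(codec) + multihash


def cid_codec(cid: bytes) -> int:
    """Return the codec code embedded in a binary CIDv1."""
    version, offset = decode_varint(cid)
    if version != 1:
        raise ValueError("only CIDv1 is handled in this sketch")
    codec, _ = decode_varint(cid, offset)
    return codec


cid = make_cidv1(DAG_CBOR, b"example block bytes")
print(hex(cid_codec(cid)))  # prints 0x71
```

The codec field only says which decoder turns the bytes into a data structure; it carries no information about what that structure means to the application.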

vmx commented 3 years ago

@Aludirk the codec identifier in the CID is meant as information on how to decode the data, not on how to interpret the data itself. So the CIDs for all your data would use the DAG-CBOR codec. Your application logic would then decode it and determine which type of data it is.

There are several ways of doing that. You could wrap your whole object in something like:

{
  "type": "one of your types",
  "data": {
    // The actual payload
  }
}

Alternatively, looking at your schema descriptions, they already seem to have a context field; you could use that one to differentiate.

Since I'm already talking about schemas, you might also be interested in having a look at IPLD Schemas, which could help with differentiating the data as well: https://specs.ipld.io/schemas/
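The wrapper approach above could be sketched roughly as follows: decode the block with one generic codec, then let application logic pick the validation rules from the "type" field. This is only an illustration. The type names and required fields are invented and not taken from the ISCN specification, and `json` stands in for a DAG-CBOR decoder so the sketch needs nothing beyond the standard library.

```python
import json
from typing import Tuple

# Hypothetical per-type rules; names and fields are illustrative only.
REQUIRED_FIELDS = {
    "entity": {"name"},
    "content": {"title", "fingerprint"},
}


def interpret_block(block_bytes: bytes) -> Tuple[str, dict]:
    """Decode one block, then choose validation rules from its "type"."""
    # json is a stand-in for a DAG-CBOR decoder in this sketch.
    envelope = json.loads(block_bytes)
    kind = envelope["type"]
    data = envelope["data"]
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown type: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - set(data)
    if missing:
        raise ValueError(f"{kind} is missing fields: {sorted(missing)}")
    return kind, data


block = json.dumps({"type": "entity", "data": {"name": "Alice"}}).encode()
print(interpret_block(block))  # prints ('entity', {'name': 'Alice'})
```

With this shape, every block shares one codec in its CID, and adding a ninth type is a change to the application's rule table rather than to the multicodec table.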

Aludirk commented 3 years ago

OK, I got it. For the data structure, I can move the type down a layer and use only one codec.

However, I still can't use DAG-CBOR directly as the codec, because our data is actually stored inside a CosmosSDK-based public blockchain called LikeCoin chain. When anyone asks for content by CID, our daemon has a datastore plugin that finds the concrete data inside the chain and answers the request. If the CID carries no clue that lets the plugin decide whether it should try to get the data from the chain, the plugin will not work. And for performance reasons, I cannot let the datastore plugin access the chain for every DAG-CBOR CID.

Therefore, I suggest that I reserve only one codec for the ISCN IPLD, so that our datastore plugin can work as expected.

I can make a new commit if this suggestion is ok.

aschmahmann commented 3 years ago

@Aludirk sorry for the large info dump below, but I figured it might be helpful to put out some thoughts + ideas on your go-ipfs integration strategy. There are office hours later today listed on the community calendar if you'd like to talk a bit in person; of course, talking here or on Matrix/IRC also works.

TLDR: I think there are a few ways to tackle this problem, I don't think this approach (adding a likecoin codec into CIDs) is ideal if we can avoid it but I'd like to know if any of the existing strategies might work.

High level thoughts on getting ISCN data via IPFS

At a high level IPFS wants to do content routing which means finding data based on what it is and not where it is. This scheme you've proposed attaches the location where content should be looked up (i.e. the likecoin blockchain) to the data itself by embedding it in the CID. For example, maybe some of your users want to fetch the data via Bitswap from each other so that they can resolve ISCN CIDs offline.

It sounds like what you're really looking for here is "routing hints", i.e. a way to specify "I'm looking for CID bafyabc..., try looking for it in locations A, B, or C since they're likely to have it". I've heard this talked about before across a number of issues (e.g. here), but I can't find a definitive one offhand (perhaps this means we need to make a new one, unless there's an issue I've missed, @Stebalien).

Overall though the issue of finding a CID via multiple possible systems really only has two solutions: 1) Know/guess that a given CID is likely to live in some external system (i.e. content hint)

Questions/Thoughts about how you're planning on building this and utilizing go-ipfs

I'd really like to have something that handles option 1 nicely, but that might take time to design, get built and make its way into systems such as IPFS and I wouldn't want your team to be held up by that. Therefore, my thoughts + questions below are to explore whether option 2 could work for you and poke a little bit more at your setup.

The way go-ipfs normally works is that when searching for data it will (in order):

  1. Check if it already has it in its datastore
  2. Ask peers it's connected to via Bitswap if they have the data
  3. Ask the public IPFS DHT who has the data
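The lookup order above can be sketched as an ordered fallback chain. This is a simplified illustration, not how go-ipfs is implemented: the three stages are dict-backed stand-ins, and `find_block` is a name made up for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Lookup = Callable[[str], Optional[bytes]]


def find_block(
    cid: str, sources: List[Tuple[str, Lookup]]
) -> Optional[Tuple[str, bytes]]:
    """Try each source in order; return (source_name, data) for the first hit."""
    for name, lookup in sources:
        data = lookup(cid)
        if data is not None:
            return name, data
    return None


# Dict-backed stand-ins for the three stages; real lookups are network calls.
local_datastore: Dict[str, bytes] = {"cid-a": b"local"}
bitswap_peers: Dict[str, bytes] = {"cid-b": b"from peer"}
dht_providers: Dict[str, bytes] = {"cid-c": b"from dht provider"}

sources = [
    ("datastore", local_datastore.get),
    ("bitswap", bitswap_peers.get),
    ("dht", dht_providers.get),
]

print(find_block("cid-b", sources))  # prints ('bitswap', b'from peer')
```

A component that answers at the first stage for every request short-circuits the later stages, which is the concern raised in the next paragraph.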

By inserting yourself at the datastore level you're avoiding steps 2 + 3 which could also help you find the data. What you might want to do instead is something like one of:

Do you mind going into this a little bit more:

If the CID carries no clue that lets the plugin decide whether it should try to get the data from the chain, the plugin will not work. And for performance reasons, I cannot let the datastore plugin access the chain for every DAG-CBOR CID.

Both your approach and option A above require making extra calls to the blockchain node. For example, every time a go-ipfs node tries to look for a non-likecoin CID it can't find anywhere else it'll ask the blockchain node where to find it. Is asking the blockchain node "do you have CID bafyabc?" very expensive and is option B similarly too expensive/implausible?

Implementation notes on likecoinds

If I understand your setup in https://github.com/likecoin/likecoin-ipfs-cosmosds correctly, then you're doing a GET on the blockchain any time the data is asked for instead of checking whether you already have it stored locally. You're also using LevelDB instead of the recommended BadgerDB or FlatFS datastores; you could probably just wrap one of those datastores instead of copying them internally.

nnkken commented 3 years ago

Hi, this is Chung, one of the developers of ISCN.

To be honest, we are not very familiar with the mechanisms of IPFS, so I think it is better to write down our idea here to see whether we have got anything wrong.

Our idea is to have blockchain nodes run a go-ipfs process / thread with the datastore plugin installed. When the blockchain node receives transactions that add new ISCN data, it notifies the go-ipfs process / thread to pin the associated CID (not implemented yet). Then, through IPFS mechanisms (I only know about the DHT, but I think this part is handled by the nodes internally, so after calling the pin API I should not need to care?), other IPFS nodes on the IPFS network will be able to learn that these CIDs can be retrieved from this IPFS process / thread.

When some node wants to retrieve the CID, the IPFS process / thread calls the datastore plugin. The current design of the datastore plugin is to first determine whether the CID is for ISCN data and, if not, proxy the call to other datastore plugins. Currently we have copied the leveldb datastore as a proof of concept; we are going to make it a wrapper around other existing plugins, just as you suggested. If the CID is for ISCN data, the plugin proxies the call to the blockchain node through RPC, for which we have implemented get, get-size and has on the blockchain side.
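A rough sketch of that design, under stated assumptions: the wrapper consults a predicate to decide whether a CID is for ISCN data, forwards those calls to the chain, and delegates everything else to an inner datastore. `RoutingDatastore`, `FakeChainClient` and the string-prefix predicate are all hypothetical stand-ins; the real plugin is written in Go against the go-ipfs datastore interface and talks to the chain over RPC.

```python
from typing import Callable, Dict, Optional


class FakeChainClient:
    """Stand-in for the RPC client to the blockchain node (hypothetical)."""

    def __init__(self, records: Dict[str, bytes]) -> None:
        self.records = records
        self.calls = 0  # counts how often the chain is queried

    def get(self, cid: str) -> Optional[bytes]:
        self.calls += 1
        return self.records.get(cid)


class RoutingDatastore:
    """Send ISCN CIDs to the chain; delegate all other CIDs to the inner store."""

    def __init__(
        self,
        inner: Dict[str, bytes],
        chain: FakeChainClient,
        is_iscn: Callable[[str], bool],
    ) -> None:
        self.inner = inner
        self.chain = chain
        self.is_iscn = is_iscn

    def get(self, cid: str) -> Optional[bytes]:
        if self.is_iscn(cid):
            return self.chain.get(cid)
        return self.inner.get(cid)

    def has(self, cid: str) -> bool:
        return self.get(cid) is not None

    def get_size(self, cid: str) -> Optional[int]:
        data = self.get(cid)
        return None if data is None else len(data)


chain = FakeChainClient({"iscn-1": b"metadata"})
store = RoutingDatastore(
    inner={"other-1": b"xyz"},
    chain=chain,
    # Placeholder predicate; the thread discusses using the CID codec instead.
    is_iscn=lambda cid: cid.startswith("iscn-"),
)

print(store.get("iscn-1"), store.get_size("other-1"), chain.calls)
# prints b'metadata' 3 1
```

Only the first lookup touches the chain in this example, which is the property the predicate is meant to provide.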

The above is the design based on our understanding. I hope we haven't made any huge design flaws because of a misunderstanding.

Back to the IPLD codec. I agree that we should not occupy a codec type just for routing-hint purposes, and I think we would be able to work around this part (e.g. by storing a Bloom filter of existing CIDs in the datastore plugin) if we don't have the codec type.
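As an illustration of the Bloom filter workaround, here is a minimal sketch. The sizes, the SHA-256-based hashing scheme and the `BloomFilter` class are assumptions made for this example, not a description of the actual plugin. A negative answer is definitive, so the plugin could skip the chain call; a positive answer only means "maybe", so the chain still has to be asked.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


known_cids = BloomFilter()
known_cids.add(b"bafy-example-iscn-cid")  # placeholder CID bytes

# Only query the chain when the filter says the CID might be there.
if known_cids.might_contain(b"bafy-example-iscn-cid"):
    print("ask the blockchain node")
```

The filter would have to be updated whenever the chain records new ISCN data, and its size chosen for the expected number of CIDs and the acceptable false-positive rate.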

But I would like to know: is the codec type simply for serialization and deserialization? That is, if we have a new data type (ISCN) whose binary storage format is JSON / CBOR / another existing codec, but to which we have added more semantics and verification, may we have a new codec type for this new data type?

Also, I would like an estimate of how often a typical IPFS node on the public network receives queries that the datastore plugin needs to handle. My impression from the proof of concept (from the console log) is around 1 per second, but I'm not sure whether this is typical.