regen-network / regen-ledger

:seedling: Blockchain for planetary regeneration
https://docs.regen.network

Data Module #108

Closed aaronc closed 3 years ago

aaronc commented 4 years ago

Summary


This proposes a module for tracking data related to ecological claims, where the data itself may live both on- and off-chain.

Problem Definition

The data module aims to satisfy the following use cases:

Can you say any more about this, @clevinson?

Proposal

Anchoring Data

Anchoring data refers to storing the hash of a piece of data on-chain, which allows for proof of existence of that data at some block height. This effectively creates a “secure timestamp” for the data, proving that the data existed no later than the block in which the hash was included.

We propose a simple message, MsgAnchorData, for anchoring data on-chain, which uses the IPFS CID Specification as the expected format for data hashes:

message MsgAnchorData {
  // sender is the address of the party submitting the transaction
  bytes sender = 1;
  // cid is a binary IPFS CID identifier without a multibase prefix
  bytes cid = 2; 
}
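
As a rough sketch of how the handler for this message could work (the Keeper, store key, and key prefix below are hypothetical and not the actual implementation), anchoring only needs to record the first block height seen for a given CID:

package data

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
)

// Keeper is a hypothetical keeper holding the data module's store key.
type Keeper struct {
	storeKey sdk.StoreKey
}

// AnchorData records the height of the block in which a CID was first
// anchored. Key layout and method names are illustrative only.
func (k Keeper) AnchorData(ctx sdk.Context, cid []byte) error {
	store := ctx.KVStore(k.storeKey)
	key := append([]byte("anchor/"), cid...)
	if store.Has(key) {
		// already anchored; the earliest timestamp wins
		return nil
	}
	store.Set(key, sdk.Uint64ToBigEndian(uint64(ctx.BlockHeight())))
	return nil
}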

Signing Data

Signing data refers to creating a claim that the signer of the piece of data attests to its veracity. What veracity means may depend somewhat on the context of the document being signed, but for simplicity we draw an analogy to a legal document. If a party puts their signature on a legal document, it's pretty clear from the contents of the document what the signature implies.

We propose a simple message MsgSignData for attaching on-chain signatures to data:

message MsgSignData {
  // signers are the addresses of the parties signing the document
  repeated bytes signers = 1;
  // cid is the binary IPFS CID identifier of the document
  bytes cid = 2;
}
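
Continuing the hypothetical Keeper from the sketch above, signatures could be recorded as one entry per (CID, signer) pair, which naturally supports multiple signers per document:

// SignData records an on-chain signature entry for each signer over a CID,
// valued with the block height at which the signature was made.
// Illustrative only; continues the hypothetical Keeper above.
func (k Keeper) SignData(ctx sdk.Context, cid []byte, signers []sdk.AccAddress) {
	store := ctx.KVStore(k.storeKey)
	for _, signer := range signers {
		// one entry per (CID, signer) pair, so multiple signers are supported
		key := append(append([]byte("signature/"), cid...), signer.Bytes()...)
		store.Set(key, sdk.Uint64ToBigEndian(uint64(ctx.BlockHeight())))
	}
}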

A few notes:

Storing Data

If desired, we can also store data directly on the blockchain. Off-chain storage will generally be cheaper, but on-chain data storage provides the following benefits:

We propose a simple message MsgStoreData for storing data on-chain:

message MsgStoreData {
  bytes sender = 1;
  // cid is the binary IPFS CID
  bytes cid = 2;
  // content must match the provided CID
  bytes content = 3;
}

The provided content bytes should be verified against the provided CID. While MsgAnchorData and MsgSignData should support any valid CID, MsgStoreData should support only an approved list of formats and hashes with gas priced appropriately for each supported hash. 
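
A sketch of that verification step, assuming something like the go-cid library is available in the state machine (a real handler would additionally check the CID's codec and hash function against the approved list and charge gas accordingly):

package data

import (
	"fmt"

	cid "github.com/ipfs/go-cid"
)

// verifyContent recomputes the hash of the submitted content using the
// CID's own prefix (codec + multihash function) and checks that the result
// matches the CID provided in MsgStoreData.
func verifyContent(cidBytes, content []byte) error {
	c, err := cid.Cast(cidBytes)
	if err != nil {
		return fmt.Errorf("invalid CID: %w", err)
	}
	recomputed, err := c.Prefix().Sum(content)
	if err != nil {
		return fmt.Errorf("hashing content: %w", err)
	}
	if !recomputed.Equals(c) {
		return fmt.Errorf("content does not match the provided CID")
	}
	return nil
}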

Future Improvements

Off-chain Data URL Index

One simple improvement to the above design is to allow URLs to be stored for any CID. This effectively creates an index of known off-chain locations where the content behind a given CID can be retrieved.
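
A minimal sketch of how such an index might be stored, again reusing the hypothetical Keeper from above (the message shape and key layout are not part of this proposal):

// RegisterDataURL records an off-chain retrieval URL for a CID. One entry
// per (CID, URL) pair allows multiple locations to be registered.
// Illustrative only; continues the hypothetical Keeper above.
func (k Keeper) RegisterDataURL(ctx sdk.Context, cid []byte, url string) {
	store := ctx.KVStore(k.storeKey)
	key := append(append([]byte("url/"), cid...), []byte(url)...)
	store.Set(key, []byte{1}) // the value is a placeholder; presence is what matters
}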

Partial Data Storage

Using a merkle-tree based canonicalization/hashing algorithm as described in #64, we could allow for part of a piece of data to be stored on-chain using merkle proofs. This would allow for:
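
Purely to illustrate the mechanism (the actual canonicalization and hash construction would come from #64), proving that a disclosed fragment belongs to an anchored document reduces to checking a Merkle branch against the stored root hash:

package data

import (
	"bytes"
	"crypto/sha256"
)

// verifyMerkleBranch hashes a disclosed leaf and folds in the supplied
// sibling hashes up to the root, checking the result against an anchored
// root hash. SHA-256 and the left/right flags are illustrative choices.
func verifyMerkleBranch(root, leaf []byte, siblings [][]byte, leafOnRight []bool) bool {
	h := sha256.Sum256(leaf)
	node := h[:]
	for i, sibling := range siblings {
		var pair []byte
		if leafOnRight[i] {
			pair = append(append(pair, sibling...), node...)
		} else {
			pair = append(append(pair, node...), sibling...)
		}
		h = sha256.Sum256(pair)
		node = h[:]
	}
	return bytes.Equal(node, root)
}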

Secondary indexes

We could allow certain data properties within well-defined document structures to be automatically indexed on-chain so that they are searchable in smart contracts. For instance, we could define a property representing a geographic polygon, and whenever a piece of data containing that property is stored, it would be indexed in an on-chain geospatial data store. This requires more in-depth research. See regen-network/regen-ledger#87.

Schema Validation

On-chain data could be validated against some schemas for conformity. This may or may not be the responsibility of on-chain consensus and would depend heavily on the use case. Schema validation, if it were to be incorporated, would likely rely on some on-chain schema registry, which is out of the scope of this proposal.

aaronc commented 4 years ago

Issues in this epic:

Create merkle-tree based RDF canonicalization spec #64 (Milestone: N/A, Assignees: N/A, State: Open)
clevinson commented 4 years ago

As far as the problem definition goes, I'd like to flesh out two interrelated use cases that will be good to orient this work around:

Regen Registry & Credit Module Implementation

As part of our issuing of CarbonPlus Grasslands credits, the credit module keeps its use of metadata quite minimal (see the credit RFC). This metadata is meant to link to claims about ecological data. While the credit RFC in its current form only requires arbitrary metadata blobs that may link to off-chain data, we should also consider the use case of a credit's metadata linking to a previous claim (or baseline monitoring assessment) that is also represented as an on-chain asset. In this case, it would be good to have an easy way for the credit module to point to a dataset that was stored or signed using the operations described here.

OpenTEAM SurveyStack Digital Signatures Work Package

Some of the first technical partners that we expect to make use of the data anchoring, storage, and signing capabilities are from the OpenTEAM ecosystem.

This year we had an approved paired work session with SurveyStack (a generic survey form application) as an initial partner for adding digital signature capabilities to tools in the OpenTEAM ecosystem. This is currently a proposed item in SurveyStack's "Year 2 Research Farm Roadmap" proposal (see the Blockchain / Ledger integration section).

This work package connects quite closely with the previous Regen Registry use case, as the end use case for this SurveyStack integration is the monitoring surveys being completed by Regen's science team as part of issuing a CarbonPlus Grasslands credit.

clevinson commented 4 years ago

In response to the actual proposal, I think this is a great starting place. Treating anchoring, signing, and storing data as separate operations makes sense to me, and by using the same CID standard they can actually connect to each other.

Two questions I have, though:

  • What kind of querying functionality would we like to support?

    • The ability to index by CID? (Give me all signatures and/or content corresponding to some CID?)

  • Will there be any deduplication efforts to prevent storing multiple anchorings of the same dataset? What about multiple sign records and/or overlapping groups of signers? How will these be resolved when querying?

aaronc commented 4 years ago

  • What kind of querying functionality would we like to support?

    • The ability to index by CID? (Give me all signatures and/or content corresponding to some CID?)

Yes (see the query sketch at the end of this comment).

  • Will there be any deduplication efforts to prevent storing multiple anchorings of the same dataset? What about multiple sign records and/or overlapping groups of signers? How will these be resolved when querying?

Yes, deduplication of data, and yes, multiple signers.

  • Even at this initial step, it would be really great to have some optional parameters that provide GeoJSON / polygon support. Is that something that can be easily added to this scope? I think that even the most basic structured polygon data alongside CIDs would be a huge win and make early adopters a lot more excited to experiment with the module.

Content can include polygon data if the submitter chooses. I'm resistant to adding some geo capability on top of this just for its own sake. I believe we will eventually want to specify more structured data, ideally with a Merkle-tree-based CID hash (#64) that includes some geo polygon spec. It's still only really relevant on-chain if it's indexed, IMHO.
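
For the query-by-CID functionality mentioned above, a lookup could be as simple as reading back the anchor entry from the hypothetical key layout sketched earlier:

// GetAnchorHeight returns the block height at which a CID was anchored, if
// any, using the key layout from the earlier hypothetical AnchorData sketch.
func (k Keeper) GetAnchorHeight(ctx sdk.Context, cid []byte) (uint64, bool) {
	store := ctx.KVStore(k.storeKey)
	bz := store.Get(append([]byte("anchor/"), cid...))
	if bz == nil {
		return 0, false
	}
	return sdk.BigEndianToUint64(bz), true
}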

robert-zaremba commented 4 years ago

Thoughts

The idea of using a blockchain for timestamping data is one of the canonical use cases. It has been done on Bitcoin, Ethereum, and even Ripple, and there is also a lot of academic research (e.g. the Stampery Blockchain Timestamping Architecture).

  1. It seems that x/data is duplicating the blockchain itself (block data): we copy the data from transactions and wrap it in more meaningful structures without performing additional operations on it.
  2. We claim that this data won't be manipulated. TBH, I'm not sure this is easy to guarantee without a code audit and a proper module authorization mechanism (which we have started to address in the SDK).

If we don't perform any operations on the data, then a more natural approach would be to wrap the transaction data in events and use an off-chain DB for indexing. The verification process would be as follows: query the index for specific data, which yields a transaction_id or an event; then use it to query Tendermint to validate that the event/transaction indeed happened at a particular time. This would solve the issues above and keep the blockchain slim.
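
A sketch of this alternative using the Cosmos SDK event manager (the event type and attribute key are placeholders, not an agreed spec); an off-chain indexer could then pick up these events, e.g. via Tendermint's tx_search, and maintain the CID index:

package data

import (
	"encoding/base64"

	sdk "github.com/cosmos/cosmos-sdk/types"
)

// emitAnchorEvent emits an event carrying the CID instead of writing it to
// module state; Tendermint's block and transaction history then serves as
// the timestamp proof, and indexing happens off-chain.
func emitAnchorEvent(ctx sdk.Context, cid []byte) {
	ctx.EventManager().EmitEvent(
		sdk.NewEvent(
			"anchor_data", // event type; placeholder name
			sdk.NewAttribute("cid", base64.StdEncoding.EncodeToString(cid)),
		),
	)
}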

aaronc commented 4 years ago

That's true @robert-zaremba. I guess one complication is that we don't have access to the Tendermint block history in the SDK state. So if we wanted to use the data or signers in a smart contract we couldn't... But yes, it is duplication.

robert-zaremba commented 4 years ago

Wouldn't it be better to allow smart contracts to select events? A smart contract method argument would require an event ID.

clevinson commented 3 years ago

Closing this, as the basic implementation was completed in #118 and #124.