hashgraph / guardian

The Guardian is an innovative open-source platform that streamlines the creation, management, and verification of digital environmental assets. It leverages a customizable Policy Workflow Engine and Web3 technology to ensure transparent and fraud-proof operations, making it a key tool for transforming sustainability practices and carbon markets.

Scalable MRV data storage and transformation provenance capabilities #2907

Open anvabr opened 10 months ago

anvabr commented 10 months ago

Problem description

At a very high level, Guardian policy execution boils down to the following workflow (a code sketch follows the list):

  1. get some data (from sensors or humans) and publish it as a VC (in IPFS)
  2. do some transformations
  3. record the result in a VC doc, publish (in IPFS)
  4. get some more data
  5. combine with previous and do some more transformations
  6. record the result in a VC doc, publish (in IPFS)
  7. repeat the cycle 1-6 numerous times
  8. create a token (in Hedera)
  9. repeat the entire cycle 1-8 until END
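
A minimal sketch of this cycle in TypeScript. All of the function names below (`fetchMrvData`, `publishVcToIpfs`, `applyTransformation`, `mintToken`) are hypothetical placeholders for illustration only, not actual Guardian APIs:

```typescript
// Hypothetical sketch of the policy execution cycle described above.
// None of these functions are real Guardian APIs; they only show the shape of the workflow.

interface VcDocument {
  cid: string;      // IPFS content identifier of the published VC
  payload: unknown; // the signed credential body
}

declare function fetchMrvData(source: string): Promise<unknown>;         // steps 1, 4: sensor or human input
declare function publishVcToIpfs(payload: unknown): Promise<VcDocument>; // steps 1, 3, 6: wrap in a VC, pin to IPFS
declare function applyTransformation(inputs: unknown[]): unknown;        // steps 2, 5: policy-defined calculation
declare function mintToken(evidence: VcDocument[]): Promise<string>;     // step 8: create the token on Hedera

async function runPolicyCycle(sources: string[]): Promise<string> {
  const evidence: VcDocument[] = [];
  const accumulated: unknown[] = [];

  for (const source of sources) {                      // steps 1-7 repeated per data source
    const raw = await fetchMrvData(source);
    evidence.push(await publishVcToIpfs(raw));         // original data recorded as a VC

    accumulated.push(raw);
    const transformed = applyTransformation(accumulated);
    evidence.push(await publishVcToIpfs(transformed)); // each transformation result recorded as a VC
  }

  return mintToken(evidence);                          // step 8: token backed by the chain of VCs
}
```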

The underlying technologies that Guardian uses for storage are IPFS and Hedera Topics.

IPFS works very well for documents but is not very efficient for data, in particular data that undergoes many transformations, each of which must be verifiably performed and recorded.

Hedera Topics have content size limitations and do not have an efficient addressing system.
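
One way these two constraints combine in practice is to keep the large payload in IPFS and put only a small reference into the topic message. A sketch of that pattern, assuming `ipfs-http-client` and `@hashgraph/sdk` (an illustration of the constraint, not Guardian's own implementation):

```typescript
import { create } from "ipfs-http-client";
import { Client, TopicId, TopicMessageSubmitTransaction } from "@hashgraph/sdk";

async function publishWithReference(vcJson: string, topicId: string, client: Client) {
  // The large, content-addressed document goes to IPFS...
  const ipfs = create({ url: "http://localhost:5001" });
  const { cid } = await ipfs.add(vcJson);

  // ...while the topic message only carries a small reference, because a single
  // consensus message cannot hold an arbitrarily large payload.
  await new TopicMessageSubmitTransaction()
    .setTopicId(TopicId.fromString(topicId))
    .setMessage(JSON.stringify({ cid: cid.toString() }))
    .execute(client);

  return cid.toString();
}
```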

For many real-world use cases, the volume and complexity of calculations (and thus transformations) required on the original MRV data is such that fully automating such workflows with existing Guardian technology will likely be very challenging, if not impossible.

Requirements

Identify and integrate a distributed storage technology that allows Guardian to work with data at scale (similarly to how it would work with a relational database), while maintaining a full record of data provenance and guaranteeing that policy adherence remains verifiable for all data processing and transformations. A hypothetical interface for such an integration is sketched below.
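
The following TypeScript sketch is purely illustrative; the type and field names are assumptions, not an existing Guardian or third-party API. The point it makes is that every relational-style read/write would need to carry a verifiable provenance record alongside the data itself.

```typescript
// Hypothetical interface for the requested storage integration.

interface ProvenanceRecord {
  inputRefs: string[];      // references (e.g. CIDs or row ids) of the inputs used
  transformationId: string; // which policy-defined transformation was applied
  policyVersion: string;    // the policy version that authorized the transformation
  timestamp: string;        // when it was executed
  resultHash: string;       // hash of the produced data, anchored e.g. on Hedera
}

interface ProvenancedDataStore {
  // Relational-style access to MRV data.
  query<T>(sql: string, params?: unknown[]): Promise<T[]>;

  // Writes return a reference plus the provenance record that must be
  // published (e.g. to a Hedera topic) so the transformation stays verifiable.
  write(table: string, rows: unknown[], provenance: ProvenanceRecord): Promise<string>;
}
```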

Some relevant links:

Definition of done

Acceptance criteria

christiaanpauw commented 10 months ago

The traditional way to address step 5 (combine with previous data and do some more transformations) is a relational database, because it stores data efficiently and has sophisticated means of combining data (queries, joins, etc.). The ability to store data that changes at a high cadence is one part of the requirement; the other is the ability to query such data.

Example: every instrument that is deployed has a set of requirements determining the validity of the data from that instrument that is not a function of the data itself. Typical requirements are things like inspection and calibration frequency. The protocol may require that an instrument is inspected once every six months and calibrated once a year. A read from an instrument is then actually a query of the conditions of validity for the data (e.g. the calibration and inspection logs) as well as of the data itself: "Get all data from instruments that still have valid calibrations". If the query is deterministic, perhaps the query itself and a timestamp are enough to deliver on the promise of transparency and immutability ("I did this query at this time and got this result"). A sketch of that idea follows.
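
A minimal sketch of that idea in TypeScript: record the query text, execution time, and a hash of the result, which could then be published as evidence. `runQuery` is a hypothetical database call, and the SQL in the usage comment is only an illustration of the calibration example above:

```typescript
import { createHash } from "crypto";

declare function runQuery(sql: string, params: unknown[]): Promise<unknown[]>;

interface QueryAttestation {
  sql: string;
  params: unknown[];
  executedAt: string;
  resultHash: string; // what would be published (e.g. in a VC) as proof
}

async function attestedQuery(
  sql: string,
  params: unknown[]
): Promise<{ rows: unknown[]; attestation: QueryAttestation }> {
  const rows = await runQuery(sql, params);
  const resultHash = createHash("sha256").update(JSON.stringify(rows)).digest("hex");
  return {
    rows,
    attestation: { sql, params, executedAt: new Date().toISOString(), resultHash },
  };
}

// Usage, for "get all data from instruments that still have valid calibrations":
// const { rows, attestation } = await attestedQuery(
//   "SELECT r.* FROM readings r JOIN calibrations c ON c.instrument_id = r.instrument_id WHERE c.valid_until >= r.read_at",
//   []
// );
```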