qri-io / rfcs

Request For Comments (RFCs) documenting changes to Qri
MIT License

RFC Request: Fragments #19

Open dustmop opened 6 years ago

dustmop commented 6 years ago

Fragments

Several features under discussion, some of which may arrive soon, share a need for extra data that lives outside the dataset body and existing meta. This document proposes the name "fragment" for this new supporting data and shows some use cases that need it.

Today, datasets consist of numerous pieces of data, including:

Importantly, these all exist as files within the dataset, stored in its content-addressed file system (cafs), so changing anything about these pieces changes the dataset hash.

Fragments

Fragments are a new kind of data that relate to datasets but exist outside of them. Every fragment links to the hash of its primary dataset. They are intended to solve a number of disparate use cases:

These problems all require some external structure to solve, so they cannot be implemented with existing facilities. Fragments need to be external because they record information about a dataset that already exists and must refer to that dataset by its hash. Writing the same information inside the dataset would change its content-addressed file system (cafs) hash, and the reference would no longer point at the dataset being described.

Example implementation

import "time"

type Fragment struct {
    DatasetPath string    // profile_id/network/version_id
    Created     time.Time // populated when this Fragment is created
    // ... additional data, or this struct can be embedded in another type
}

Fragments are stored on the distributed web, using a content-addressable file system, such as IPFS, just like Datasets are.

Small Updates

Currently, modifying a Qri dataset using update makes a complete copy of the dataset body. This is acceptable when changes are infrequent or touch a lot of data at once, but it quickly becomes untenable once a dataset receives a large volume of frequent, tiny updates.

Fragments can solve this by declaring a small "append" entry that links to the existing dataset or to the most recent small-update fragment. After a certain period of time, or after a certain number of appends, these fragments could be compacted together to form a completely new dataset version.
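The append-then-compact idea could be sketched roughly as follows. `Row`, `AppendFragment`, and `Compact` are hypothetical names invented for this example, the body is modeled as an in-memory slice of rows, and the `Prev` values are placeholder strings rather than real hashes.

```go
package main

import "fmt"

// Row is a stand-in for one entry of a dataset body (assumption).
type Row []string

// AppendFragment is a hypothetical small-update fragment. Prev holds the hash
// of the dataset version, or of the previous append fragment, that it extends.
type AppendFragment struct {
	Prev string
	Rows []Row
}

// Compact applies the append fragments, oldest first, to a copy of the base
// body and returns the body of a new, complete dataset version.
func Compact(base []Row, frags []AppendFragment) []Row {
	out := make([]Row, len(base))
	copy(out, base)
	for _, f := range frags {
		out = append(out, f.Rows...)
	}
	return out
}

func main() {
	base := []Row{{"a", "1"}, {"b", "2"}}
	frags := []AppendFragment{
		{Prev: "hash-of-base", Rows: []Row{{"c", "3"}}},
		{Prev: "hash-of-first-append", Rows: []Row{{"d", "4"}}},
	}
	body := Compact(base, frags)
	fmt.Println(len(body), body[3][0]) // prints "4 d"
}
```

Each append only stores its new rows, so the cost of an update is proportional to the size of the change, and the full copy is paid once at compaction time.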

Random access indexes

Assume a fairly large dataset from which a user wants to retrieve only a single row near the end of the body. Today, the only way to do so is to read the dataset from the start, parse it until the desired row is found, and then return that result. This is inefficient, and it gets worse as datasets grow.

Instead, we could use fragments that link to the original dataset, and include an index mapping from entry numbers to byte offsets within the body. This would greatly speed up accesses to arbitrary positions within the dataset.

The same concept could create column-oriented indexes, sorted full search indexes, or indexes with different views.

Private Data

Private data is not yet supported in Qri. Once it is implemented, users will want the benefits of Qri's provenance, which ensures data is valid and tamper-free, without needing to push their data onto the distributed web.

Fragments can help here by keeping the private data on some internal network or machine storage, calculating the hash of that data, and then building a Fragment pointing to that private hash. Once the Fragment is on the distributed web, provenance is achieved, verifying that the private data exists somewhere, but only the hash is visible, not the data itself.