
JRFC 33 - Repositories #33

Open jbenet opened 9 years ago

jbenet commented 9 years ago

This document is an attempt at specifying a generalized spec for repositories (the git and ipfs kind), in the hope of arriving at a generalized set of good practices. I am new to many of the intricacies and edge cases, so please suggest important additions.


Many tools and systems create data repositories with configuration files. The classic example is git and other VCS tools, but many other systems do the same. Application changes will necessarily bring changes to the format of the repository (e.g. changing how data is stored, or changing the data itself). These must NEVER cause data loss for users, and great care must be taken to ensure that every format change is accompanied by a migration tool.

As applications grow, different types of storage media or execution strategies may suit different use cases, e.g. "flat files inside .git for git cli" vs "git repo inside database for fast web server access". No matter the use case, application implementations should be able to operate with different concrete versions of the repository, provided suitable adaptors exist. This separation reduces the cost of writing both new storage implementations and new application implementations.
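To make the logical/concrete separation more tangible, here is a minimal sketch in Python. The `Repo` interface, the two adaptors (`MemoryRepo`, `FlatFileRepo`) and `count_bytes` are all hypothetical names made up for illustration; none of them come from git or ipfs. The point is only that application code is written against the logical interface, and each storage medium supplies an adaptor.

```python
import os
from typing import Dict, Iterator, Protocol


class Repo(Protocol):
    """Hypothetical logical-repo interface; not part of git or ipfs."""

    def get(self, key: str) -> bytes: ...
    def put(self, key: str, value: bytes) -> None: ...
    def keys(self) -> Iterator[str]: ...


class MemoryRepo:
    """Concrete repo kept in a dict (stands in for a database backend)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        return self._data[key]

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def keys(self) -> Iterator[str]:
        return iter(sorted(self._data))


class FlatFileRepo:
    """Concrete repo storing one file per key (keys must be plain file names)."""

    def __init__(self, root: str) -> None:
        self._root = root
        os.makedirs(root, exist_ok=True)

    def get(self, key: str) -> bytes:
        with open(os.path.join(self._root, key), "rb") as f:
            return f.read()

    def put(self, key: str, value: bytes) -> None:
        with open(os.path.join(self._root, key), "wb") as f:
            f.write(value)

    def keys(self) -> Iterator[str]:
        return iter(sorted(os.listdir(self._root)))


def count_bytes(repo: Repo) -> int:
    """Application code: depends only on the logical interface."""
    return sum(len(repo.get(k)) for k in repo.keys())
```

`count_bytes` never learns which adaptor it was handed, so a new storage backend only has to implement the three methods.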

Terms:

Operations on a repo may require synchronization (some repos may support concurrent modifications, and others require complete mutual exclusion). Repos which require mutual exclusion must support mechanisms to achieve it (e.g. .git/index.lock). These may be granular or coarse, but repo formats must define synchronization, so various implementations can ensure safe, concurrent access.
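As a rough illustration of the coarse end of that spectrum, here is a sketch of a whole-repo lock file in the spirit of .git/index.lock. The names `repo_lock`, `RepoLockedError` and the file name `repo.lock` are assumptions for this example, not anything git or ipfs defines. It relies on exclusive file creation being atomic on the local filesystem.

```python
import os
from contextlib import contextmanager


class RepoLockedError(RuntimeError):
    """Raised when the lock file already exists."""


@contextmanager
def repo_lock(repo_path, lock_name="repo.lock"):
    """Hold an exclusive, repo-wide lock for the duration of a `with` block."""
    lock_path = os.path.join(repo_path, lock_name)
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RepoLockedError(f"repo is locked: {lock_path}") from None
    try:
        # Record the holder's pid so a human can inspect a stuck lock.
        os.write(fd, str(os.getpid()).encode())
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)
```

Usage would look like `with repo_lock("/path/to/repo"): ...`. A crashed process leaves the lock file behind, so a real format would also need to say how stale locks are detected and cleared.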

Migrations

Through the lifetime of an application, repo formats may require changes. These changes must be accompanied by a "migration tool", which converts the data from the most recent format version to the new one. Ideally the migration can be applied in both directions (old <-> new). For example, one may end up with a set of "repo version migration" tools like the following:

> ls ipfs/bin/repo-migrations
1-to-2
2-to-3
3-to-4
4-to-5
5-to-6
6-to-7

> ipfs/bin/repo-migrations/1-to-2
repository version: 3
already up to date.

> ipfs/bin/repo-migrations/3-to-4
repository version: 3
applying patch: 3-to-4
repository version: 4

> ipfs/bin/repo-migrations/3-to-4 --revert
repository version: 4
applying patch: 4-to-3
repository version: 3

> ipfs/bin/repo-migrations/run 1-to-7
repository version: 3
applying patch: 3-to-4
applying patch: 4-to-5
applying patch: 5-to-6
applying patch: 6-to-7
repository version: 7
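The `run` behaviour in the transcript above could be sketched roughly as follows. This is only an illustration: `Migration`, `migrate`, and the use of a plain dict with a `"version"` key as the repo are assumptions of this example, not the actual ipfs tooling. Each step carries both an `apply` and a `revert` function so the chain can be walked in either direction.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Migration:
    """One reversible step between two adjacent repo versions (hypothetical)."""
    from_version: int
    to_version: int
    apply: Callable[[Dict], None]
    revert: Callable[[Dict], None]


def migrate(repo: Dict, migrations: List[Migration], target: int) -> List[str]:
    """Move `repo` to `target`, one step at a time; return the patches applied."""
    by_from = {m.from_version: m for m in migrations}
    by_to = {m.to_version: m for m in migrations}
    applied = []
    # Upgrade path: apply N-to-(N+1) until the target is reached.
    while repo["version"] < target:
        step = by_from.get(repo["version"])
        if step is None:
            raise LookupError(f"no migration from version {repo['version']}")
        step.apply(repo)
        repo["version"] = step.to_version
        applied.append(f"{step.from_version}-to-{step.to_version}")
    # Downgrade path: revert the step that produced the current version.
    while repo["version"] > target:
        step = by_to.get(repo["version"])
        if step is None:
            raise LookupError(f"no migration to revert version {repo['version']}")
        step.revert(repo)
        repo["version"] = step.from_version
        applied.append(f"{step.to_version}-to-{step.from_version}")
    return applied
```

Starting from version 3 with steps 1-to-2 through 6-to-7 registered, `migrate(repo, steps, 7)` skips the steps that are already applied and runs only 3-to-4 onward, as in the transcript.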

It is advised that repo migration tools be virtual repo tools (that is, implemented to work with the logical repo rather than the concrete data). This makes it possible to reuse migration tools across repo implementations (with proper adapters). This may not always be possible; repo-format-specific migration tools might be necessary.

Human inspection

Repo implementations must include tools to transform the data into a human-readable, inspectable structure. This makes it possible for users and application implementors to debug problems. These tools may be easiest to implement as a human-readable repository format, plus conversion tools to convert to and from it.
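One possible shape for such a conversion pair, sketched in Python. The function names `dump_repo` and `load_repo`, the JSON layout, and the idea of modelling the repo as a dict of byte strings are all assumptions of this example. Values that are valid UTF-8 are shown as text; anything else falls back to hex so that nothing is lost on the way back.

```python
import json
from typing import Dict


def dump_repo(objects: Dict[str, bytes]) -> str:
    """Render repo contents as sorted, indented JSON for human inspection."""
    readable = {}
    for key, value in objects.items():
        try:
            readable[key] = {"text": value.decode("utf-8")}
        except UnicodeDecodeError:
            # Binary data: keep it lossless by storing a hex string instead.
            readable[key] = {"hex": value.hex()}
    return json.dumps(readable, indent=2, sort_keys=True)


def load_repo(text: str) -> Dict[str, bytes]:
    """Inverse of dump_repo: rebuild the original bytes from the JSON form."""
    objects = {}
    for key, entry in json.loads(text).items():
        if "text" in entry:
            objects[key] = entry["text"].encode("utf-8")
        else:
            objects[key] = bytes.fromhex(entry["hex"])
    return objects
```

Because the pair round-trips, the readable form can double as a debugging aid and as an interchange format between repo implementations.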

Corruption

...

betabrain commented 9 years ago

This is very interesting. It made me think of storage in general... Most applications use data structures, config files, databases, or any combination thereof. These represent the virtual repo. The concrete repo is either RAM, a local file, or a service on the other side of a network connection. Here are some thoughts.

What if there was a standard multirepo format?

access

var repo = require("multirepo").open("~/myrepo");

var users = repo.access("userdb"); // userdb is a relational datastore
users.query("select * from users;");

var notes = repo.access("notes"); // notes is an append-only list
notes.append({ note: "Hello world!", ts: new Date() });

meta layer

logical layer

These are the data structures the application reads and writes.

> repo logical

name           structure         signature                     entries
config         map               string -> string                    4
index          list              path -> delta                      21
objects        map               sha1 -> commit, tree, blob       2409

concrete layer

> repo concrete

name         backend          size
config       git-config      0.3 K
index        git-index      12.1 K
objects      git-objects    17.9 M

batteries included

> repo migrate-backend objects ipfs-git

> repo concrete

name         backend          size
config       git-config      0.3 K
index        git-index      12.1 K
objects      ipfs-git       21.4 M
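A very rough sketch of what `repo migrate-backend` might do underneath, in Python. Everything here is hypothetical: `migrate_backend` is an invented name, and the two backends are modelled as plain mappings standing in for `git-objects` and `ipfs-git`. The idea is to copy first, verify second, and leave the source untouched so the switch can never lose data.

```python
from typing import MutableMapping


def migrate_backend(source: MutableMapping, dest: MutableMapping) -> int:
    """Copy all entries from source to dest and verify them; return the count.

    The source is never modified, so a failed verification leaves the
    original backend fully usable.
    """
    for key in list(source):
        dest[key] = source[key]
    # Verify every entry before the caller switches to the new backend.
    for key in source:
        if key not in dest or dest[key] != source[key]:
            raise ValueError(f"verification failed for {key!r}")
    return len(source)
```

Only after this returns would the meta layer be updated to point `objects` at the new backend. Deleting the old data is a separate, explicit step.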