jbenet commented 7 years ago

This proposes some tooling for large datasets. Warning! As soon as I wrote it, I already wanted to change it. In particular, I want to change the db thing to just be a normal ipfs repo. It would help with the serving, too. We just need to land making ipfs repos super fast with swappable datastores (right now we can't quite do that).

Proposal posted at https://gist.github.com/jbenet/deda429fae2e5af9a86a01b0cbb614f7 and reproduced below for those getting email.

I will update it with the db -> repo thoughts, and update the gist and the comment below. I will comment when i update it so people get a notification, at least.

IPFS Tooling for datasets

Background

We need some tooling for a certain set of use cases around archival and dataset management. This tooling is meant to fit how people work with large files and large datasets.

Grounding Assumptions

Basic grounding assumptions here:

datasets are "large" (From GB to EB in size)
datasets should not be duplicated in the filesystem (eg into a .ipfs repo)
datasets may have different versions
datasets (at a particular version) are exactly determined (can be hashed)
people prefer to read and manipulate the datasets in a "working directory" style
it is not enough to have an HTTP or RPC API, but rather a POSIX filesystem api is essential
datasets can be represented as a tree of POSIX files and directories
datasets may be moved using non-ipfs tools
it would be useful to easily replicate and back up the content (ipfs, ipfs-cluster)
it would be useful to easily serve the content on the web (ipfs-gateway)
it would be useful (but not necessary) to digitally sign manifests

Why current IPFS tooling is not enough

The current ipfs tooling assumes we can import all data into a .ipfs repository directory. There are ongoing efforts to build filestore to allow referencing content outside of that directory, but this is not yet finalized, and all metadata is stored in the .ipfs repository, not with the directory in question.

We have often discussed Certified ARchives (.car) as a replacement for tar. This could be a future replacement, along with a reliable way to mount the .cars, but this is not yet here either.

Other tooling examples

BagIt - https://tools.ietf.org/html/draft-kunze-bagit-06#section-2.1.3
WARC - https://en.wikipedia.org/wiki/Web_ARChive
BitTorrent's "manifest-like" .torrent file

Tools for archiving websites:

https://github.com/edgi-govdata-archiving
The Internet Archive offers Brozzler as a tool for crawling and archiving sites.
Web Recorder lets you create verifiable web archive files for submission to the Internet Archive or hosting on your own.

Proposed Tooling Additions

This document proposes the addition or adjustment of the following tools:

dagger/dagify (or whatever is decided here) - a standalone tool that reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.
ipfs-pack - a standalone tool that creates an "ipfs pack" (similar to WARCs, BagIt, and .torrent files, but with IPLD and importers magic).
datadex or maybe gx-dataset - a tool to prepare and publish a dataset (as an ipfs-pack, guides user to add dataset metadata and license info, and publishes to a registry)
car (still only a proposed tool) - creates certified archives (single-file hash-linked archive, like a hash-linked .tar); will work closely with ipfs-pack.
The ipfs repo filestore abstractions can leverage ipfs-packs to understand what is being tracked.

`dagger/dagify`

This tool (name discussion here) reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.

> dagger -fmt <fmt-string> -r foo/bar/baz
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>

Where <fmt-string> is a format string that uniquely determines (forever) the whole dag structure, including chunking scheme, index layout, what is tracked in the index, what is left as raw nodes, etc. The idea is that this string (which ideally will be short) can uniquely describe a strategy for representing the source content as the output ipld graph, and that it can repeatably do so. Meaning that once a given fmt string produces one output, that output should never change (unless there is a major bug). This is because people must retain the ability to verify their content, and they need some primitive to do so.
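To make the "same fmt string, same graph" property concrete, here is a minimal sketch in Go. Everything in it is an assumption made for illustration: the format string syntax `fixed-<n>` (fixed-size chunks of n bytes) is hypothetical, hex SHA-256 digests stand in for real CIDs, and the "root" is simply a digest over the chunk digests rather than a real IPLD object.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// parseFmt reads a HYPOTHETICAL format string of the form "fixed-<n>",
// meaning "split the input into chunks of n bytes".
func parseFmt(fmtString string) (int, error) {
	if !strings.HasPrefix(fmtString, "fixed-") {
		return 0, fmt.Errorf("unsupported format string %q", fmtString)
	}
	n, err := strconv.Atoi(strings.TrimPrefix(fmtString, "fixed-"))
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("bad chunk size in format string %q", fmtString)
	}
	return n, nil
}

// dagify returns one stand-in id per chunk, in order, followed by the root id.
// ASSUMPTION: ids are hex SHA-256 digests, not real CIDs.
func dagify(fmtString string, data []byte) ([]string, error) {
	size, err := parseFmt(fmtString)
	if err != nil {
		return nil, err
	}
	var ids []string
	root := sha256.New()
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[start:end])
		ids = append(ids, hex.EncodeToString(sum[:]))
		root.Write(sum[:]) // the root commits to every chunk digest, in order
	}
	ids = append(ids, hex.EncodeToString(root.Sum(nil)))
	return ids, nil
}

func main() {
	ids, err := dagify("fixed-4", []byte("hello world"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, id := range ids {
		fmt.Println(id) // chunk ids first, root id last
	}
}
```

Running `dagify` twice with the same format string and input yields identical ids, while changing the format string changes the root. That is the property a verifier would rely on.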

`dagger/dagify --only-cid --only-root`

This tool will have an --only-cid flag that outputs only the cids:

> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>

And an --only-root flag that returns only the last (root) object or cid.

> dagger -fmt <fmt-string> -r foo/bar/baz --only-root
<last-ipld-object>

> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid --only-root
<last-ipld-cid>

`ipfs-pack` filesystem packing tool

The idea is that ipfs-pack is a filesystem packing tool that establishes the notion of a bundle, bag, or "pack" of files. We use "pack" to avoid confusing it with a Bag from BagIt, a very similar format (that ipfs-pack is compatible with). The way "packs" work is this:

There MUST BE a pack root directory that defines the pack. (eg at <path-to-pack-root>/) It contains all the pack contents and represents the pack in a filesystem.
There MUST BE a pack manifest file that tracks the ipfs hashes of the pack contents. (<pack-root>/PackManifest)
There MAY BE a pack object database cache file or directory that stores metadata on all the ipld objects in the pack. This is ancillary and can be reconstructed from a pack root at any time.

Subcommands

> ipfs-pack -h
USAGE
    ipfs-pack <subcommand> <arguments>

SUBCOMMANDS
    make     makes the package, overwriting the ipfs-pack manifest file.
    verify   verifies the ipfs-pack manifest file is correct.
    db       creates (or updates) a temporary ipfs object database `.ipfs-pack/db`
    serve    starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).
    bag      create BagIt spec-compliant bag from a pack.
    car      create a `.car` certified archive from a pack.

Usage Example

> pwd
/home/jbenet/myPack

> ls
someJSON.json
someXML.xml
moreData/

> ipfs-pack make
> ipfs-pack make -v
wrote PackManifest

> ls
someJSON.json
someXML.xml
moreData/
PackManifest

> cat PackManifest
QmVP2aaAWFe21QjUujMw5hwYRKD1eGx3yYWEBbMtuxpqXs moreData/0
QmV7eDE2WXuwQnvccsoXSzK5CQGXdFfay1LSadZCwyfbDV moreData/1
QmaMY7h9pmTcA5w9S2dsQT5eGLEQ1CwYQ32HwMTXAev5gQ moreData/2
QmQjYU5PscpCHadDbL1fDvTK4P9eXirSwD8hzJbAyrd5mf moreData/3
QmRErwActoLmffucXq7HPtefBC19MjWUcj1DdBoaAnMm6p moreData/4
QmeWvL929Tdhzw27CS5ZVHD73NQ9TT1xvLvCaXCgi7a9YB moreData/5
QmXbzZeh44jJEUueWjFxEiLcfAfzoaKYEy1fMHygkSD3hm moreData/6
QmYL17nYZrZsAhJut5v7ooD9hmz2rBotC1tqC9ZPxzCfer moreData/7
QmPKkidoUYX12PyCuKzehQuhEJofUJ9PPaX2Gc2iYd4GRs moreData/8
QmQAubXA3Gji5v5oaJhMbvmbGbiuwDf1u9sYsN125mcqrn moreData/9
QmYbYduoHMZAUMB5mjHoJHgJ9WndrdWkTCzuQ6yHkbgqkU someJSON.json
QmeWiZD5cdyiJoS3b7h87Cs9G21uQ1sLmeKrunTae9h5qG someXML.xml
QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm moreData
QmZ7iEGqahTHdUWGGZMUxYRXPwSM3UjBouneLcCmj9e6q6 .

> ipfs-pack db make
> ipfs-pack db make -v
wrote .ipfs-pack/db

> ls -a
./
../
.ipfs-pack/
someJSON.json
someXML.xml
moreData/
PackManifest

> find .ipfs-pack/
.ipfs-pack/
.ipfs-pack/db

`ipfs-pack make` create (or update) a pack manifest

This command creates (or updates) the pack's manifest file.

ipfs-pack make
# wrote PackManifest

`ipfs-pack verify` checks whether a pack matches its manifest

This command checks whether a pack matches its PackManifest.

# errors when there is no manifest
> random-files foo
> cd foo
> ipfs-pack verify
error: no PackManifest found

# succeeds when manifest and pack match
> ipfs-pack make
> ipfs-pack verify

# errors when manifest and pack do not match
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file1" >>PackManifest
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file2" >>PackManifest
> touch non-manifest-file3
> ipfs-pack verify
error: in manifest, missing from pack: non-existent-file1
error: in manifest, missing from pack: non-existent-file2
error: in pack, missing from manifest: non-manifest-file3
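
The two mismatch kinds in the example above can be sketched as a comparison of path sets. This is only an illustration under stated assumptions: it parses the `<hash> <path>` line format shown in the usage example, compares paths only (it does not re-hash file contents, which a real `verify` would also need to do), and the function names `parseManifest` and `verifyPaths` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// parseManifest reads "<hash> <path>" lines and returns a path -> hash map.
// The split is on the first space only, so paths may contain spaces.
func parseManifest(manifest string) (map[string]string, error) {
	entries := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "" {
			continue
		}
		parts := strings.SplitN(line, " ", 2)
		if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
			return nil, fmt.Errorf("malformed manifest line %q", line)
		}
		entries[parts[1]] = parts[0]
	}
	return entries, sc.Err()
}

// verifyPaths compares manifest paths with the paths found in the pack and
// returns messages in the style of the example output (sorted for stability).
// ASSUMPTION: presence check only; hashes are not recomputed here.
func verifyPaths(manifest map[string]string, packPaths []string) []string {
	onDisk := make(map[string]bool, len(packPaths))
	for _, p := range packPaths {
		onDisk[p] = true
	}
	var missingFromPack, missingFromManifest []string
	for p := range manifest {
		if !onDisk[p] {
			missingFromPack = append(missingFromPack, p)
		}
	}
	for p := range onDisk {
		if _, ok := manifest[p]; !ok {
			missingFromManifest = append(missingFromManifest, p)
		}
	}
	sort.Strings(missingFromPack)
	sort.Strings(missingFromManifest)

	var problems []string
	for _, p := range missingFromPack {
		problems = append(problems, "error: in manifest, missing from pack: "+p)
	}
	for _, p := range missingFromManifest {
		problems = append(problems, "error: in pack, missing from manifest: "+p)
	}
	return problems
}

func main() {
	manifest, err := parseManifest(
		"QmYbYduoHMZAUMB5mjHoJHgJ9WndrdWkTCzuQ6yHkbgqkU someJSON.json\n" +
			"QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file1\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, msg := range verifyPaths(manifest, []string{"someJSON.json", "non-manifest-file3"}) {
		fmt.Println(msg)
	}
}
```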

`ipfs-pack db` creates (or updates) a temporary ipfs object database

This command creates (or updates) a temporary ipfs object database (eg at .ipfs-pack/db). This object database contains positional metadata for all IPLD objects contained in the pack. (It follows the ipfs repo filestore metadata concerns). It MAY be a different, simpler object-db format, or be a full-fledged ipfs node repo using filestore.

The db is a simple key-value store that supports:

maps { <ipld-cid> : <filestore-descriptor> }
supports: list() []<ipld-cid> to show all cids in db
supports: put(<ipld-object>) <ipld-cid>
supports: get(<ipld-cid>) <ipld-object>
supports: putDescriptor(<ipld-cid>, <filestore-descriptor>)
supports: getDescriptor(<ipld-cid>) <filestore-descriptor>
supports: delete() to remove itself from disk

Notes:

<filestore-descriptor> is the metadata necessary to reconstruct the entire object from data in the pack.
{get,put} should be able to add or retrieve the objects from db or from the data in the pack.
{get,put}Descriptor should be able to add or retrieve file descriptors for objects stored in the pack.
Intermediate ipld objects (eg intermediate objects in a file, which are not raw data nodes) may need to be stored in the db.

This database basically implements:

type PackObjectDB interface {
  // Make creates or updates a pack-db at packdbPath,
  // with data for all the objects in the pack at packPath.
  Make(packPath string, packdbPath string) error

  // Put associates the given filestore.Descriptor with the given ipld.CID.
  // If the filestore.Descriptor is nil, Put removes the entry for ipld.CID (rm).
  Put(ipld.CID, filestore.Descriptor) error

  // Get retrieves the filestore.Descriptor associated with the given ipld.CID.
  Get(ipld.CID) (filestore.Descriptor, error)

  // List returns all ipld.CID stored in the database
  List() (<-chan ipld.CID, error)

  // Delete deletes all the database contents and clears all files
  Delete() error
}
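
For a feel of the semantics, here is a minimal in-memory sketch of the Put/Get/List/Delete part of that interface. It is not the proposed implementation: `CID` and `Descriptor` are local stand-ins for `ipld.CID` and `filestore.Descriptor`, a pointer is used so that a nil descriptor can mean "remove", `Make` is omitted because it needs a real pack on disk, and `ErrNotFound` is a name invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// CID and Descriptor are HYPOTHETICAL stand-ins for ipld.CID and
// filestore.Descriptor, which are not defined in the proposal.
type CID string

type Descriptor struct {
	Path         string
	Offset, Size int64
}

// ErrNotFound is returned by Get when the cid has no entry (assumed name).
var ErrNotFound = errors.New("cid not in pack-db")

// memDB keeps the cid -> descriptor map in memory, guarded by a mutex.
type memDB struct {
	mu      sync.RWMutex
	entries map[CID]Descriptor
}

func newMemDB() *memDB { return &memDB{entries: make(map[CID]Descriptor)} }

// Put stores the descriptor, or removes the entry when d is nil.
func (db *memDB) Put(c CID, d *Descriptor) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	if d == nil {
		delete(db.entries, c)
		return nil
	}
	db.entries[c] = *d
	return nil
}

// Get returns a copy of the stored descriptor, or ErrNotFound.
func (db *memDB) Get(c CID) (*Descriptor, error) {
	db.mu.RLock()
	defer db.mu.RUnlock()
	d, ok := db.entries[c]
	if !ok {
		return nil, ErrNotFound
	}
	return &d, nil
}

// List returns all cids on an already-filled, closed channel, in sorted order.
func (db *memDB) List() (<-chan CID, error) {
	db.mu.RLock()
	cids := make([]CID, 0, len(db.entries))
	for c := range db.entries {
		cids = append(cids, c)
	}
	db.mu.RUnlock()
	sort.Slice(cids, func(i, j int) bool { return cids[i] < cids[j] })

	out := make(chan CID, len(cids))
	for _, c := range cids {
		out <- c
	}
	close(out)
	return out, nil
}

// Delete drops every entry; an on-disk db would also remove its files.
func (db *memDB) Delete() error {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.entries = make(map[CID]Descriptor)
	return nil
}

func main() {
	db := newMemDB()
	db.Put("cid-b", &Descriptor{Path: "moreData/1", Size: 10})
	db.Put("cid-a", &Descriptor{Path: "moreData/0", Size: 20})
	cids, _ := db.List()
	for c := range cids {
		d, _ := db.Get(c)
		fmt.Println(c, d.Path, d.Size)
	}
}
```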

And does so either through a programmatic interface (some go package) or via cli tooling:

> ipfs-pack-db --help
USAGE
    ipfs-pack-db <subcommand> <arguments>

SUBCOMMANDS
    make     creates (or updates) the pack-db for a pack directory
    list     lists all cids in the pack-db
    put      adds a (cid, filestore-descriptor) entry.
    get      retrieves the filestore-descriptor for a given cid.
    delete   removes all files representing the pack-db (destructive)

`ipfs-pack serve` starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).

This command starts an ipfs node serving the pack's contents (to IPFS and/or HTTP). This command MAY require a full go-ipfs installation to exist. It MAY be a standalone binary (ipfs-pack-serve). It MUST use an ephemeral node or a one-off node whose id would be stored locally, in the pack, at <pack-root>/.ipfs-pack/repo.

> ipfs-pack serve --http
Serving pack at /ip4/0.0.0.0/tcp/1234/http - http://127.0.0.1:1234

> ipfs-pack serve --ipfs
Serving pack at /ip4/0.0.0.0/tcp/1234/ipfs/QmPVUA4rJgckcf1ifrZF5KvwV1Uib5SGjJ7Z5BskEpTaSE

`ipfs-pack bag` convert to and from BagIt (spec-compliant) bags.

This command converts between packs and BagIt (spec-compliant) bags, a commonly used archiving format very similar to ipfs-pack. It works like this:

> ipfs-pack bag --help
USAGE
  ipfs-pack-bag <src-pack> <dst-bag>
  ipfs-pack-bag <src-bag> <dst-pack>

# convert from pack to bag
> ipfs-pack bag path/to/mypack path/to/mybag

# convert from bag to pack
> ipfs-pack bag path/to/mybag path/to/mypack

`ipfs-pack car` convert to and from a car (certified archive).

This command converts between packs and cars (certified archives). It works like this:

> ipfs-pack car --help
USAGE
  ipfs-pack-car <src-pack> <dst-car>
  ipfs-pack-car <src-car> <dst-pack>

# convert from pack to car
> ipfs-pack car path/to/mypack path/to/mycar.car

# convert from car to pack
> ipfs-pack car path/to/mycar.car path/to/mypack

`datadex` or maybe `gx-dataset`

WIP

a tool to prepare and publish a dataset (as an ipfs-pack, guides user to add dataset metadata and license info, and publishes to a registry)

`car` - certified archives

WIP

cars would interop with packs.

The `ipfs repo filestore`

WIP

Maybe the ipfs repo filestore abstractions can leverage ipfs-packs to understand what is being tracked in a given directory, particularly if those packs have up-to-date local dbs of all their objects.

jbenet commented 7 years ago

Question about BagIt: how does it do hashes of directories / the tree? i didnt see that when looking at the spec, but i may just have missed it.

jbenet commented 7 years ago

the db thing would be super useful for large things, it could just be a local .ipfs repo -- we'd just need to train go-ipfs to leverage repos it finds in directories when adding, or tracking such a "subrepo". something like an "ingest this repo" but without copying the data, it just does a union. need to figure out how to make those subrepo accesses fast.

flyingzumwalt commented 7 years ago

@edsu can we entice you to comment? Your input might make all the difference.

flyingzumwalt commented 7 years ago

Lol. "ipfs-pack your bags" That's very clever @jbenet

edsu commented 7 years ago

@jbenet it's true empty directories are not present in the BagIt manifest. In practice some folks who have wanted to preserve the presence of empty directories have created an empty .keep file in the directory or documented the directories' presence in the bag-info.txt

I'm really interested to learn more about what the IPLD representation would look like. I am definitely not up on all the IPFS features/functionality. Would the putDescriptor allow people to add metadata about their packages, such as a name for the dataset, who created it, etc?

My understanding is that IPFS is largely file oriented. Is it fair to say that this proposal adds the notion of sets of files and tools for working with them?

You are probably familiar with them already, but this makes me think of two other points of reference for work in this area that you might be interested in:

Data Packages which @rufuspollock could tell you more about.
Packaging on the Web which is a W3C effort that @JeniT could speak to.

I suspect both would be interested in the work you are proposing.

edsu commented 7 years ago

Oh, and of course @maxogden's Dat comes to mind. I know you are talking already. I wonder if some of this higher level of abstraction and tooling around datasets could be handled by an IPFS enabled Dat?

rufuspollock commented 7 years ago

@jbenet I don't know how familiar you are with some of the Frictionless Data stuff and esp Data Package - http://specs.frictionlessdata.io/data-packages/?

I know you commented a bit a couple of years ago on the Frictionless Data stuff when you were looking at data package managers, and we discussed it at some length in London last year. In general, I'd say if you are looking at a simple structure on disk for describing a "package" of data, it would be a good fit.

I should say I did not immediately grok what exactly you are up to here from the description above e.g. what is an .ipfs repository (and how it relates to overall design of ipfs).

flyingzumwalt commented 7 years ago

Another important reference: pairtrees -- these are commonly used in the archives space. It's the tool that many digital preservation teams reach for first when they want to preserve a lot of bits.

Pairtree is a filesystem hierarchy for holding objects that are located by mapping identifier strings to object directory (or folder) paths two characters at a time. Pairtrees have the advantage that many object operations, including backup and restore, can be performed with native operating system tools.
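
As a rough sketch of the "two characters at a time" mapping described above (the function name `pairtreePath` is made up for this example, and the identifier character-cleaning step from the Pairtree specification is deliberately left out):

```go
package main

import (
	"fmt"
	"strings"
)

// pairtreePath splits an identifier into two-character path segments.
// A trailing single character becomes its own final segment.
// ASSUMPTION: the identifier is already "clean"; the escaping rules of the
// Pairtree specification are not implemented here.
func pairtreePath(id string) string {
	runes := []rune(id) // split on characters, not bytes
	var parts []string
	for i := 0; i < len(runes); i += 2 {
		end := i + 2
		if end > len(runes) {
			end = len(runes)
		}
		parts = append(parts, string(runes[i:end]))
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(pairtreePath("abcd"))  // ab/cd
	fmt.Println(pairtreePath("abcde")) // ab/cd/e
}
```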

rht commented 7 years ago

(redirected from https://github.com/ipfs/archives/issues/96#issuecomment-272924005)

If I read the draft proposal correctly:

a manifest file is a flat hash table of the filenames, almost like the output of vanilla ipfs add
no further metadata is specified, such as byte boundaries (as in a .torrent file), permission string, size, or mime type
If "data" is a git repo, how is this versioned? Is the entire content of '.git' added like any other file, or does the versioning scheme adapt to the pre-existing one used by the repo?

I also made some sample packaged files which each contains a build script and ipld-based file list for the manifest (where details like the ones in json-schema can be added, e.g. mime-type). Ref: https://github.com/ipfs/archives/pull/101 (where I assume the default handler for software packages is gx).

I have made sure to find the generic common ground among metadata fields and spec from dat, frictionlessdata, json-schema (have yet to fold in bagit, .torrent file, warc, pairtree -- imo it is clearest if there is a comparison matrix table among all of these standards).

flyingzumwalt commented 7 years ago

@edsu commented

My understanding is that IPFS is largely file oriented.

The messaging and the name IPFS mislead people into thinking IPFS is just for files. It's actually a content-addressed protocol for distributing Merkle DAGs, which you can use to represent anything. That's why IPLD is so important -- it gives us a basic structure for representing any data structure as a DAG that can be written directly to IPFS and addressed using IPLD paths.

Is it fair to say that this proposal adds the notion of sets of files and tools for working with them?

ipfs-pack is a move in that direction. For the first pass, ipfs-pack will help us support the use case where users Use Manifest Files to Track Directory Structure & Contents, which allows us to Track a Directory and Serve it on IPFS without making duplicate local copies of the data. This will eventually allow us to Round-trip whole directories through IPFS and Mount directories by auto-detecting their ipfs-pack manifests or prebuilt object databases

That gives a very strong starting point for using IPLD to properly represent sets of files, but there will certainly be more work to establish the best metadata patterns.

rht commented 7 years ago

The draft proposal was made after ipld's existence, but the content of the PackManifest in the example in https://github.com/ipfs/notes/issues/205#issue-197357094 is misleading, since it is not in ipld format and hence has no file attributes.

rufuspollock commented 7 years ago

@flyingzumwalt have you taken a look at the Data Package specs -- it seems the basic Data Package could act as a reasonable match for the manifest file here.

http://specs.frictionlessdata.io/data-packages/

ipfs / notes

Proposing some tooling for datasets (ipfs-pack and stuff) #205
