ipfs / notes

IPFS Collaborative Notebook for Research
MIT License

Proposing some tooling for datasets (ipfs-pack and stuff) #205

Open jbenet opened 7 years ago

jbenet commented 7 years ago

This proposes some tooling for large datasets. Warning! As soon as I wrote it, I already wanted to change it. In particular, I want to change the db thing to just be a normal ipfs repo. It would help with the serving, too. We just need to land making ipfs repos super fast with swappable datastores (right now we can't quite do that).

Proposal posted at https://gist.github.com/jbenet/deda429fae2e5af9a86a01b0cbb614f7 and reproduced below for those getting email.

I will update it with the db -> repo thoughts, and update the gist and the comment below. I will comment when I update it so people get a notification, at least.

IPFS Tooling for datasets

Background

We need some tooling for a certain set of use cases around archival and dataset management. This tooling is meant to fit how people work with large files and large datasets.

Grounding Assumptions

Basic grounding assumptions here:

Why current IPFS tooling is not enough

The current ipfs tooling assumes we can import all data into a .ipfs repository directory. There are ongoing efforts to build filestore to allow referencing content outside of that directory, but this is not yet finalized, and all metadata is stored in the .ipfs repository, not with the directory in question.

We have often discussed Certified ARchives (.car) as a replacement for tar. This could be a future replacement, along with a reliable way to mount the .cars, but this is not yet here either.

Other tooling examples

Tools for archiving websites:

Proposed Tooling Additions

This document proposes the addition or adjustment of the following tools:

dagger/dagify

This tool (name discussion here) reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.

> dagger -fmt <fmt-string> -r foo/bar/baz
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>

Where <fmt-string> is a format string that uniquely determines (forever) the whole dag structure, including chunking scheme, index layout, what is tracked in the index, what is left as raw nodes, etc. The idea is that this string (which ideally will be short) uniquely describes a strategy for representing the source content as the output ipld graph, and does so repeatably: once a given fmt string produces one output, that output should never change (unless there is a major bug). This is because people must retain the ability to verify their content, and they need some primitive to do so.
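To make the repeatability requirement concrete, here is a minimal Go sketch. It is not the proposed dagger tool: the `fixed-<n>` format string syntax, the `parseFmt` and `dagify` names, and the use of hex SHA-256 digests in place of real CIDs are all assumptions made for illustration. The only point is that the same format string and the same bytes always yield the same ordered list of identifiers, with the root last.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// parseFmt parses a hypothetical format string such as "fixed-4",
// meaning "fixed-size chunks of 4 bytes". This syntax is an assumption
// of the sketch, not something the proposal defines.
func parseFmt(f string) (int, error) {
	if !strings.HasPrefix(f, "fixed-") {
		return 0, fmt.Errorf("unsupported format string %q", f)
	}
	n, err := strconv.Atoi(strings.TrimPrefix(f, "fixed-"))
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("bad chunk size in %q", f)
	}
	return n, nil
}

// dagify splits data into chunks as dictated by the format string and
// returns the leaf digests in order, followed by a root digest computed
// over the concatenated leaf digests. Hex SHA-256 stands in for CIDs.
func dagify(f string, data []byte) ([]string, error) {
	size, err := parseFmt(f)
	if err != nil {
		return nil, err
	}
	var ids []string
	root := sha256.New()
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[start:end])
		ids = append(ids, hex.EncodeToString(sum[:]))
		root.Write(sum[:])
	}
	// The root always comes last, mirroring the in-order output above.
	ids = append(ids, hex.EncodeToString(root.Sum(nil)))
	return ids, nil
}

func main() {
	ids, err := dagify("fixed-4", []byte("hello world"))
	if err != nil {
		panic(err)
	}
	for _, id := range ids {
		fmt.Println(id)
	}
}
```

Changing the format string (say, to `fixed-5`) changes the chunk boundaries and therefore the root, which is why the string has to be recorded alongside the data for later verification.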

dagger/dagify --only-cid --only-root

This tool will have an --only-cid flag that outputs only the cids:

> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>

And an --only-root flag that returns only the last (root) object or cid.

> dagger -fmt <fmt-string> -r foo/bar/baz --only-root
<last-ipld-object>

> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid --only-root
<last-ipld-cid>

ipfs-pack filesystem packing tool

The idea is that ipfs-pack is a filesystem packing tool that establishes the notion of a bundle, bag, or "pack" of files. We use "pack" to avoid confusing it with a Bag from BagIt, a very similar format (that ipfs-pack is compatible with). The way "packs" work is this:

Subcommands

> ipfs-pack -h
USAGE
    ipfs-pack <subcommand> <arguments>

SUBCOMMANDS
    make     makes the package, overwriting the ipfs-pack manifest file.
    verify   verifies the ipfs-pack manifest file is correct.
    db       creates (or updates) a temporary ipfs object database `.ipfs-pack/db`
    serve    starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).
    bag      create BagIt spec-compliant bag from a pack.
    car      create a `.car` certified archive from a pack.

Usage Example

> pwd
/home/jbenet/myPack

> ls
someJSON.json
someXML.xml
moreData/

> ipfs-pack make
> ipfs-pack make -v
wrote PackManifest

> ls
someJSON.json
someXML.xml
moreData/
PackManifest

> cat PackManifest
QmVP2aaAWFe21QjUujMw5hwYRKD1eGx3yYWEBbMtuxpqXs moreData/0
QmV7eDE2WXuwQnvccsoXSzK5CQGXdFfay1LSadZCwyfbDV moreData/1
QmaMY7h9pmTcA5w9S2dsQT5eGLEQ1CwYQ32HwMTXAev5gQ moreData/2
QmQjYU5PscpCHadDbL1fDvTK4P9eXirSwD8hzJbAyrd5mf moreData/3
QmRErwActoLmffucXq7HPtefBC19MjWUcj1DdBoaAnMm6p moreData/4
QmeWvL929Tdhzw27CS5ZVHD73NQ9TT1xvLvCaXCgi7a9YB moreData/5
QmXbzZeh44jJEUueWjFxEiLcfAfzoaKYEy1fMHygkSD3hm moreData/6
QmYL17nYZrZsAhJut5v7ooD9hmz2rBotC1tqC9ZPxzCfer moreData/7
QmPKkidoUYX12PyCuKzehQuhEJofUJ9PPaX2Gc2iYd4GRs moreData/8
QmQAubXA3Gji5v5oaJhMbvmbGbiuwDf1u9sYsN125mcqrn moreData/9
QmYbYduoHMZAUMB5mjHoJHgJ9WndrdWkTCzuQ6yHkbgqkU someJSON.json
QmeWiZD5cdyiJoS3b7h87Cs9G21uQ1sLmeKrunTae9h5qG someXML.xml
QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm moreData
QmZ7iEGqahTHdUWGGZMUxYRXPwSM3UjBouneLcCmj9e6q6 .

> ipfs-pack db make
> ipfs-pack db make -v
wrote .ipfs-pack/db

> ls -a
./
../
.ipfs-pack/
someJSON.json
someXML.xml
moreData/
PackManifest

> find .ipfs-pack/
.ipfs-pack/
.ipfs-pack/db

ipfs-pack make create (or update) a pack manifest

This command creates (or updates) the pack's manifest file.

ipfs-pack make
# wrote PackManifest

ipfs-pack verify checks whether a pack matches its manifest

This command checks whether a pack matches its PackManifest.

# errors when there is no manifest
> random-files foo
> cd foo
> ipfs-pack verify
error: no PackManifest found

# succeeds when manifest and pack match
> ipfs-pack make
> ipfs-pack verify

# errors when manifest and pack do not match
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file1" >>PackManifest
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file2" >>PackManifest
> touch non-manifest-file3
> ipfs-pack verify
error: in manifest, missing from pack: non-existent-file1
error: in manifest, missing from pack: non-existent-file2
error: in pack, missing from manifest: non-manifest-file3
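The path comparison behind those error messages amounts to a two-way set difference. The Go sketch below shows only that part, under stated assumptions: it compares names and does not re-hash file contents, and `parseManifest` and `diff` are hypothetical helper names, not part of ipfs-pack.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parseManifest extracts the path column from "<cid> <path>" lines.
func parseManifest(text string) []string {
	var paths []string
	for _, line := range strings.Split(text, "\n") {
		// Split on the first space only, so paths containing spaces survive.
		parts := strings.SplitN(strings.TrimSpace(line), " ", 2)
		if len(parts) == 2 {
			paths = append(paths, parts[1])
		}
	}
	return paths
}

// diff reports paths that appear on only one side, using the message
// wording from the transcript above. Results are sorted so the output
// is stable across runs.
func diff(manifest, pack []string) []string {
	inManifest := map[string]bool{}
	for _, p := range manifest {
		inManifest[p] = true
	}
	inPack := map[string]bool{}
	for _, p := range pack {
		inPack[p] = true
	}
	var errs []string
	for p := range inManifest {
		if !inPack[p] {
			errs = append(errs, "error: in manifest, missing from pack: "+p)
		}
	}
	for p := range inPack {
		if !inManifest[p] {
			errs = append(errs, "error: in pack, missing from manifest: "+p)
		}
	}
	sort.Strings(errs)
	return errs
}

func main() {
	// "cid1" and "cid2" are placeholders, not real CIDs.
	manifest := parseManifest("cid1 a.txt\ncid2 non-existent-file1\n")
	pack := []string{"a.txt", "non-manifest-file3"}
	for _, e := range diff(manifest, pack) {
		fmt.Println(e)
	}
}
```

A fuller verifier would also recompute each file's hash and compare it with the manifest entry; that step is left out here to keep the sketch short.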

ipfs-pack db creates (or updates) a temporary ipfs object database

This command creates (or updates) a temporary ipfs object database (e.g. at .ipfs-pack/db). This object database contains positional metadata for all IPLD objects contained in the pack. (It follows the ipfs repo filestore metadata concerns.) It MAY be a different, simpler object-db format, or be a full-fledged ipfs node repo using filestore.

The db is a simple key-value store that supports:

Notes:

This database basically implements:

type PackObjectDB interface {  
  // Make creates or updates a pack-db at packdbPath, 
  // with data for all the objects in the pack at packPath.
  Make(packPath string, packdbPath string) error

  // Put associates the given filestore.Descriptor with the given ipld.CID.
  // If the filestore.Descriptor is nil, Put removes the entry for ipld.CID (rm).
  Put(ipld.CID, filestore.Descriptor) error

  // Get retrieves the filestore.Descriptor associated with the given ipld.CID
  Get(ipld.CID) (filestore.Descriptor, error)

  // List returns all ipld.CID stored in the database
  List() (<-chan ipld.CID, error)

  // Delete deletes all the database contents and clears all files
  Delete() error
}

And does so both through a programmatic interface (some go package) and via cli tooling:

> ipfs-pack-db --help
USAGE
    ipfs-pack-db <subcommand> <arguments>

SUBCOMMANDS
    make     creates (or updates) the pack-db for a pack directory
    list     lists all cids in the pack-db
    put      adds a (cid, filestore-descriptor) entry.
    get      retrieves the filestore-descriptor for a given cid.
    delete   removes all files representing the pack-db (destructive)
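For a feel of how small the key-value part of the PackObjectDB interface is, here is an in-memory Go sketch of Put, Get, List and Delete. Everything in it is an assumption for illustration: `CID` and `Descriptor` are stand-in types (the proposal does not define the fields of `ipld.CID` or `filestore.Descriptor`), the descriptor is passed as a pointer so that nil can mean "remove", and `Make` is omitted because it depends on the pack layout.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// CID and Descriptor are hypothetical stand-ins for ipld.CID and
// filestore.Descriptor. The fields are guesses at "positional metadata".
type CID string

type Descriptor struct {
	Path   string // file the object's bytes live in
	Offset int64  // byte offset within that file
	Size   int64  // number of bytes
}

var ErrNotFound = errors.New("cid not found")

// memDB is an in-memory key-value store guarded by a mutex.
type memDB struct {
	mu      sync.RWMutex
	entries map[CID]Descriptor
}

func newMemDB() *memDB { return &memDB{entries: map[CID]Descriptor{}} }

// Put stores the descriptor for c; a nil descriptor removes the entry.
func (db *memDB) Put(c CID, d *Descriptor) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	if d == nil {
		delete(db.entries, c)
		return nil
	}
	db.entries[c] = *d
	return nil
}

// Get returns the descriptor for c, or ErrNotFound.
func (db *memDB) Get(c CID) (Descriptor, error) {
	db.mu.RLock()
	defer db.mu.RUnlock()
	d, ok := db.entries[c]
	if !ok {
		return Descriptor{}, ErrNotFound
	}
	return d, nil
}

// List returns all stored CIDs, sorted, on a closed buffered channel.
func (db *memDB) List() (<-chan CID, error) {
	db.mu.RLock()
	cids := make([]CID, 0, len(db.entries))
	for c := range db.entries {
		cids = append(cids, c)
	}
	db.mu.RUnlock()
	sort.Slice(cids, func(i, j int) bool { return cids[i] < cids[j] })
	out := make(chan CID, len(cids))
	for _, c := range cids {
		out <- c
	}
	close(out)
	return out, nil
}

// Delete clears every entry.
func (db *memDB) Delete() error {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.entries = map[CID]Descriptor{}
	return nil
}

func main() {
	db := newMemDB()
	db.Put("cid-a", &Descriptor{Path: "moreData/0", Offset: 0, Size: 1024})
	d, _ := db.Get("cid-a")
	fmt.Println(d.Path, d.Offset, d.Size)
}
```

A real pack-db would persist to disk at `.ipfs-pack/db`; the in-memory map is used here only to keep the example self-contained.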

ipfs-pack serve starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).

This command starts an ipfs node serving the pack's contents (to IPFS and/or HTTP). This command MAY require a full go-ipfs installation to exist. It MAY be a standalone binary (ipfs-pack-serve). It MUST use an ephemeral node or a one-off node whose id would be stored locally, in the pack, at <pack-root>/.ipfs-pack/repo

> ipfs-pack serve --http
Serving pack at /ip4/0.0.0.0/tcp/1234/http - http://127.0.0.1:1234

> ipfs-pack serve --ipfs
Serving pack at /ip4/0.0.0.0/tcp/1234/ipfs/QmPVUA4rJgckcf1ifrZF5KvwV1Uib5SGjJ7Z5BskEpTaSE

ipfs-pack bag convert to and from BagIt (spec-compliant) bags.

This command converts between packs and BagIt (spec-compliant) bags. BagIt is a commonly used archiving format very similar to ipfs-pack. It works like this:

> ipfs-pack bag --help
USAGE
  ipfs-pack-bag <src-pack> <dst-bag>
  ipfs-pack-bag <src-bag> <dst-pack>

# convert from pack to bag
> ipfs-pack bag path/to/mypack path/to/mybag

# convert from bag to pack
> ipfs-pack bag path/to/mybag path/to/mypack

ipfs-pack car convert to and from a car (certified archive).

This command converts between packs and cars (certified archives). It works like this:

> ipfs-pack car --help
USAGE
  ipfs-pack-car <src-pack> <dst-car>
  ipfs-pack-car <src-car> <dst-pack>

# convert from pack to car
> ipfs-pack car path/to/mypack path/to/mycar.car

# convert from car to pack
> ipfs-pack car path/to/mycar.car path/to/mypack

datadex or maybe gx-dataset

WIP

A tool to prepare and publish a dataset as an ipfs-pack: it guides the user to add dataset metadata and license info, and publishes to a registry.

car - certified archives

WIP

cars would interop with packs.

The ipfs repo filestore

WIP

Maybe the ipfs repo filestore abstractions can leverage ipfs-packs to understand what is being tracked in a given directory, particularly if those packs have up-to-date local dbs of all their objects.

jbenet commented 7 years ago

Question about BagIt: how does it do hashes of directories / the tree? I didn't see that when looking at the spec, but I may just have missed it.

jbenet commented 7 years ago

The db thing would be super useful for large things. It could just be a local .ipfs repo; we'd just need to train go-ipfs to leverage repos it finds in directories when adding, or tracking such a "subrepo". Something like an "ingest this repo" but without copying the data, just doing a union. We need to figure out how to make those subrepo accesses fast.

flyingzumwalt commented 7 years ago

@edsu can we entice you to comment? Your input might make all the difference.

flyingzumwalt commented 7 years ago

Lol. "ipfs-pack your bags" That's very clever @jbenet

edsu commented 7 years ago

@jbenet it's true empty directories are not present in the BagIt manifest. In practice some folks who have wanted to preserve the presence of empty directories have created an empty .keep file in the directory or documented the directories' presence in the bag-info.txt

I'm really interested to learn more about what the IPLD representation would look like. I am definitely not up on all the IPFS features/functionality. Would the putDescriptor allow people to add metadata about their packages, such as a name for the dataset, who created it, etc?

My understanding is that IPFS is largely file oriented. Is it fair to say that this proposal adds the notion of sets of files and tools for working with them?

You are probably familiar with them already, but this makes me think of two other points of reference for work in this area that you might be interested in:

I suspect both would be interested in the work you are proposing.

edsu commented 7 years ago

Oh, and of course @maxogden's Dat comes to mind. I know you are talking already. I wonder if some of this higher level of abstraction and tooling around datasets could be handled by an IPFS enabled Dat?

rufuspollock commented 7 years ago

@jbenet I don't know how familiar you are with some of the Frictionless Data stuff and esp Data Package - http://specs.frictionlessdata.io/data-packages/?

I know you commented a bit a couple of years ago on the Frictionless Data stuff when you were looking at data package managers, and we discussed it at some length in London last year. In general, I'd say if you are looking at a simple structure on disk for describing a "package" of data it would be a good fit.

I should say I did not immediately grok what exactly you are up to here from the description above e.g. what is an .ipfs repository (and how it relates to overall design of ipfs).

flyingzumwalt commented 7 years ago

Another important reference: pairtrees -- these are commonly used in the archives space. It's the tool that many digital preservation teams reach for first when they want to preserve a lot of bits.

Pairtree is a filesystem hierarchy for holding objects that are located by mapping identifier strings to object directory (or folder) paths two characters at a time. Pairtrees have the advantage that many object operations, including backup and restore, can be performed with native operating system tools.
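A minimal Go sketch of the two-characters-at-a-time mapping described above. This is a simplification: the Pairtree specification also defines a character-cleaning step for identifiers, which is omitted here, so the sketch assumes the identifier contains only characters that are safe in file names. The `pairPath` name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// pairPath maps an identifier to a directory path by splitting it into
// two-character segments; a trailing odd character gets its own segment.
func pairPath(id string) string {
	r := []rune(id)
	var parts []string
	for i := 0; i < len(r); i += 2 {
		end := i + 2
		if end > len(r) {
			end = len(r)
		}
		parts = append(parts, string(r[i:end]))
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(pairPath("abcd"))  // prints "ab/cd"
	fmt.Println(pairPath("abcde")) // prints "ab/cd/e"
}
```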

rht commented 7 years ago

(redirected from https://github.com/ipfs/archives/issues/96#issuecomment-272924005)

If I read the draft proposal correctly:

I also made some sample packaged files, each of which contains a build script and an ipld-based file list for the manifest (where details like the ones in json-schema can be added, e.g. mime-type). Ref: https://github.com/ipfs/archives/pull/101 (where I assume the default handler for software packages is gx).

I have made sure to find the generic common ground among metadata fields and spec from dat, frictionlessdata, json-schema (have yet to fold in bagit, .torrent file, warc, pairtree -- imo it is clearest if there is a comparison matrix table among all of these standards).

flyingzumwalt commented 7 years ago

@edsu commented

My understanding is that IPFS is largely file oriented.

The messaging and the name IPFS mislead people into thinking IPFS is just for files. It's actually a content-addressed protocol for distributing Merkle DAGs, which you can use to represent anything. That's why IPLD is so important -- it gives us a basic structure for representing any data structure as a DAG that can be written directly to IPFS and addressed using IPLD paths.

Is it fair to say that this proposal adds the notion of sets of files and tools for working with them?

ipfs-pack is a move in that direction. For the first pass, ipfs-pack will help us support the use case where users Use Manifest Files to Track Directory Structure & Contents, which allows us to Track a Directory and Serve it on IPFS without making duplicate local copies of the data. This will eventually allow us to Round-trip whole directories through IPFS and Mount directories by auto-detecting their ipfs-pack manifests or prebuilt object databases.

That gives a very strong starting point for using IPLD to properly represent sets of files, but there will certainly be more work to establish the best metadata patterns.

rht commented 7 years ago

The draft proposal was made after IPLD already existed, but the content of the PackManifest in the example at https://github.com/ipfs/notes/issues/205#issue-197357094 is misleading, since it is not in IPLD format and hence carries no file attributes.

rufuspollock commented 7 years ago

@flyingzumwalt have you taken a look at the Data Package specs? It seems the basic Data Package could act as a reasonable match for the manifest file here.

http://specs.frictionlessdata.io/data-packages/