jbenet / data

package manager for datasets
59 stars 1 forks source link

local blobstore? global? #2

Open jbenet opened 10 years ago

jbenet commented 10 years ago

data-blob doc describing more copied below.

Should there be a local blobstore separate from the working directory datasets? Should it be global?

Implications:

data blob - Manage blobs in the blobstore.

    Managing blobs means:

      put <hash>    Upload blob named by <hash> to blobstore.
      get <hash>    Download blob named by <hash> from blobstore.
      check <hash>  Verify blob contents named by <hash> match <hash>.
      show <hash>   Output blob contents named by <hash>.

    What is a blob?

    Datasets are made up of files, which are made up of blobs.
    (For now, 1 file is 1 blob. Chunking to be implemented)
    Blobs are basically blocks of data, which are checksummed
    (for integrity, de-duplication, and addressing) using a crypto-
    graphic hash function (sha1, for now). If git comes to mind,
    that's exactly right.

    Local Blobstores

    data stores blobs in blobstores. Every local dataset has a
    blobstore (local caching with links TBI). Like in git, the blobs
    are stored safely in the blobstore (different directory) and can
    be used to reconstruct any corrupted/deleted/modified dataset files.

    Remote Blobstores

    data uses remote blobstores to distribute datasets across users.
    The datadex service includes a blobstore (currently an S3 bucket).
    By default, the global datadex blobstore is where things are
    uploaded to and retrieved from.

    Since blobs are uniquely identified by their hash, maintaining one
    global blobstore helps reduce data redundancy. However, users can
    run their own datadex service. (The index and blobstore are tied
    together to ensure consistency. Please do not publish datasets to
    an index if blobs aren't in that index)

    data can use any remote blobstore you wish. (For now, you have to
    recompile, but in the future, you will be able to) Just change the
    datadex configuration variable. Or pass in "-s <url>" per command.

    (data-blob is part of the plumbing, lower level tools.
    Use it directly if you know what you're doing.)