DVC Dataset Registry

This DVC Data Registry is a centralized place to manage raw data files for use in other example DVC projects, such as https://github.com/iterative/example-get-started.
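As a rough illustration of how another project might consume a dataset from this registry, the sketch below wraps `dvc get` in a small shell function. The helper name `fetch_dataset` and the variable `REGISTRY_URL` are hypothetical and not part of DVC; the URL is this repository's, and it is assumed that `dvc` is installed and on the `PATH`.

```shell
# URL of this registry repository on GitHub.
REGISTRY_URL="https://github.com/iterative/dataset-registry"

# fetch_dataset is a hypothetical helper, not a DVC command.
#   $1: path of the dataset inside the registry (e.g. get-started/data.xml)
#   $2: destination path in the current directory
fetch_dataset() {
    # `dvc get` downloads the file without tracking it in the current project.
    dvc get "$REGISTRY_URL" "$1" -o "$2"
}
```

Calling `fetch_dataset get-started/data.xml data.xml` would download that file into the current directory. Inside an existing DVC project, `dvc import` takes the same arguments and additionally records where the data came from, which is probably the better fit when the dataset should stay linked to the registry.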

Installation

Start by cloning the project:

$ git clone https://github.com/iterative/dataset-registry
$ cd dataset-registry

This DVC project comes with a preconfigured DVC remote that holds all of the datasets. It is a read-only HTTP remote.

$ dvc remote list
storage https://remote.dvc.org/dataset-registry

Important: To be able to push to the default remote, overwrite it with:

$ dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry

This requires the corresponding S3 credentials to be configured locally.

Testing data synchronization locally

If you'd like to test commands such as dvc push, which require write access to the remote storage, the easiest way is to set up a "local remote". This kind of remote is located in the local file system, but is external to the DVC project:

$ mkdir -p /tmp/dvc-storage
$ dvc remote add local /tmp/dvc-storage

You should now be able to run:

$ dvc push -r local
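The steps above can be bundled into one function so the setup is repeatable. This is only a sketch: `setup_local_remote` is a hypothetical helper, not a DVC command, and it assumes it is run from inside the cloned project with `dvc` installed. It passes `-f` to `dvc remote add` so that rerunning it overwrites an existing remote named `local` instead of failing.

```shell
# setup_local_remote is a hypothetical helper, not a DVC command.
#   $1: directory to use as storage (defaults to /tmp/dvc-storage)
setup_local_remote() {
    storage_dir="${1:-/tmp/dvc-storage}"

    # Create the storage directory; -p makes this a no-op if it already exists.
    mkdir -p "$storage_dir" || return 1

    # Register the directory as a remote named "local", replacing any previous one.
    dvc remote add -f local "$storage_dir" || return 1

    # Upload the project's cached data to the new remote.
    dvc push -r local
}
```

Running `setup_local_remote` with no arguments uses `/tmp/dvc-storage`; passing another directory, for example `setup_local_remote "$HOME/dvc-storage"`, keeps the storage somewhere that survives a reboot.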

Datasets

The folder structure of this project groups datasets by the external projects they belong to. After cloning the repository and running dvc pull to download the data under DVC control, the workspace should look like this:

$ tree
.
├── README.md
├── get-started
│   └── data.xml.dvc  # Dataset used in iterative/example-get-started
├── mnist
│   └── raw.dvc       # Dataset used in iterative/dvc-get-started
└── fashion-mnist
    └── raw.dvc       # Dataset used in iterative/dvc-get-started