This DVC Data Registry is a centralized place to manage raw data files for use in other example DVC projects, such as https://github.com/iterative/example-get-started.
Start by cloning the project:
$ git clone https://github.com/iterative/dataset-registry
$ cd dataset-registry
This DVC project comes with a preconfigured DVC remote that holds all of the datasets. It is a read-only HTTP remote.
$ dvc remote list
storage https://remote.dvc.org/dataset-registry
Important: To be able to push to the default remote, overwrite it with:
$ dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry
This requires the corresponding S3 credentials to be configured locally.
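Before switching to the S3 remote, it can help to check whether any AWS credentials are visible in the current environment. The snippet below is only a heuristic sketch and not a DVC feature: `has_aws_credentials` is a hypothetical helper name, and it looks only at the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` variables and the shared credentials file. Credentials supplied by other means (for example an instance role) would not be detected, and finding credentials does not guarantee write access to the bucket.

```shell
# Heuristic check (an assumption, not part of DVC): succeed when the standard
# AWS environment variables are set or a shared credentials file exists.
has_aws_credentials() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    return 0
  fi
  [ -f "${AWS_SHARED_CREDENTIALS_FILE:-${HOME:-}/.aws/credentials}" ]
}

if has_aws_credentials; then
  echo "AWS credentials found; the S3 remote override may be usable."
else
  echo "No AWS credentials found; pushing to the S3 remote would likely fail." >&2
fi
```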
If you'd like to test commands like dvc push that require write access to the remote storage, the easiest way is to set up a "local remote". This kind of remote is located in the local file system, but is external to the DVC project:
$ mkdir -p /tmp/dvc-storage
$ dvc remote add local /tmp/dvc-storage
You should now be able to run:
$ dvc push -r local
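The local-remote steps above can be collected into a small re-runnable script. This is a minimal sketch rather than part of the registry: `DVC_STORAGE` is a hypothetical variable name, and `--force` is used so that running the script again overwrites an existing remote named `local` instead of failing. The DVC commands are skipped when `dvc` is not installed or the current directory is not a DVC project.

```shell
# Hypothetical variable: where the local remote should live (outside the project).
DVC_STORAGE="${DVC_STORAGE:-/tmp/dvc-storage}"

# Create the storage directory; -p makes this safe to repeat.
mkdir -p "$DVC_STORAGE"

if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
  # Register (or overwrite) the remote named "local", then push to it.
  dvc remote add --force local "$DVC_STORAGE"
  dvc push -r local
else
  echo "dvc not found or not inside a DVC project; skipping remote setup." >&2
fi
```

Run it from the root of the cloned dataset-registry project so that the `.dvc` directory is found.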
The folder structure of this project groups datasets by the external projects they pertain to.
After cloning and using dvc pull to download the data under DVC control, the workspace should look like this:
$ tree
.
├── README.md
├── get-started
│ └── data.xml.dvc # Dataset used in iterative/example-get-started
├── mnist
│ └── raw.dvc # Dataset used in iterative/dvc-get-started
├── fashion-mnist
│ └── raw.dvc # Dataset used in iterative/dvc-get-started