A collection of interesting datasets and the tools to convert them into ready-to-use formats.
Requires Python 3.8+.
git clone https://github.com/saulpw/readysetdata.git
cd readysetdata
Then from within the repository,
make setup
or
pip install .
or
python3 setup.py install
Output is generated for all available formats and put in the OUTPUT directory (output/ by default).
Size and time estimates are for JSONL output on a small instance.
make movielens (150MB, 3 tables, 5 minutes) (2019)
make imdb (20GB, 7 tables, 1 hour; updated daily)
make geonames (500MB, 2 tables, 10 minutes; updated quarterly)
make wikipedia (2.5GB, 3800+ categories, 12 hours; updated monthly)
See results immediately as they accumulate in output/wp-infoboxes.
make tpch (500MB, 8 tables, 20 seconds; generated randomly)
make fakedata (13MB, 3 tables, 5 seconds; generated randomly)
All available formats will be output by default.
Specify a subset of formats by setting the FORMATS environment variable, or pass -f <formats> to individual scripts. Separate multiple formats with a comma (,).
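As a rough illustration of the format selection described above, the following sketch parses a comma-separated list from either a -f option or the FORMATS environment variable. It is not the repository's code: the function names are hypothetical, and the choices that -f takes precedence over FORMATS and that an empty selection means "all formats" are assumptions.

```python
import argparse
import os


def parse_formats(spec):
    """Split a comma-separated format list such as "parquet,sqlite".

    Returns None for an empty or missing spec, which this sketch treats
    as "all available formats" (an assumption based on the README text).
    """
    if not spec:
        return None
    formats = [part.strip() for part in spec.split(",") if part.strip()]
    return formats or None


def selected_formats(argv=None, environ=None):
    """Resolve the format list from -f <formats> or the FORMATS variable.

    Assumption: the command-line option wins over the environment.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-f", dest="formats", default=None)
    args, _unused = parser.parse_known_args(argv)
    return parse_formats(args.formats or environ.get("FORMATS"))
```

For example, `selected_formats(["-f", "duckdb,sqlite"], {})` yields `["duckdb", "sqlite"]`, while calling it with no option and no FORMATS variable yields `None`.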
parquet
arrow and arrows
duckdb
sqlite
The helper scripts live in the scripts/ directory. Some of them require the readysetdata module to be installed. For the moment, set PYTHONPATH=. and run them from the top-level directory.
remote-unzip.py <url> <filename>
Extract <filename> from the .zip file at <url> and stream it to stdout. Only the one file is downloaded; the entire .zip does not need to be downloaded.
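One way to read a single member without fetching the whole archive is to give Python's zipfile module a seekable file object that turns each read into an HTTP Range request. The sketch below shows that idea; it is not necessarily how remote-unzip.py is implemented. The names HttpRangeFile, http_fetcher, remote_size, and extract_member are hypothetical, and the approach assumes the server reports Content-Length and honors Range requests.

```python
import io
import shutil
import urllib.request
import zipfile


class HttpRangeFile(io.RawIOBase):
    """Read-only seekable file whose reads are served by fetch(start, end).

    fetch must return the bytes from start to end inclusive.
    """

    def __init__(self, size, fetch):
        super().__init__()
        self.size = size
        self.fetch = fetch
        self.pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self.pos, io.SEEK_END: self.size}[whence]
        if base + offset < 0:
            raise OSError("negative seek position")
        self.pos = base + offset
        return self.pos

    def read(self, n=-1):
        remaining = self.size - self.pos
        if n is None or n < 0 or n > remaining:
            n = remaining
        if n <= 0:
            return b""
        data = self.fetch(self.pos, self.pos + n - 1)
        self.pos += len(data)
        return data


def http_fetcher(url):
    """Return a fetch(start, end) function that issues HTTP Range requests."""
    def fetch(start, end):
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:  # server ignored the Range header
                raise OSError("server does not support range requests")
            return resp.read()
    return fetch


def remote_size(url):
    """Ask the server for the archive size with a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])


def extract_member(fileobj, name, out):
    """Copy one archive member to the binary stream out."""
    with zipfile.ZipFile(fileobj) as zf, zf.open(name) as member:
        shutil.copyfileobj(member, out)
```

With these pieces, `extract_member(HttpRangeFile(remote_size(url), http_fetcher(url)), filename, sys.stdout.buffer)` would stream one member to stdout. zipfile reads the directory at the end of the archive first, then only the byte range of the requested member.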
download.py <url>
Download from <url> and stream to stdout. The data for e.g. https://example.com/path/to/file.csv will be cached at cache/example.com/path/to/file.csv.
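The URL-to-cache-path mapping in the example above can be sketched as follows. This is an illustration only: cache_path and stream_cached are hypothetical names, and this version downloads the file completely before streaming it, which may differ from what download.py actually does.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def cache_path(url, cache_dir="cache"):
    """Map a URL to <cache_dir>/<host>/<path>, as in the README example."""
    parts = urlparse(url)
    return Path(cache_dir) / parts.netloc / parts.path.lstrip("/")


def stream_cached(url, out, cache_dir="cache"):
    """Write the contents of url to the binary stream out, using the cache."""
    path = cache_path(url, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # Download to a temporary name so a partial file is never
        # mistaken for a complete cache entry.
        partial = path.with_name(path.name + ".part")
        with urllib.request.urlopen(url) as resp, open(partial, "wb") as f:
            shutil.copyfileobj(resp, f)
        partial.rename(path)
    with open(path, "rb") as f:
        shutil.copyfileobj(f, out)
```

Calling `cache_path("https://example.com/path/to/file.csv")` gives the path cache/example.com/path/to/file.csv. This sketch ignores query strings, so URLs that differ only in their query would share a cache entry.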
xml2json.py <tag>
Parse XML from stdin, and emit JSONL to stdout for the given <tag>.
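A streaming conversion of this kind can be sketched with the standard library's incremental XML parser. The mapping from an element to a JSON object below (attributes and child tags become keys, repeated tags become lists, leftover text goes under "#text") is an assumption chosen for illustration, not the format xml2json.py necessarily emits. The function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element to a dict (illustrative convention only)."""
    result = dict(elem.attrib)
    for child in elem:
        if len(child) or child.attrib:
            value = element_to_dict(child)
        else:
            value = (child.text or "").strip()
        if child.tag in result:
            # A repeated tag is collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    text = (elem.text or "").strip()
    if text:
        result["#text"] = text
    return result


def xml_to_jsonl(source, tag, out):
    """Write one JSON line to out for each <tag> element found in source."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            out.write(json.dumps(element_to_dict(elem)) + "\n")
            elem.clear()  # drop the converted subtree to limit memory use
```

A command-line wrapper could call `xml_to_jsonl(sys.stdin.buffer, tag, sys.stdout)`. Note that ElementTree reports namespaced tags in the form `{namespace}name`, so this sketch would need the full qualified name for such documents.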
demux-jsonl.py <field>
Parse JSONL from stdin, and append each JSON line verbatim to the file <field-value>.jsonl, named after that record's value for <field>.
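The demultiplexing step can be sketched as below. This is a minimal illustration rather than the script's actual code: demux_jsonl and its outdir parameter are hypothetical, records missing the field are skipped by assumption, and the file is reopened for every line to stay simple and to avoid holding thousands of file handles open.

```python
import json
from pathlib import Path


def demux_jsonl(lines, field, outdir="."):
    """Append each JSONL line to <outdir>/<value of field>.jsonl.

    Returns the sorted list of field values that were seen.
    """
    outdir = Path(outdir)
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if field not in record:
            continue  # assumption: skip records without the field
        value = str(record[field])
        seen.add(value)
        # Append the original line unchanged, adding a newline if missing.
        with open(outdir / f"{value}.jsonl", "a", encoding="utf-8") as f:
            f.write(line if line.endswith("\n") else line + "\n")
    return sorted(seen)
```

A wrapper could call `demux_jsonl(sys.stdin, field)`. Field values containing path separators would need sanitizing before being used as filenames, which this sketch does not attempt.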
Created and curated by Saul Pwanson. Licensed for use under Apache 2.0.
Enabled by Apache Arrow and Voltron Data.
Toponymic information is based on the Geographic Names Database, containing official standard names approved by the United States Board on Geographic Names and maintained by the National Geospatial-Intelligence Agency. More information is available at the Resources link at www.nga.mil. The National Geospatial-Intelligence Agency name, initials, and seal are protected by 10 United States Code Section 425.