dan-zheng opened 3 years ago
Prior Work:
NLTK datasets (downloading, unzipping, parsing)
I suggest splitting out transforms into a separate issue, and possibly also separating out parsing into a separate issue. Both are huge.
Other prior work: torchvision. Still, I would say that this is somewhat low priority, because I don't expect we'll be able to make a big splash in the hyper-optimized space of standard ML models.
That makes sense, thanks!
@dan-zheng I think a nice option here would be to write bindings to https://en.wikipedia.org/wiki/Apache_Arrow
https://github.com/huggingface/datasets has a ton of datasets in this form. It seems a bit crazy to rewrite this sort of infrastructure for each language.
Motivation

Create a structured datasets library within Dex: `lib/datasets.dx`.

The library should enable straightforward usage of machine learning datasets, including the following:

- Downloading datasets to a local directory (`/tmp` or `~/.dex/datasets/...`).
- Loading datasets into memory, e.g. as a `List (inputSize => Float & labelSize => Int)`.
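As a rough Python analogue of that in-memory representation (the names below are hypothetical, not part of the proposed library): each element of `List (inputSize => Float & labelSize => Int)` pairs a fixed-size vector of floats with a fixed-size vector of ints.

```python
from typing import List, Tuple

# Rough analogue of `List (inputSize => Float & labelSize => Int)`:
# each example pairs a float feature vector with integer label(s).
Example = Tuple[List[float], List[int]]

dataset: List[Example] = [
    ([0.0, 0.5, 1.0], [7]),
    ([0.2, 0.4, 0.6], [3]),
]

inputs, labels = dataset[0]
print(len(inputs), labels)  # 3 [7]
```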
Implementation ideas

- IO effect: `wget` a named dataset with a library-hardcoded URL to `~/.dex/datasets/...` if it doesn't already exist.
- `Accum` effect for MapReduce-like functionality and potential for parallelism.

Prior work
- `createDirectoryIfMissing(at:)`
- `download(from:to:)`
- `extractArchive(at:to:fileExtension:deleteArchiveWhenDone:)`
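The download-and-cache behavior described under "Implementation ideas" can be sketched as follows; this is a minimal Python illustration, not Dex, and the function name `fetch_dataset` is hypothetical:

```python
import urllib.request
from pathlib import Path

def fetch_dataset(name: str, url: str, cache_dir: Path) -> Path:
    """Download `url` into `cache_dir/name` unless a cached copy exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Only hit the network when the dataset is not already cached.
        urllib.request.urlretrieve(url, target)
    return target
```

In the Dex version this would live behind an IO effect, with the URL hardcoded per named dataset and `cache_dir` defaulting to the `~/.dex/datasets/...` directory mentioned above.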