dan-zheng opened 3 years ago
Prior Work:
NLTK datasets (downloading, unzipping, parsing)
I suggest splitting out transforms into a separate issue, and possibly also separating out parsing into a separate issue. Both are huge.
Other prior work: torchvision. Still, I would say that this is somewhat low priority, because I don't expect we'll be able to make a big splash in the hyper-optimized space of standard ML models.
That makes sense, thanks!
@dan-zheng I think a nice option here would be to write bindings to https://en.wikipedia.org/wiki/Apache_Arrow
https://github.com/huggingface/datasets has a ton of datasets in this form. It seems a bit crazy to rewrite this sort of infrastructure for each language.
Motivation

Create a structured datasets library within Dex: `lib/datasets.dx`.

The library should enable straightforward usage of machine learning datasets, including the following:

- Downloading datasets to a local directory (`/tmp` or `~/.dex/datasets/...`).
- Loading datasets into memory, e.g. as a `List (inputSize => Float & labelSize => Int)`.
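As a rough Python analogue of that in-memory representation (the names below are hypothetical, not part of the proposed library): each element of `List (inputSize => Float & labelSize => Int)` pairs a fixed-size vector of floats with a fixed-size vector of ints.

```python
from typing import List, Tuple

# Rough analogue of `List (inputSize => Float & labelSize => Int)`:
# each example pairs a float feature vector with integer label(s).
Example = Tuple[List[float], List[int]]

dataset: List[Example] = [
    ([0.0, 0.5, 1.0], [7]),
    ([0.2, 0.4, 0.6], [3]),
]

inputs, labels = dataset[0]
print(len(inputs), labels)  # 3 [7]
```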
Implementation ideas

- IO effect: `wget` a named dataset with a library-hardcoded URL to `~/.dex/datasets/...` if it doesn't already exist.
- `Accum` effect for MapReduce-like functionality and potential for parallelism.

Prior work
- `createDirectoryIfMissing(at:)`
- `download(from:to:)`
- `extractArchive(at:to:fileExtension:deleteArchiveWhenDone:)`
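The download-and-cache behavior described under "Implementation ideas" can be sketched as follows; this is a minimal Python illustration, not Dex, and the function name `fetch_dataset` is hypothetical:

```python
import urllib.request
from pathlib import Path

def fetch_dataset(name: str, url: str, cache_dir: Path) -> Path:
    """Download `url` into `cache_dir/name` unless a cached copy exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Only hit the network when the dataset is not already cached.
        urllib.request.urlretrieve(url, target)
    return target
```

In the Dex version this would live behind an IO effect, with the URL hardcoded per named dataset and `cache_dir` defaulting to the `~/.dex/datasets/...` directory mentioned above.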