mila-iqia / fuel

A data pipeline framework for machine learning

Automated data downloading #6

Closed bartvm closed 9 years ago

bartvm commented 9 years ago

Copied from https://github.com/bartvm/blocks/issues/105:

We could use https://github.com/jaberg/skdata to download popular datasets and even to load them into memory.

Integrating with skdata might not be necessary anymore now that we have a separate data framework. But we should still come up with a solution one way or the other.

vdumoulin commented 9 years ago

Here's an idea: we could require that every dataset provide a list of the files in the $FUEL_DATA_PATH root directory that it needs as its data source, along with checksums for those files. Optionally, a dataset could also point to a method that can be used to retrieve missing files.

We would then write a module that checks whether all required files for a given dataset are available and acts accordingly: for instance, fetching missing files through the dataset's retrieval method, or flagging files whose checksums don't match (see the sketch below).

Some of the module's behaviour could be configured. For instance, files that do not check out could raise a warning without being re-downloaded, and that warning could also be disabled entirely.
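A minimal sketch of what that module could look like, assuming the usual $FUEL_DATA_PATH layout; the function name `check_dataset_files` and the `download` callback are hypothetical, not existing API:

```python
# Hypothetical sketch -- names are illustrative, not Fuel's actual API.
import hashlib
import os


def check_dataset_files(data_path, required_files, checksums, download=None):
    """Check that a dataset's source files exist and match their checksums.

    `required_files` lists filenames expected in the $FUEL_DATA_PATH root
    (`data_path`), `checksums` maps each filename to its expected SHA-1
    digest, and `download`, if given, is called to fetch a missing file.
    """
    for filename in required_files:
        path = os.path.join(data_path, filename)
        if not os.path.isfile(path):
            if download is None:
                raise IOError("{} is missing and cannot be "
                              "retrieved".format(filename))
            download(filename)
        with open(path, 'rb') as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        if digest != checksums[filename]:
            raise IOError("checksum mismatch for {}".format(filename))
```

Whether a mismatch raises, warns, or triggers a re-download would then be a matter of configuration, as described above.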

bartvm commented 9 years ago

I think that pretty much covers everything. I agree that it's important for everything to be configurable: whether datasets should be automatically downloaded, whether to accept files when the checksum doesn't match, etc.

Downloading should be pretty straightforward with e.g. the requests library. For some datasets, though, we might want checksums without being able to provide a download source (e.g. datasets that aren't public, like the Penn Treebank).
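A rough sketch of that download step with requests, verifying a checksum while streaming; the helper and its signature are made up for illustration:

```python
# Illustrative download helper -- not Fuel's actual implementation.
import hashlib

import requests


def download_file(url, path, expected_sha1=None, chunk_size=8192):
    """Stream a file to disk, optionally verifying its SHA-1 checksum."""
    sha1 = hashlib.sha1()
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size):
            f.write(chunk)
            sha1.update(chunk)
    if expected_sha1 is not None and sha1.hexdigest() != expected_sha1:
        raise IOError("checksum mismatch for {}".format(path))
```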

One question to raise is whether we should also automatically process the data. For MNIST we can just read the image files directly, since there is little overhead, but for larger image datasets it might make sense to load them into an HDF5 file once and read from that file from then on. Should we do that automatically? It might be harder to checksum these files, because a different h5py version might produce a different file.
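For illustration, the one-time conversion could be little more than the following h5py sketch; the 'features'/'targets' names and the gzip compression are assumptions, not a fixed convention:

```python
# Rough sketch of a one-time image -> HDF5 conversion; dataset names
# and options are assumptions for illustration.
import h5py
import numpy


def convert_to_hdf5(images, labels, path):
    """Write image and label arrays into a single HDF5 file."""
    with h5py.File(path, 'w') as f:
        f.create_dataset('features', data=numpy.asarray(images),
                         compression='gzip')
        f.create_dataset('targets', data=numpy.asarray(labels))
```

Note that, per the concern above, two such files built with different h5py versions may not be byte-identical even for the same data, so checksumming converted files is fragile.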

vdumoulin commented 9 years ago

Are fuel-download and fuel-convert considered sufficient to solve this issue, or do we still want fully-automated downloading?
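For reference, the two-step workflow those scripts provide looks like this (MNIST as an example, run with $FUEL_DATA_PATH set):

```
fuel-download mnist
fuel-convert mnist
```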

bartvm commented 9 years ago

Nope, I think this is good enough. Doing everything automagically just makes things needlessly complicated: e.g., multiple jobs launched at once would all start downloading simultaneously, or an incorrectly set data path would silently trigger downloads.