Here's an idea: we could require that all datasets provide a list of the files they need as their data source in the $FUEL_DATA_PATH root directory, along with checksums for those files. Optionally, the datasets could also point to a method that can be used to retrieve missing files.
We would then write a module that checks whether all required files for a given dataset are available, and does one of the following:

- If a required file is missing and the dataset provides a retrieval method, download it.
- If a file is present but its checksum doesn't match, rename it (e.g. by appending ".old") and attempt to re-download it.

Some of the module's behaviour could be configured. For instance, files that do not check out could raise a warning without being re-downloaded, and that warning could also be disabled completely. A rough sketch of what I have in mind is below.
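To make the idea concrete, here is a minimal sketch. None of these names exist in Fuel, the checksum is a placeholder, and the download hook is just an assumption about how retrieval could be plugged in:

```python
import hashlib
import os
import warnings

# Hypothetical registry: a dataset declares the files it needs (relative to
# the data path), their checksums, and optionally a function to fetch them.
MNIST_FILES = {
    'train-images-idx3-ubyte.gz': {
        'sha256': '0' * 64,  # placeholder, not the real digest
        'download': None,    # could point to a retrieval function
    },
}


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large files stay out of memory."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def check_files(files, data_path, redownload_on_mismatch=True):
    """Verify required files; rename bad ones to '.old' and re-fetch them."""
    for name, meta in files.items():
        path = os.path.join(data_path, name)
        if os.path.isfile(path) and sha256_of(path) != meta['sha256']:
            if not redownload_on_mismatch:
                warnings.warn('checksum mismatch for {}'.format(name))
                continue
            os.rename(path, path + '.old')  # keep the suspicious copy around
        if not os.path.isfile(path):
            if meta['download'] is None:
                raise IOError('{} is missing and has no download source'.format(name))
            meta['download'](path)  # fetch the file into place
```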
I think that pretty much covers everything. I agree that it's important that everything should be configurable: whether datasets should be automatically downloaded, whether or not to accept files when the checksum doesn't match, etc.
Downloading should be pretty straightforward with e.g. the requests library. For some datasets we might want to have checksums without being able to provide a download source though (e.g. for datasets that aren't public, like Penn Treebank).
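Something along these lines would probably do; a sketch assuming requests with a streamed SHA-256 check (the function name and signature are made up for illustration):

```python
import hashlib

import requests


def download(url, path, expected_sha256=None, chunk_size=1 << 20):
    """Stream a file to disk and verify its checksum on the fly."""
    digest = hashlib.sha256()
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            digest.update(chunk)
            f.write(chunk)
    if expected_sha256 is not None and digest.hexdigest() != expected_sha256:
        raise IOError('checksum mismatch for {}'.format(path))
```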
One question to raise is whether we should also automatically process the data. For MNIST we can just read the image files directly, because there is little overhead, but for larger image datasets it might make sense to load them into an HDF5 file once and from then on read that file instead. Should we do that automatically? It might be harder to checksum these files, because a different h5py version might just result in a different file.
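For the HDF5 case, the one-off conversion could look roughly like this; a sketch only, where the function, dataset name, and image layout are assumptions rather than existing Fuel code:

```python
import glob
import os

import h5py
import numpy
from PIL import Image


def images_to_hdf5(image_dir, output_path, size=(32, 32)):
    """Pack a directory of PNGs into one HDF5 file for faster later reads."""
    paths = sorted(glob.glob(os.path.join(image_dir, '*.png')))
    with h5py.File(output_path, 'w') as f:
        features = f.create_dataset(
            'features', (len(paths), size[1], size[0], 3), dtype='uint8')
        for i, path in enumerate(paths):
            image = Image.open(path).convert('RGB').resize(size)
            features[i] = numpy.asarray(image, dtype='uint8')
```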
Are `fuel-download` and `fuel-convert` considered sufficient to solve this issue, or do we still want fully-automated downloading?
Nope, I think this is good enough. Doing everything automagically just makes things needlessly complicated (e.g. what if you launch multiple jobs and they all start downloading simultaneously, or it might just end up downloading things whenever you set the data path incorrectly).
Copied from https://github.com/bartvm/blocks/issues/105
Integrating with `skdata` might not be necessary anymore now that we have a separate data framework. But we should still come up with a solution one way or the other.