Documentation for loading custom dataset

RahulBhalley commented 4 years ago

Is there any documentation for loading custom datasets (in different domains) in S4TF, like we can do in PyTorch?

For instance, ImageFolder lets us load any dataset containing images structured in the following directories:

root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png

I looked at some protocols (for images) and structures (for text) for data loading, but they seem difficult to use without documentation. Any help?

BradLarson commented 4 years ago

The lack of documentation / generalization may be a result of how we arrived at the current designs. The dataset wrappers we use now started as a series of bespoke datasets that were coded for individual examples, which then were lifted out into reusable datasets, and then generalized at a higher level via the Epochs interface. The order emerged from a bottom-up consolidation of common parts as we found them.

Now that we have a shared abstraction in Epochs, it might be useful to start documenting datasets and the data pipeline in a more centralized location. There are the documentation comments in the Epochs components, but a dedicated guide would definitely help. I think we've been relying on the swift-models example datasets as templates that people can follow to port across their own specific datasets. That can work, but it doesn't explain the general process.

Additionally, a general loader that parses directories of files and automatically populates labels and images for classification would definitely be possible with the components we have now. I wonder if the Imagenette / Imagewoof datasets we have now could even be implemented on top of that and save some code. That might be worth looking into.
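To make the directory-parsing idea concrete, here is a minimal sketch of the scanning step alone. It is written in plain Python rather than Swift and is not part of swift-models, Epochs, or PyTorch. The function name `scan_image_folder`, the `IMAGE_EXTENSIONS` set, and the choice to number classes in sorted directory order are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical set of accepted extensions; adjust for the dataset at hand.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def scan_image_folder(root):
    """Return (class_names, samples) for a root/<class>/<file> layout.

    Hypothetical helper for illustration only, not an S4TF or PyTorch API.
    """
    root = Path(root)
    # Treat each immediate subdirectory as one class; sorting keeps the
    # label indices stable between runs.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for label, name in enumerate(class_names):
        for path in sorted((root / name).iterdir()):
            # Skip anything that is not a file with a known image extension.
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
                samples.append((path, label))
    return class_names, samples
```

The result is only a list of (path, label index) pairs. Decoding the images and batching them would still be left to whatever data pipeline sits on top.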

For now, the easiest way to build something for your custom dataset would be to start with the Imagenette dataset as a template and customize that for your specific dataset archive and directory hierarchy.

RahulBhalley commented 4 years ago

Okay. Would you like to have an approach similar to PyTorch's data loading with ImageFolder? I am interested in this approach because it's generally how labeled datasets are structured, even in the audio domain (I think Google's NSynth also follows a similar approach, structuring data with each directory representing a certain class). I can work on this and make a draft PR if you agree to merge such work.

8bitmp3 commented 4 years ago

@BradLarson Do you want to get the community involved in building the docs? Some of us, especially @rahulbhalley, have been testing S4TF things extensively. Maybe we can help.

BradLarson commented 4 years ago

@rahulbhalley - Due to how types are used with Epochs, I'm not sure how far this could be generalized, but I can definitely see a use case for creating a generalized image classification dataset loader that operated on directories. An optionally-supplied mapping between directories and labels for that would also make sense. If you had a workable design for this, I'd welcome it.
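One way the optionally-supplied mapping between directories and labels could behave is sketched below, in plain Python and not as a proposal for the actual Swift design. The function name `resolve_labels` and its parameters are hypothetical, and falling back to sorted-order indices when no mapping is given is an assumption.

```python
def resolve_labels(directory_names, directory_to_label=None):
    """Map each class directory name to a label.

    Hypothetical helper for illustration only. If no mapping is supplied,
    labels are assigned as indices in sorted directory order; otherwise
    every directory must have an entry in the supplied mapping.
    """
    names = sorted(directory_names)
    if directory_to_label is None:
        # Default: derive integer labels from the sorted directory names.
        return {name: index for index, name in enumerate(names)}
    # Fail early when the caller's mapping does not cover every directory.
    missing = [name for name in names if name not in directory_to_label]
    if missing:
        raise KeyError(f"no label supplied for directories: {missing}")
    return {name: directory_to_label[name] for name in names}
```

For example, `resolve_labels(["dog", "cat"])` derives the labels from the directory names, while `resolve_labels(["dog", "cat"], {"dog": "canine", "cat": "feline"})` uses the caller's labels instead.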

@8bitmp3 - Help with documentation is always appreciated. This seems like something that might take the form of a Readme in the Datasets directory or some other longer-form piece of documentation for Epochs and datasets work. We have bits of this scattered in headers, presentations, and other documents, and aggregating that into a central location would be beneficial.

RahulBhalley commented 3 years ago

S4TF has been archived so I'm closing this issue I created.

tensorflow / swift-models

Documentation for loading custom dataset #661