Refactor to improve usability of the corpus

anjackson commented 9 years ago

Based on this feedback, we should consider refactoring the corpus.

Certainly the tools should be moved out or kept in a separate top-level folder. I think they have already been copied into the 'fidget' codebase, so I can check that and then remove them from here.

While I appreciate that the metadata files clutter things up as they are, I still like the idea of keeping the metadata close to the files. This is because it helps track who contributed what, and makes updating the metadata easier. Rather than separating them completely, how about a compromise?

Instead of putting metadata alongside each individual file, we collect it at the top level of each collection, and we make the top level of each collection consistent. Using the variations collection as an example, we switch to a standard layout like this:

ebooks/README.md    - Contains human-readable textual information
ebooks/metadata.md  - Contains metadata about the items in this collection
ebooks/data/        - Contains the actual sample files

So, you can reliably get to the test files by looking at */data/ from whatever the parent directory is.
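
As a rough sketch of what that lookup could look like in practice, the following Python function collects every file under `*/data/` for a given parent directory, grouped by collection. The function name `list_sample_files` is hypothetical and not part of the existing tools; it only assumes the layout proposed above.

```python
from pathlib import Path


def list_sample_files(root):
    """Return {collection name: [sample files]} for every <root>/<collection>/data/."""
    samples = {}
    for data_dir in sorted(Path(root).glob("*/data")):
        # Skip anything named "data" that is not a directory.
        if not data_dir.is_dir():
            continue
        # Include files in nested sub-folders of data/ as well.
        files = sorted(p for p in data_dir.rglob("*") if p.is_file())
        samples[data_dir.parent.name] = files
    return samples
```

Because README.md and metadata.md sit beside `data/` rather than inside it, they never show up in the result, so tools can be pointed at the samples without filtering out metadata.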

I'd still like the option to include tool output, as we can't assume that we will be able to reliably re-run tools in the future. Following Ross's suggestion, we could arrange the top level like this:

/corpora/       - Parent folder for corpora, e.g. /corpora/ebooks/
/scripts/       - Scripts that run tools and other processes
/tool-results/  - Contains sample tool output

There are some other points I'd like to revisit.

Firstly, I'd like to be able to include 3rd party corpora, e.g. as git submodules. For example, there are lots of interesting files in the test corpus that someone set up for the fine-free-file command: https://git.fedorahosted.org/cgit/file-tests.git/tree/db (see also https://fedorahosted.org/file-tests/). We can't necessarily distribute these corpora, but it would be nice to make them easy to plug in.

Secondly, the idea was always that the metadata would be used to generate static web pages that let you explore the content (e.g. hosted via GitHub Pages). The longer-term idea was to add a continuous integration hook (e.g. Travis-CI) that runs tools and tests over the corpus and adds the results to the generated pages. I'd be interested in knowing if anyone else is interested in that approach.
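
To make the static-page idea a little more concrete, here is a minimal sketch in Python that builds a single HTML index from the directory layout alone. It does not read metadata.md, since the format of that file has not been settled, and the function name `render_index` and the page title are assumptions made purely for illustration.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def render_index(corpora_root):
    """Build a minimal HTML index listing each collection and its sample files."""
    root = Path(corpora_root)
    parts = ["<html><body>", "<h1>Format corpus</h1>"]
    for collection in sorted(root.iterdir()):
        data_dir = collection / "data"
        # Only folders that follow the <collection>/data/ layout are listed.
        if not data_dir.is_dir():
            continue
        parts.append(f"<h2>{escape(collection.name)}</h2>")
        parts.append("<ul>")
        for sample in sorted(p for p in data_dir.rglob("*") if p.is_file()):
            # Relative, URL-quoted link so the page works when hosted statically.
            href = quote(sample.relative_to(root).as_posix())
            parts.append(f'<li><a href="{href}">{escape(sample.name)}</a></li>')
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

A CI job could write the returned string to an index.html file and publish it; tool results would need to be merged in separately once their format is agreed.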

anjackson commented 9 years ago

Note that any refactoring should be done on a fork first. Apart from anything else, SCAPE deliverables may be referencing individual resources in this data set, and so changing the structure would break those links.

anjackson commented 9 years ago

Note feedback thus far:

openpreserve / format-corpus