Open rufuspollock opened 8 years ago
@rgrp, I've updated the script and run all the raw files through it. Check the resulting script and markdown on my fork here.
If you think the code is good, I'll move it to the main repo.
@pauloborges looks good. I've fixed one issue with ":" in frontmatter items (not allowed unless quoted). However, there are still a few issues. I also see some non-UTF-8 control codes, for example in enb/enb12395e/index.md. Could we look at these and work out whether we can fix them in some way?
Here are the errors:
/enb/enb12393e/index.md: (
Still investigating this...
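To make the two problems above easier to chase down, here is a minimal sketch in Python. It assumes the frontmatter values are available as plain strings and that the files have already been decoded to text. The helper names `quote_frontmatter_value` and `find_control_chars` are hypothetical and are not taken from the script on the fork.

```python
import re

# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def quote_frontmatter_value(value):
    """Wrap a value in double quotes if it contains ':' (hypothetical helper)."""
    already_quoted = len(value) >= 2 and value.startswith('"') and value.endswith('"')
    if ":" in value and not already_quoted:
        # Escape backslashes and double quotes so the quoted value stays valid.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return value


def find_control_chars(text):
    """Return (line, column, code point) for each stray control character."""
    hits = []
    # Split on "\n" only: str.splitlines() would also break on form feeds
    # and similar characters, hiding exactly the ones we are looking for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in CONTROL_CHARS.finditer(line):
            hits.append((line_no, match.start() + 1, hex(ord(match.group()))))
    return hits
```

Running `find_control_chars` over each generated index.md would give a list of positions to inspect by hand before deciding whether the characters can simply be dropped.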
As a first pass I'd suggest we store the raw text in a decent form in this repo.
SciPo already have semi-structured raw text based on a scrape of ENB (perhaps with some corrections?).
I suggest we do not store this SciPo text as-is, but transform it a bit into clean markdown and then store that.
Why?
Document Structure
I suggest we therefore get rid of the odd quasi-HTML structure (where is this from?) and replace it with markdown:
Info Architecture
/enb/{id}.md/
Where {id} is the name of the original txt file minus the .txt extension.
Asides
Question: but does this make things harder later, e.g. when we want to extract sections for tagging? Not sure it really does - we can parse the markdown to HTML and then do the sectioning (the current txt structure does not really give us sections anyway ...)
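As a rough check that sectioning stays feasible once the text is markdown, here is a minimal sketch in Python. Note that it takes a shortcut: it splits directly on "#"-style headings instead of going via HTML, which would need a markdown library. The function name `split_sections` and the sample document are assumptions for illustration only.

```python
import re

# Lines of the form "# Title" up to "###### Title".
# Caveat: this does not skip "#" lines inside fenced code blocks.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def split_sections(markdown_text):
    """Split markdown into (level, title, body) tuples, one per heading.

    Text before the first heading is returned with level 0 and title None.
    """
    sections = []
    level, title, body = 0, None, []

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if title is not None or any(line.strip() for line in body):
            sections.append((level, title, "\n".join(body).strip()))

    for line in markdown_text.split("\n"):
        match = HEADING.match(line)
        if match:
            flush()
            level, title, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = "# Report\n\nIntro text.\n\n## Delegates\n\nBody text.\n"
    for section in split_sections(sample):
        print(section)
```

Each tuple could then be passed to whatever tagging step comes later, so storing markdown should not lose anything the current txt structure offers.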