rufuspollock / climate-negotiations

Information on the UNFCCC climate negotiations using the Earth Negotiations Bulletin from the IISD
https://rufuspollock.github.io/climate-negotiations/

First pass on storing the scraped raw text #2

Open rufuspollock opened 8 years ago

rufuspollock commented 8 years ago

As a first pass I'd suggest we store the raw text in a decent form in this repo.

SciPo already has semi-structured raw text based on a scrape of the ENB (perhaps with some corrections?).

I suggest we do not store this SciPo text as-is, but transform it a bit into nice markdown and then store that.

Why?


---
title:
id:    # e.g. 1205000e where file was 1205000e.txt
abstract: 
date:
url:     # source url of ENB from which text came

---

text goes here in markdown form

I suggest we therefore get rid of the odd quasi-HTML structure (where is this from?) and replace it with markdown:

  ::H1::§ § WORKING GROUP I§
  ::BODY::§ § Working Group I
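A minimal sketch of that conversion, assuming each raw line starts with a `::TAG::` marker followed by `§` separators as in the two sample lines above. Only `H1` and `BODY` are confirmed by the sample; the other entries in `TAG_PREFIX` and the function names are hypothetical.

```python
import re

# Assumed mapping from quasi-HTML tags to markdown prefixes. Only H1 and BODY
# appear in the sample above; H2 and H3 are guesses.
TAG_PREFIX = {"H1": "# ", "H2": "## ", "H3": "### ", "BODY": ""}

# "::TAG::" at the start of a line, then any mix of whitespace and "§".
TAG_LINE = re.compile(r"^\s*::([A-Z0-9]+)::[\s§]*(.*)$")


def line_to_markdown(line: str) -> str:
    """Convert one raw line to markdown; untagged lines are only trimmed."""
    match = TAG_LINE.match(line)
    if match is None:
        return line.strip()
    tag, text = match.groups()
    text = text.rstrip(" §\t")  # drop trailing separator debris
    if not text:
        return ""
    # Unknown tags fall back to plain paragraph text.
    return TAG_PREFIX.get(tag, "") + text


def to_markdown(raw: str) -> str:
    """Convert a whole raw document, one markdown block per non-empty line."""
    blocks = [line_to_markdown(line) for line in raw.splitlines()]
    return "\n\n".join(block for block in blocks if block) + "\n"
```

With the two sample lines as input, `to_markdown` gives a `# WORKING GROUP I` heading followed by a `Working Group I` paragraph.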

Info Architecture

/enb/{id}.md/

Where {id} is the name of the original txt file without the .txt extension.
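A rough sketch of how a file could be rendered and placed under this layout. The helper names (`front_matter`, `target_path`, `render`) are hypothetical, and writing values with `json.dumps` is an assumption: it produces double-quoted strings, which YAML parsers should also accept, so characters such as `:` or `#` inside a title do not break the front matter.

```python
import json
from pathlib import PurePosixPath


def front_matter(meta: dict) -> str:
    """Build a YAML front matter block with every value double-quoted."""
    lines = [
        f"{key}: {json.dumps(str(value), ensure_ascii=False)}"
        for key, value in meta.items()
    ]
    return "---\n" + "\n".join(lines) + "\n---\n"


def target_path(txt_name: str) -> str:
    """Map the original file name to its location, e.g. enb/1205000e.md."""
    doc_id = PurePosixPath(txt_name).stem
    return str(PurePosixPath("enb") / f"{doc_id}.md")


def render(txt_name: str, title: str, abstract: str, date: str, url: str, body: str) -> str:
    """Combine the front matter fields proposed above with the markdown body."""
    meta = {
        "title": title,
        "id": PurePosixPath(txt_name).stem,
        "abstract": abstract,
        "date": date,
        "url": url,
    }
    return front_matter(meta) + "\n" + body.strip() + "\n"
```

Quoting every value is blunter than quoting only when needed, but it keeps the writer simple and avoids special cases.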

Asides

Question: does this make things harder later, e.g. when we want to extract sections for tagging? I'm not sure it really does: we can parse the markdown to HTML and then do the sectioning (the current txt structure does not really give us sections anyway ...)
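To illustrate that sectioning stays feasible, here is a small sketch that splits a markdown body on its headings. It is a simplification of the idea above: it works on the markdown directly rather than on parsed HTML, it does not handle `#` lines inside fenced code blocks, and `split_sections` is a hypothetical name.

```python
import re

# ATX heading: one to six "#" characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def split_sections(markdown: str) -> list:
    """Return one dict per section with its heading level, title and text."""
    sections = []
    current = {"level": 0, "title": "", "lines": []}

    def flush():
        # Keep a section only if it has a title or some non-blank text.
        if current["title"] or any(line.strip() for line in current["lines"]):
            sections.append({
                "level": current["level"],
                "title": current["title"],
                "text": "\n".join(current["lines"]).strip(),
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"level": len(match.group(1)), "title": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return sections
```

Text that appears before the first heading comes back as a section with level 0 and an empty title.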

pauloborges commented 8 years ago

@rgrp, I've updated the script and run all the raw files through it. Check the resulting script and markdown on my fork here.

If you think the code is good, I'll move it to the main repo.

rufuspollock commented 8 years ago

@pauloborges looks good. I've fixed one issue with ":" in frontmatter items (not allowed unless quoted). However, there are still a few issues. I also see some non-UTF-8 control codes, for example in enb/enb12395e/index.md. Could we look at these and work out whether we can fix them in some way?

Here are the errors:

/enb/enb12393e/index.md: (): control characters are not allowed at line 1 column 1
/enb/enb12395e/index.md: (): control characters are not allowed at line 1 column 1
/enb/enb12615e/index.md: (): control characters are not allowed at line 1 column 1
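One way to investigate could be to list where the offending characters sit before deciding how to fix them. The sketch below is an assumption about what the parser rejects (C0 controls other than tab, newline and carriage return, plus DEL and the C1 range), and the function names are hypothetical.

```python
import re

# Control characters that commonly break YAML parsing: C0 controls except
# tab, newline and carriage return, plus DEL and the C1 range.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def find_control_chars(text: str) -> list:
    """Return (line, column, code point) for every control character found."""
    hits = []
    # Split on "\n" only: str.splitlines() would also break on some of the
    # control characters we are trying to locate.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in CONTROL.finditer(line):
            hits.append((lineno, match.start() + 1, f"U+{ord(match.group()):04X}"))
    return hits


def strip_control_chars(text: str) -> str:
    """Remove the control characters, leaving all other text untouched."""
    return CONTROL.sub("", text)
```

Reporting first seems safer than stripping blindly: code points in the C1 range often mean Windows-1252 text was decoded as Latin-1, in which case re-decoding the source would preserve quotes and dashes that stripping would lose.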

pauloborges commented 8 years ago

Still investigating this...