official-inquiries / uk-iraq-inquiry

Parliament's inquiry into the UK's involvement in the Iraq War

Glitchy source text - correct, or re-extract? #4

Open tlongers opened 8 years ago

tlongers commented 8 years ago

PDF-to-text extraction sometimes produces jumbled text, e.g.:

Original source: the-report-of-the-iraq-inquiry_introduction_pdf__page_2_of_19_

Markdown formatted preview:

introduction_md__80__

Should we correct these issues while formatting the Markdown, or attempt re-extraction and hope for a better result?

igorbrigadir commented 8 years ago

@cr3ative has a good cleaned-up version at https://github.com/cr3ative/chilcot-html in case that's useful to compare (not sure how it was extracted, though).

cr3ative commented 8 years ago

@igorbrigadir This is the output of all the PDFs concatenated then exported to HTML using Acrobat Pro CC. The index page has an extra search box pointed at GitHub, but that's about it.

igorbrigadir commented 8 years ago

Thanks! That might be a good pipeline: it could be easier to clean up the HTML output than the text extracted from the PDF, since elements such as footers, headers, and bullets can be selected by style.
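A minimal sketch of what selecting by style could look like, using only Python's standard-library `html.parser`. The class names `page-header` and `page-footer` are hypothetical placeholders, not taken from the actual Acrobat export; the real markup would need to be inspected to find which classes or inline styles mark headers and footers.

```python
from html.parser import HTMLParser

# Hypothetical class names: check the real exported HTML for the actual ones.
DROP_CLASSES = {"page-header", "page-footer"}

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class BodyTextExtractor(HTMLParser):
    """Collects text, skipping any element whose class is in drop_classes."""

    def __init__(self, drop_classes):
        super().__init__(convert_charrefs=True)
        self.drop_classes = set(drop_classes)
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a dropped element: track nesting only.
            self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.drop_classes.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_body_text(html, drop_classes=DROP_CLASSES):
    """Return the text of `html` with dropped elements removed, one chunk per line."""
    parser = BodyTextExtractor(drop_classes)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        '<p class="page-header">Running header</p>'
        '<p class="body">Body paragraph text.</p>'
        '<p class="page-footer">12</p>'
    )
    print(extract_body_text(sample))
```

This assumes well-formed, properly nested HTML; an export with unclosed tags would likely need a more forgiving parser.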

rufuspollock commented 8 years ago

@igorbrigadir good point. Can you let us know how you get on? I've taken a quick look at the @cr3ative version but haven't done an in-depth comparison against simple text extraction.