tlongers opened this issue 8 years ago (status: Open)
@cr3ative has a good cleaned-up version at https://github.com/cr3ative/chilcot-html, in case that's useful to compare (not sure how it was extracted, though).
@igorbrigadir This is the output of all the PDFs concatenated then exported to HTML using Acrobat Pro CC. The index page has an extra search box pointed at GitHub, but that's about it.
Thanks! That might be a good pipeline: it could be easier to clean up the HTML result than the text extracted from the PDFs, by selecting elements (footers, headers, bullets) by style.
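As a rough sketch of what "selecting by style" could look like, here is a small filter built on Python's standard `html.parser`. The class names `page-header` and `page-footer` are hypothetical; I haven't checked what classes the Acrobat export actually emits, so they would need to be swapped for whatever the real HTML uses. It also assumes the markup is reasonably well nested.

```python
from html.parser import HTMLParser

# Hypothetical class names: inspect the real export to find the ones to drop.
SKIP_CLASSES = {"page-header", "page-footer"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class BodyTextExtractor(HTMLParser):
    """Collects text, ignoring anything nested inside a skipped element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Start skipping on a matching class, or go one level deeper if already skipping.
        if self.skip_depth or classes & SKIP_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_body_text(html):
    """Return the text of `html` with skipped elements removed, one chunk per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `extract_body_text(open("index.html").read())` (file name assumed) would give plain text with the repeated page furniture removed, which could then be compared against the straight PDF-to-text output.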
@igorbrigadir good point. Can you let us know how you get on? I've taken a quick look at the @cr3ative version but haven't done an in-depth comparison against simple text extraction.
PDF-to-text extraction sometimes produces jumbled text, e.g.:
Original source:
Markdown formatted preview:
Should we correct the issue when formatting in MD, or attempt re-extraction and hope for a better outcome?