Open rufuspollock opened 11 years ago
See (very old) code at https://github.com/okfn/oed
Email to okfn-labs list about this on 20th June: http://lists.okfn.org/pipermail/okfn-labs/2013-June/000913.html
Lots of valuable feedback and follow-up, including from @stefanw:
Inspired by Tabula (http://tabula.nerdpower.org), I started working on table extraction from images in PDFs, with the goal of defining a structure template for a page and then using that template on subsequent pages: https://github.com/stefanw/carpenter
Carpenter uses OpenCV on images to detect tables and tesseract on the table cells to extract text, and it limits the set of characters to digits/punctuation if the cell likely contains a numeric value. With such tricks I had moderately promising success, but did not continue further.
I wanted Carpenter to become a web interface (think Refine) for structured OCR extraction tasks, template definition and OCR training. It's far from finished, but may be a starting point if you want to work on the topic.
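The digit-restriction trick described above can be sketched roughly as follows. This is not Carpenter's actual code: the `looks_numeric` heuristic, its threshold and the `NUMERIC_CHARS` set are assumptions made for illustration. The only real Tesseract feature relied on is the `tessedit_char_whitelist` config variable, and how well it is honoured may depend on the Tesseract version and engine in use.

```python
import string

# Characters allowed when a cell is treated as numeric (illustrative choice,
# not taken from Carpenter).
NUMERIC_CHARS = string.digits + ".,-+%()"


def looks_numeric(first_pass_text, threshold=0.6):
    """Guess whether a cell holds a number, based on an unrestricted first OCR pass.

    Hypothetical heuristic: the cell counts as numeric when at least
    `threshold` of its non-space characters are digits or number punctuation.
    """
    chars = [c for c in first_pass_text if not c.isspace()]
    if not chars:
        return False
    hits = sum(1 for c in chars if c in NUMERIC_CHARS)
    return hits / len(chars) >= threshold


def tesseract_command(image_path, numeric):
    """Build a tesseract CLI call that prints the recognised text to stdout."""
    cmd = ["tesseract", image_path, "stdout"]
    if numeric:
        # Restrict recognition to digits and number punctuation.
        cmd += ["-c", "tessedit_char_whitelist=" + NUMERIC_CHARS]
    return cmd
```

The returned list can be passed to `subprocess.run` for the second, restricted pass over a cell whose first pass looked numeric (for example "1,2O4.5", where the letter O is a likely misread of a zero).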
From @tfmorris:
Everything that gets uploaded to the Internet Archive (should) get OCR'd automatically. You can see all the different file formats here: https://ia600401.us.archive.org/7/items/oed01arch/
The PDF, ePub, and DjVu formats should all have text in some form or another.
...
Below is the text for a few entries as OCR'd by the default Internet Archive OCR (an old version of ABBYY FineReader). This was extracted from the ePub file, but if you wanted to work with the Internet Archive version of the OCR, you'd want to start with the ABBYY version because it contains more info (and perhaps convert it to hOCR as described in Rod Page's post http://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html). To see the original page image, look at column 3 here: http://archive.org/stream/oed01arch#page/4/mode/1up
... The ABBYY version is one of the formats in the directory. Look for the file that ends in _abbyy.gz. There's also a torrent containing all the files if that's easier.
The ABBYY file for the first volume is https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_abbyy.gz - it's massive though (322MB gzipped).
322MB isn't massive. If you want massive, start from the 2GB JP2 images! :-) https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_jp2.zip (I'd love to see how Tesseract does compared to IA's existing FineReader OCR version).
I just made a gist of Rod Page's ABBYY FineReader XML to hOCR XSLT, which may be useful if you find hOCR easier to work with: https://gist.github.com/tfmorris/5977784
@tfmorris You're right - that ain't massive ;-) Thanks to your assistance I've got it downloaded and starting to work on it :-)
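For anyone starting from the `_abbyy.gz` file discussed above, here is a minimal sketch of reading it without unpacking it to disk or loading it all into memory. It assumes the usual ABBYY FineReader XML layout, in which each `<line>` element holds one `<charParams>` element per character; the function names are made up for this example, and the namespace is stripped rather than hard-coded because the exact schema URI has not been checked against this file.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_lines(path):
    """Yield the text of each <line> element in a gzipped ABBYY XML file.

    Assumption: every character is the text of a <charParams> child,
    with spaces stored as a <charParams> whose text is a single space.
    """
    with gzip.open(path, "rb") as fh:
        # Stream the XML so the whole document is never held in memory.
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if local_name(elem.tag) == "line":
                yield "".join(
                    child.text or ""
                    for child in elem.iter()
                    if local_name(child.tag) == "charParams"
                )
                # Release the per-character elements of the finished line.
                elem.clear()
```

Something like `for text in iter_lines("oed01arch_abbyy.gz"): print(text)` would then print the OCR output line by line. The per-character coordinates that make the ABBYY format richer than the ePub text are ignored in this sketch.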
READ FIRST
Work is now in progress - see this repository: https://github.com/okfn/oed
Original Info
[Originally started in 2007]
From http://lists.okfn.org/pipermail/okfn-discuss/2007-November/000635.html:
Steps
See also http://wiki.okfn.org/OedFullText
Originally at: http://ideas.okfn.org/ideas/20/oxford-english-dictionary-1st-ed-full-text-online/ and, before that, recorded as: http://knowledgeforge.net/okfn/tasks/ticket/285