rufuspollock / ideas

Ideas for (tech) stuff to research, build or work on.
https://rufuspollock.com/

OED 1st Edition - Extract and Online #50

Open rufuspollock opened 11 years ago

rufuspollock commented 11 years ago

READ FIRST

Work is now in progress - see this repository: https://github.com/okfn/oed

[Originally started in 2007]

From http://lists.okfn.org/pipermail/okfn-discuss/2007-November/000635.html:

Kragen Sitaker did amazing work back in 2005/2006 'liberating' the OED first edition, which is now (mostly) in the public domain [2]. He posted fairly good scans of volumes 1-6 on archive.org (see [2]). However, at the time he was unable to do much on the OCR front (no doubt because of the poor performance of open source OCR, particularly on a text as complex as the OED, which has lots of non-standard English and font changes). With better open source OCR engines it would be possible to convert the OED back into text and perhaps wikify it to allow for gradual proof-editing and correction.

http://lists.canonical.org/pipermail/kragen-tol/2006-March/000816.html

Steps

See also http://wiki.okfn.org/OedFullText

Originally at: http://ideas.okfn.org/ideas/20/oxford-english-dictionary-1st-ed-full-text-online/ and before that was originally recorded as: http://knowledgeforge.net/okfn/tasks/ticket/285

rufuspollock commented 11 years ago

See (very old) code at https://github.com/okfn/oed

rufuspollock commented 11 years ago

Email to okfn-labs list about this on 20th June: http://lists.okfn.org/pipermail/okfn-labs/2013-June/000913.html

Lots of valuable feedback and follow up including from @stefanw

Inspired by Tabula (http://tabula.nerdpower.org), I started working on table extraction from images in PDFs, with the goal of defining a structure template for a page and then applying that template to subsequent pages: https://github.com/stefanw/carpenter

Carpenter uses OpenCV on images to detect tables and tesseract on the table cells to extract text. It limits the set of characters to digits/punctuation if the cell likely contains a numeric value. With such tricks I had moderately promising success, but did not continue further.

I wanted Carpenter to become a web interface (think Refine) for structured OCR extraction tasks, template definition and OCR training. It's far from finished, but may be a starting point if you want to work on the topic.
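The character-restriction trick described above can be sketched as follows. This is not Carpenter's actual code: the function name `tesseract_cell_command` and the character set `DIGIT_CHARS` are made up for illustration, and it assumes the `tesseract` command-line tool is installed. It only builds the command, so it can be inspected before anything is run.

```python
# Illustrative character set for numeric cells (an assumption, not Carpenter's).
DIGIT_CHARS = "0123456789.,-"


def tesseract_cell_command(image_path, numeric=False):
    """Build an argv list for running the tesseract CLI on one cropped cell image."""
    # "stdout" as the output base makes tesseract print the text instead of
    # writing a file. "--psm 7" treats the image as a single line of text
    # (older 3.x releases spell this flag "-psm").
    cmd = ["tesseract", image_path, "stdout", "--psm", "7"]
    if numeric:
        # tessedit_char_whitelist restricts recognition to the listed characters.
        # Support for it varies between Tesseract versions and engines.
        cmd += ["-c", "tessedit_char_whitelist=" + DIGIT_CHARS]
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, capture_output=True, text=True)`, with `numeric=True` only for cells that a heuristic has already flagged as likely numbers.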

From @tfmorris:

Everything that gets uploaded to the Internet Archive (should) get OCR'd automatically. You can see all the different file formats here: https://ia600401.us.archive.org/7/items/oed01arch/

The PDF, EPUB, and DjVu formats should all have text in some form or another.

...

Below is the text for a few entries as OCR'd by the default Internet Archive OCR (an old version of ABBYY FineReader). This was extracted from the EPUB file, but if you wanted to work with the Internet Archive version of the OCR, you'd want to start with the ABBYY version because it contains more info (and perhaps convert it to hOCR as described in Rod Page's post http://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html). To see the original page image look at column 3 here: http://archive.org/stream/oed01arch#page/4/mode/1up

... The ABBYY version is one of the formats in the directory. Look for the file whose name ends in _abbyy.gz. There's also a torrent containing all the files if that's easier.

rufuspollock commented 11 years ago

The ABBYY file for the first volume is https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_abbyy.gz - it's massive though (322MB gzipped).
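Given the size of the file, one option is to stream it rather than load it whole. The following is a minimal sketch, assuming the gzipped file contains ABBYY FineReader XML in which each text line is a `line` element whose characters are `charParams` elements; the helper names `local_name` and `iter_lines` are made up for illustration. It ignores XML namespaces so it does not depend on a particular schema version.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_lines(path_or_file):
    """Yield the text of each OCR line from a gzipped ABBYY XML file."""
    with gzip.open(path_or_file, "rb") as fh:
        chars = []
        # "end" events fire once an element is complete, so every charParams
        # of a line is seen before the end of the enclosing line element.
        for _event, elem in ET.iterparse(fh, events=("end",)):
            name = local_name(elem.tag)
            if name == "charParams":
                chars.append(elem.text or "")
            elif name == "line":
                yield "".join(chars)
                chars = []
            elif name == "page":
                # Drop the finished page's subtree to keep memory use bounded.
                elem.clear()
```

For example, `for text in iter_lines("oed01arch_abbyy.gz"): print(text)` would print the recognised text line by line without unpacking the archive to disk first.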

tfmorris commented 11 years ago

322MB isn't massive. If you want massive, start from the 2GB JP2 images! :-) https://ia600401.us.archive.org/7/items/oed01arch/oed01arch_jp2.zip (I'd love to see how Tesseract does compared to IA's existing FineReader OCR version).

I just made a gist of Rod Page's ABBYY FineReader XML to hOCR XSLT, which may be useful if you find hOCR easier to work with: https://gist.github.com/tfmorris/5977784
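Once the OCR is in hOCR form, the words and their page coordinates can be pulled out with the standard library alone. This is a rough sketch, assuming the hOCR marks each word with the class `ocrx_word` and stores its bounding box as a `bbox x0 y0 x1 y1` property in the `title` attribute, as the hOCR format describes; whether the XSLT output follows exactly that layout has not been checked here. The names `HocrWords` and `parse_bbox` are made up for illustration.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # Title properties are separated by ';', e.g. "bbox 10 5 60 35; x_wconf 93".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0  # nesting depth inside the current word, 0 if outside
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Nested markup inside a word, such as <em> or <strong>.
            self._depth += 1
        elif "ocrx_word" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._bbox))
```

Usage would be along the lines of `p = HocrWords(); p.feed(hocr_text)`, after which `p.words` holds the word list. Unclosed void tags such as `<br>` inside a word would confuse the depth counter, so this is a starting point rather than a robust parser.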

rufuspollock commented 11 years ago

@tfmorris You're right - that ain't massive ;-) Thanks to your assistance I've got it downloaded and starting to work on it :-)