ryanfb / loebolus-data

Data for Loebolus
https://ryanfb.github.io/loebolus/
The Unlicense

Add OCR for existing PDFs #2

Open ryanfb opened 10 years ago

tmr83 commented 3 years ago

Is this still planned or if I fork the project and submit the OCR data, will it be accepted?

ryanfb commented 3 years ago

I'll gladly take submissions!

tmr83 commented 3 years ago

In that case, I will look around for the best and easiest OCR software available on Linux, run the first book through it, dump the output into a .txt file, proofread it, and make a pull request.

Should I make a new directory, or several? For example, each book could be renamed to its proper title, with each PDF sitting alongside its OCR data in a .txt file. I am not a programmer (I was an English major), but if your Perl script isn't too difficult to modify, I can update it with the new filenames and paths.

This repository has been on my mind because I added reading all of these books to my long-term goals. An additional goal of mine is to learn to port them into EPUBs.

ryanfb commented 3 years ago

What this issue was originally targeting is adding an OCR text layer to the existing PDFs, so that the PDFs themselves are searchable/copy-pasteable. The real difficulty is in doing so accurately for these books, which contain English as well as Ancient Greek and Latin. It may be possible with just Tesseract (I personally worked on building some Latin support for Tesseract), but I haven't checked whether the accuracy is at an acceptable level, or what the best way to apply it across all these files would be.

tmr83 commented 3 years ago

I used an online OCR service once for another project, and it was highly accurate. I was much more impressed by it than by Tesseract, which I have on my system, but I have been looking around at other options on Linux.

What software do you use to modify the PDFs? I can add the OCR text layer to them to make them searchable/copy-pasteable. I had intended to work in .txt files and port them to XHTML and EPUB. In my experience, I am usually not happy with Google-copied PDFs or with Project Gutenberg/Internet Archive ebook conversions.

As for reading Ancient Greek and Latin, it is something I can handle, even though my knowledge of Latin has lapsed. I handle linguistics well, and I am confident that I can work with those languages on my Linux system. I see no issue with proofreading and typing Greek and Latin.

ryanfb commented 3 years ago

Tesseract has a PDF output option, which will add the OCR text layer. Another challenge, though, is that it does not have a PDF input option, so the PDFs need to be converted to images and then merged back together (I know that the pdftk command-line utility can do this without mangling the text layer). This also adds the challenge of doing all of it while ensuring that the final PDFs don't balloon in size or lose quality due to the multiple conversions involved.
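
The round trip described above (PDF to images, Tesseract PDF output per page, pdftk merge) could be sketched roughly as follows. This is only an illustration under assumptions, not the method used in this repository: it assumes Poppler's pdftoppm is used for the image conversion (the thread does not name a tool), that the eng, lat, and grc Tesseract language data are installed, and the function names, file names, and 300 DPI default are hypothetical.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    # Assumption: Poppler's pdftoppm does the PDF-to-image step.
    # Writes one PNG per page, named <prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, out_base, langs=("eng", "lat", "grc")):
    # The trailing "pdf" config asks Tesseract for <out_base>.pdf,
    # i.e. the page image with an invisible OCR text layer.
    return ["tesseract", str(image), str(out_base), "-l", "+".join(langs), "pdf"]


def merge_cmd(page_pdfs, output):
    # pdftk concatenates the single-page PDFs in the order given.
    return ["pdftk", *[str(p) for p in page_pdfs], "cat", "output", str(output)]


def add_text_layer(pdf, workdir, output):
    """Hypothetical driver: run the three steps for one book."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page"), check=True)

    # pdftoppm is expected to zero-pad page numbers, so a lexical
    # sort should keep the pages in order; worth verifying per book.
    page_pdfs = []
    for image in sorted(workdir.glob("page-*.png")):
        base = image.with_suffix("")
        subprocess.run(ocr_cmd(image, base), check=True)
        page_pdfs.append(base.with_suffix(".pdf"))

    subprocess.run(merge_cmd(page_pdfs, output), check=True)
```

A call such as add_text_layer("book.pdf", "work/book", "book-ocr.pdf") would run the whole chain for one hypothetical file. The dpi value is the main lever on the size and quality concern: a higher resolution tends to help OCR accuracy but makes the re-encoded pages larger, so the output size should be compared against the original PDF before applying this across all files.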