process multiple html files with hocr2djvused

jwilk-archive / ocrodjvu

OCR for DjVu

GNU General Public License v2.0


process multiple html files with hocr2djvused #1

Closed by jwilk 12 years ago

jwilk commented 12 years ago

Issue reported by @thkoch2001:

Hi,

I ran tesseract manually on multiple image files (try GNU Parallel!) and ended up with one HTML (hOCR) file for every page. To combine those HTML pages into one djvused script, I hacked your hocr2djvused a bit.

My version now optionally also accepts input file parameters and processes those as consecutive pages.

You can find my changes here: https://github.com/thkoch2001/ocrodjvu/commit/318657e4a45bb8c8002e06382b73d49e984c0f30
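The behaviour described above (optional file arguments, each treated as the next page, with standard input as the fallback) could look roughly like the following. This is only a minimal Python 3 sketch, not the code from the linked commit: the names `parse_args` and `iter_inputs` are hypothetical, and it assumes exactly one page per input file.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the command line; file arguments are optional."""
    # 'hocr-merge-sketch' is a made-up program name for this illustration.
    parser = argparse.ArgumentParser(prog='hocr-merge-sketch')
    # Zero or more hOCR files; an empty list means "read standard input".
    parser.add_argument('paths', metavar='FILE', nargs='*')
    return parser.parse_args(argv)


def iter_inputs(paths, stdin=None):
    """Yield (page_number, text) pairs, one page per input, in order.

    Assumption: each file holds exactly one page, so the page number
    is simply the 1-based position of the file on the command line.
    """
    if not paths:
        # No file arguments: fall back to a single page read from stdin.
        stream = sys.stdin if stdin is None else stdin
        yield 1, stream.read()
        return
    for page_number, path in enumerate(paths, start=1):
        with open(path, encoding='utf-8') as stream:
            yield page_number, stream.read()
```

A caller would pass `parse_args(sys.argv[1:]).paths` to `iter_inputs` and convert each yielded page in turn, so that the output for page 2 follows the output for page 1.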

jwilk commented 12 years ago

The patch doesn't look crazy, but at least the documentation would have to be updated (lib/cli/hocr2djvused.py:31 and doc/hocr2djvused.xml).

Some nitpicking:

I prefer lst += x to lst.extend(x).
Please keep indentation consistent with the rest of the code.
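For context on the first nitpick: for Python lists, `lst += x` and `lst.extend(x)` both append the items of `x` to the existing list in place, so the choice is a matter of style. A small illustration (the function name `append_both_ways` is made up for this example):

```python
def append_both_ways(first, second):
    """Append `second` to copies of `first` using += and extend()."""
    merged = list(first)
    alias = merged            # second reference to the same list object
    merged += second          # in-place: mutates the existing list
    same_object = alias is merged

    other = list(first)
    other.extend(second)      # also in-place, same resulting contents

    return merged, other, same_object
```

Both lists end up with the same contents, and `same_object` is true because `+=` does not rebind the name to a new list. By contrast, `lst = lst + x` would build a new list and leave other references to the old one unchanged.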

jwilk commented 12 years ago

Implemented in f9922a64007475af87494464804cfb7155e80ccc.

jwilk commented 12 years ago

Fixed in 0.7.11.