internetarchive / archive-hocr-tools

Efficient hOCR tooling

mention hocr_to_pdf in README, along with other undocumented tools #7

Closed by jrochkind 1 year ago

jrochkind commented 1 year ago

The pdf_to_hocr functionality is super useful for a variety of possible work/data flows (including manual correction workflows), and I'm not aware of any other open source implementation. I didn't realize it was in here at first -- even though I had glanced at this project -- until I was looking at source code for pdfcomp and saw it call this. I figured it was worth mentioning in the README, along with mentioning more prominently the existence of other currently un-doc'd utilities the reader might want to look at.

Thanks for the code!

MerlijnWajer commented 1 year ago

@kba - I think what @jrochkind meant was pdf-to-hocr (https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr) which does the opposite: it takes a PDF with text layers and creates hOCR files from it. The title of the PR is conflating the conversion too, but the commit is clear about it.

btw, unrelated, but in my experience hocr-pdf wasn't fit for the use cases I had in mind, and Tesseract did a much better job at generating text layers (which is why I pretty much verbatim copied it in Python)
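To make the PDF-to-hOCR direction concrete, here is a minimal sketch of the output side only: turning word boxes into hOCR markup. It does not read a PDF and is not the code in bin/pdf-to-hocr. The function name `words_to_hocr` and the input tuple layout are hypothetical, and the sketch assumes the boxes were already pulled from the PDF text layer and converted to top-left-origin pixel coordinates.

```python
from html import escape


def words_to_hocr(words, page_width, page_height):
    """Serialize word boxes as one hOCR page holding a single line.

    `words` is a non-empty list of (text, x0, y0, x1, y1) tuples
    (hypothetical layout, chosen for this sketch only).
    """
    # The enclosing line's bbox is the union of all word boxes.
    lx0 = min(w[1] for w in words)
    ly0 = min(w[2] for w in words)
    lx1 = max(w[3] for w in words)
    ly1 = max(w[4] for w in words)

    # hOCR stores geometry in the title attribute as "bbox x0 y0 x1 y1".
    word_spans = "\n".join(
        f'  <span class="ocrx_word" title="bbox {x0} {y0} {x1} {y1}">'
        f"{escape(text)}</span>"
        for text, x0, y0, x1, y1 in words
    )
    return (
        f'<div class="ocr_page" title="bbox 0 0 {page_width} {page_height}">\n'
        f' <span class="ocr_line" title="bbox {lx0} {ly0} {lx1} {ly1}">\n'
        f"{word_spans}\n"
        " </span>\n"
        "</div>"
    )
```

A real converter would also group words into lines and paragraphs and wrap the result in a full HTML document; this sketch skips all of that.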

MerlijnWajer commented 1 year ago

@jrochkind - thanks!

jrochkind commented 1 year ago

Thanks!

Yes, in the other direction, rendering hocr to text positioned in a PDF -- I have evaluated a few other open source implementations, and haven't found any that do as good a job as archive-pdf-tools.

I think I haven't actually found any others that can, for instance, properly position diagonally angled text -- tesseract can, and archive-pdf-tools with a tesseract HOCR can as well. (although with incorrect line heights in my test, so maybe not a complete success).

At some point I'll publish my lengthy notes on all this on my blog.

So archive-pdf-tools does a better job with a tesseract-generated HOCR than any other open source tool I've found.

However, as I reported in https://github.com/internetarchive/archive-pdf-tools/issues/63 , archive-pdf-tools with a tesseract HOCR is still, surprisingly, not doing as good a job as tesseract itself does. It makes me wonder if tesseract is using information that doesn't actually get serialized to its own hOCR. But I haven't tried to debug.
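A possible first debugging step would be to list what each line in the hOCR file actually records. In hOCR, per-element properties live in the `title` attribute as semicolon-separated `name value` pairs (for example `bbox`, `baseline`, `textangle`). The sketch below, using only the Python standard library, collects those properties for every element with class `ocr_line`. The function names are hypothetical, and it assumes lines are marked `ocr_line` only; other line-level classes are ignored.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title attribute into a {property: value-string} dict."""
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        # The property name is the first token; the rest is its value.
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props


class _LineCollector(HTMLParser):
    """Collects the parsed title properties of every ocr_line element."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append(parse_title(attrs.get("title") or ""))


def line_properties(hocr_text):
    """Return one property dict per ocr_line found in the hOCR text."""
    collector = _LineCollector()
    collector.feed(hocr_text)
    return collector.lines
```

Running `line_properties` over a tesseract-generated hOCR file and checking which lines carry `baseline` or `textangle` would show how much rotation and baseline information is available to any downstream renderer.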