kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Old paper about hOCR #18

Open amitdo opened 8 years ago

amitdo commented 8 years ago

Written by Thomas Breuel: https://www.researchgate.net/publication/232632963_The_hOCR_Microformat_for_OCR_Workflow_and_Results_PDF
Maybe you want to try to get the full text...

kba commented 8 years ago

Thanks Amit, I'll read it. It's available on the DFKI website.

kba commented 8 years ago

Good read, though nothing to add to the spec as far as I could see.

Funny to read the (valid) arguments for such a format based on standards and common sense in the hope that it would be "persuasive enough to induce both commercial vendors and researchers to standardize on a single processing and interchange format." and that the "full specification of the hOCR format and sample documents are available at xxxxxx"

And here we are, ten years later with engine-specific formats, competing engine-agnostic formats and digitization projects still developing new formats.

amitdo commented 8 years ago

I read it too. Thanks for the link! :)

ALTO is not mentioned there.

I remembered that Thomas Breuel compared hOCR to ALTO somewhere, and I have now found it in an hOCR group Tom opened: https://groups.google.com/forum/#!forum/hocr https://groups.google.com/forum/#!topic/hocr/S6MC53lA5-o

"And here we are, ten years later with engine-specific formats, competing engine-agnostic formats and digitization projects still developing new formats."

In the link I gave you earlier there is a link to an article about the PAGE format...

amitdo commented 8 years ago

Not invented here syndrome

kba commented 8 years ago

https://xkcd.com/927/

kba commented 8 years ago

In https://groups.google.com/forum/#!topic/hocr/S6MC53lA5-o Tom Breuel argues that ALTO would go away because it's complicated and hOCR is the less complex and more versatile spec. However, little to no activity around hOCR means low adoption and spotty compliance, at least for publicly available tools. For many of the advanced features there are no samples at all.

There was some discussion in 2012 on further development https://groups.google.com/forum/#!topic/hocr/voddaLIBFSs but it fizzled out quietly.

While I'd love to use hOCR to its advantage (image link, JavaScript, CSS, and manual post-correction right in the file), I'm hesitant to develop more tools for an essentially abandoned format.
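
As a rough illustration of what "right in the file" means: an hOCR document is ordinary HTML, so the page image link, the word boxes, and any CSS all live in one file that standard HTML tooling can read. The sketch below uses only Python's standard library. The sample markup, the file name `page-0001.png`, the coordinates, and the names `parse_title` and `WordCollector` are made up for this example; only the class names `ocr_page`, `ocr_line`, `ocrx_word` and the `title` properties `image`, `bbox`, `x_wconf` are taken from hOCR itself.

```python
from html.parser import HTMLParser

# Hypothetical sample document; file name and coordinates are invented.
SAMPLE = """<html><head>
<style>.ocrx_word { border-bottom: 1px dotted gray; }</style>
</head><body>
<div class='ocr_page' title='image "page-0001.png"; bbox 0 0 2480 3508'>
  <span class='ocr_line' title='bbox 120 200 900 260'>
    <span class='ocrx_word' title='bbox 120 200 310 260; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 330 200 560 260; x_wconf 91'>world</span>
  </span>
</div>
</body></html>"""


def parse_title(title):
    """Split an hOCR title attribute into {property: [values, ...]}.

    Simplification: values are split on whitespace, so an image path
    containing spaces would not survive this sketch.
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = [p.strip('"') for p in parts[1:]]
    return props


class WordCollector(HTMLParser):
    """Collect the page image name and (text, bbox) pairs for each word."""

    def __init__(self):
        super().__init__()
        self.image = None
        self.words = []
        self._bbox = None  # bbox of the word element currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        props = parse_title(attrs.get("title") or "")
        if "ocr_page" in classes and "image" in props:
            self.image = props["image"][0]
        if "ocrx_word" in classes and "bbox" in props:
            self._bbox = tuple(int(v) for v in props["bbox"])

    def handle_data(self, data):
        # Only text inside an open word element is recorded.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


collector = WordCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.image)
for text, bbox in collector.words:
    print(text, bbox)
```

The same file opens in a browser as-is, which is where the embedded CSS (and, in a fuller example, JavaScript for post-correction) would come into play.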

zuphilip commented 8 years ago

I started a wiki page, hOCR Bibliography, with the information from this issue.

kba commented 8 years ago

Thanks @zuphilip, and thanks @amitdo for https://github.com/kba/hocr-spec/wiki/hOCR-producers-&-consumers, which is very useful. We could add publications and implementations to the spec document.

zuphilip commented 8 years ago

"... we could add publications and implementations to the spec document."

For me it is a little easier to just copy and paste into the wiki.

amitdo commented 8 years ago

Maybe I should change the wiki title to 'hOCR implementers'?

kba commented 8 years ago

Shorter is always better IMHO, esp. if it's a URL. "hOCR" is redundant, I find "Publications" and "Software" good titles.

kba commented 7 years ago

Another real-world usage by @christopher-johnson https://github.com/blumenbach/modeller/blob/c4c8b60cebe088b275cecf59623a3fd4630b23cd/modeller-hocr/src/main/java/org/blume/modeller/hOCRData.java

amitdo commented 7 years ago

@zuphilip, any objection to changing the title of 'hOCR producers & consumers'?

Do you prefer 'Software' or 'Implementations'?

zuphilip commented 7 years ago

This wiki page was IMO created by @kba and he should say something to your question.

amitdo commented 7 years ago

"This wiki page was IMO created by @kba"

No :)

"and he should say something to your question."

He already did...

zuphilip commented 7 years ago

Ah, you created the page itself. Sorry, I am a little distracted by the conference here... I agree that "Software" and "Publications" are convincingly short and make it easier to get an idea of what the page is about. 👍

amitdo commented 7 years ago

I changed the title to Software.

In this case, I see that the previous edits were preserved in the interface of the global history.

zuphilip: Add ocrodjvu and hocrjs, on Oct 22, 4528bed
amitdo: Create 'hOCR producers & consumers' wiki page, on Oct 5, 84a72d8