Open amitdo opened 8 years ago
Thanks Amit, I'll read it. It's available on the DFKI website.
Good read, though nothing to add to the spec as far as I could see.
Funny to read the (valid) arguments for such a format based on standards and common sense in the hope that it would be "persuasive enough to induce both commercial vendors and researchers to standardize on a single processing and interchange format." and that the "full specification of the hOCR format and sample documents are available at xxxxxx"
And here we are, ten years later with engine-specific formats, competing engine-agnostic formats and digitization projects still developing new formats.
I read it too. Thanks for the link! :)
ALTO is not mentioned there.
I remembered that Thomas Breuel compared hOCR to ALTO somewhere, and I have now found it in an hOCR group Tom opened: https://groups.google.com/forum/#!forum/hocr https://groups.google.com/forum/#!topic/hocr/S6MC53lA5-o
And here we are, ten years later with engine-specific formats, competing engine-agnostic formats and digitization projects still developing new formats.
In the link I gave you earlier there is a link to an article about the PAGE format...
Not invented here syndrome
In https://groups.google.com/forum/#!topic/hocr/S6MC53lA5-o Tom Breuel argues that ALTO would go away because it's complicated and hOCR is the less complex and more versatile spec. However, little to no activity in hocr means low adoption and spotty compliance, at least for publicly available tools. For many of the advanced features there are no samples at all.
There was some discussion in 2012 on further development https://groups.google.com/forum/#!topic/hocr/voddaLIBFSs but it fizzled out quietly.
While I'd love to use hOCR to its advantage (image link/JavaScript/CSS/manual post-correction right in the file), I'm hesitant to develop more tools for an essentially abandoned format.
I started a wiki page hOCR Bibliography with the information from this issue.
Thanks @zuphilip, and thanks also to @amitdo for https://github.com/kba/hocr-spec/wiki/hOCR-producers-&-consumers, which is very useful. We could add publications and implementations to the spec document.
... we could add publications and implementations to the spec document.
For me it is a little easier to just copy and paste in the wiki.
Maybe I should change the wiki title to 'hOCR implementers'?
Shorter is always better IMHO, esp. if it's a URL. "hOCR" is redundant; I find "Publications" and "Software" good titles.
Another real-world usage by @christopher-johnson https://github.com/blumenbach/modeller/blob/c4c8b60cebe088b275cecf59623a3fd4630b23cd/modeller-hocr/src/main/java/org/blume/modeller/hOCRData.java
@zuphilip, any objection to changing the title of 'hOCR producers & consumers'?
Do you prefer 'Software' or 'Implementations'?
This wiki page was IMO created by @kba and he should say something to your question.
This wiki page was IMO created by @kba
No :)
and he should say something to your question.
He already did...
Ah, you created the page itself. Sorry, I am a little distracted by the conference here... I agree that "Software" and "Publications" are convincingly short and make it easier to get an idea of what the page is about. 👍
I changed the title to Software.
In this case, I see that the previous edits were preserved in the interface of the global history.
zuphilip: Add ocrodjvu and hocrjs, on Oct 22 (4528bed)
amitdo: Create 'hOCR producers & consumers' wiki page, on Oct 5 (84a72d8)
Written by Thomas Breuel https://www.researchgate.net/publication/232632963_The_hOCR_Microformat_for_OCR_Workflow_and_Results_PDF Maybe you want to try to get the full text...