ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

HTML exporter #24

Open zuphilip opened 8 years ago

zuphilip commented 8 years ago

hOCR files are already HTML files and can be displayed in any browser. However, a browser will just display the text without any layout or format information. What do you think about an HTML exporter that also displays some of the layout or format information? With the bbox we can show the text at the correct position; see also https://github.com/tmbdev/ocropy/issues/80#issuecomment-177227732
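To illustrate the bbox idea, here is a minimal sketch in Python, not an existing hocr-tools script. It assumes that words are `span` elements with class `ocrx_word`, that their `title` attribute carries `bbox x0 y0 x1 y1` in pixels, and that word spans contain no nested spans. The names `WordCollector` and `hocr_to_positioned_html` are hypothetical, and using the bbox height as the font size is only a rough guess.

```python
import re
from html import escape
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class WordCollector(HTMLParser):
    """Collect (bbox, text) pairs for every ocrx_word span (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of ((x0, y0, x1, y1), text)
        self._bbox = None   # bbox of the word currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word span contains no nested span elements.
        if self._bbox is not None and tag == "span":
            self.words.append((self._bbox, "".join(self._text).strip()))
            self._bbox = None


def hocr_to_positioned_html(hocr: str) -> str:
    """Return an HTML fragment with each word absolutely positioned at its bbox."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    spans = []
    for (x0, y0, x1, y1), text in collector.words:
        style = (
            f"position:absolute;left:{x0}px;top:{y0}px;"
            f"width:{x1 - x0}px;height:{y1 - y0}px;"
            f"font-size:{y1 - y0}px"  # crude guess: font size = bbox height
        )
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    return '<div style="position:relative">\n' + "\n".join(spans) + "\n</div>"
```

Calling `hocr_to_positioned_html(open("page.hocr", encoding="utf-8").read())` (with `page.hocr` as a placeholder file name) would give a fragment that can be placed inside a page, optionally on top of the scanned image.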

kba commented 8 years ago

This could certainly be done. I'd favor a solution in JavaScript though, to be more flexible. And it looks like a fun project, too. I'm currently focusing on validating hOCR docs with Schematron (to augment hocr-check), so I'm not sure when there will be time to work on this.

wanghaisheng commented 8 years ago

I can help with the Schematron-related work.

kba commented 8 years ago

@wanghaisheng Great, help very welcome! Let's discuss in the hocr-spec gitter, I'll explain what I've done so far later tonight.

ogencoglu commented 8 years ago

Any news on this?

zuphilip commented 8 years ago

See https://github.com/kba/hocrjs

kba commented 8 years ago

I'm working on it (https://github.com/kba/hocrjs), but at the moment I'm focusing on the hOCR spec to get the implementation right.

There's also @jbaiter's hocrviewer-mirador which requires setup but has a great interface.

zuphilip commented 8 years ago

I found https://github.com/ultrasaurus/hocr-javascript which is an approach to overlay the OCR data on the picture by using JavaScript, see e.g. http://rawgit.com/ultrasaurus/hocr-javascript/master/letter.html .

m-art-in commented 2 years ago

Any progress here in the last few years? What would be the state-of-the-art tool to present text and images?

Thanks!

kba commented 2 years ago

Have you tried https://github.com/kba/hocrjs? Otherwise, you can convert to another format like PAGE-XML and use PAGEViewer or Aletheia.

stweil commented 2 years ago

PAGEViewer already supports hOCR, so no need for a conversion.

m-art-in commented 2 years ago

PAGEViewer is a standalone application. I am looking for a solution to display image and text as a synoptic view on a web page. So far I have wanted to use hOCR as the data format, but if another format offers better solutions for such a web presentation, I would reconsider that decision and use that format instead.

What would you suggest?

stweil commented 2 years ago

Then hocrjs could be a good starting point for you.