ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Add new script hocr-cut for cutting a page #108

Closed zuphilip closed 6 years ago

zuphilip commented 7 years ago

This cuts a page (horizontally) into two pages near the middle, at a position chosen so that most of the bounding boxes fall cleanly on one side or the other. It is meant for cutting double pages or double columns.

For example, this double page

[image: litver]

is cut in the middle, producing a left and a right page.

The whole computation is based on the bounding boxes and therefore needs the output of some OCR or layout segmentation process as its input. It might then be possible to run OCR again on the individual pages to get better results (e.g. skew might be more consistent within a single page than across a double page).
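To illustrate the idea of choosing the cut from bounding boxes alone, here is a minimal sketch. It is not the code of hocr-cut itself: the function names `parse_bbox` and `choose_cut`, the `search_fraction` parameter, and the tie-breaking rule are all assumptions made for this example. It only relies on the hOCR convention that an element's `title` attribute carries `bbox x0 y0 x1 y1`.

```python
import re

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    match = BBOX_RE.search(title)
    return tuple(int(v) for v in match.groups()) if match else None


def choose_cut(bboxes, page_width, search_fraction=0.2):
    """Pick an x position near the page middle that crosses the fewest boxes.

    search_fraction is a hypothetical parameter: the share of the page width,
    centred on the middle, in which candidate cut positions are tried.
    """
    middle = page_width // 2
    half_window = int(page_width * search_fraction / 2)
    candidates = range(middle - half_window, middle + half_window + 1)

    def crossings(x):
        # A box is crossed when the cut lies strictly inside its x range.
        return sum(1 for (x0, _y0, x1, _y1) in bboxes if x0 < x < x1)

    # Fewest crossings first; ties go to the candidate closest to the middle.
    return min(candidates, key=lambda x: (crossings(x), abs(x - middle)))


boxes = [
    parse_bbox("bbox 100 100 900 150"),
    parse_bbox("bbox 1100 100 1900 150; x_wconf 93"),
    parse_bbox("bbox 100 200 1050 250"),
]
cut_x = choose_cut(boxes, page_width=2000)
```

In this toy input the third box reaches past the exact middle (x = 1000), so the sketch moves the cut to x = 1050, the nearest position that crosses no box.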

stweil commented 6 years ago

Done. Thank you, Philipp and Konstantin, for the contribution and the review.

stweil commented 6 years ago

Should we tag a new release based on master? 1.3.0?

stweil commented 6 years ago

The script could be extended to create two new hOCR files for left and right page, too.
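As a rough sketch of what such an extension might involve (an assumption for illustration, not an existing hocr-tools feature): once a cut position is known, each bounding box can be assigned to the left or right page by its horizontal centre, and the right-hand coordinates shifted so they are relative to the new page. The name `split_boxes` and the clipping behaviour are hypothetical.

```python
def split_boxes(bboxes, cut_x):
    """Assign (x0, y0, x1, y1) boxes to a left and a right page at cut_x.

    Boxes go to the side containing their horizontal centre. Boxes that
    straddle the cut are clipped to it, and right-page coordinates are
    shifted so the right page starts at x = 0.
    """
    left, right = [], []
    for (x0, y0, x1, y1) in bboxes:
        center = (x0 + x1) / 2
        if center < cut_x:
            left.append((x0, y0, min(x1, cut_x), y1))
        else:
            right.append((max(x0, cut_x) - cut_x, y0, x1 - cut_x, y1))
    return left, right


left_page, right_page = split_boxes(
    [(100, 100, 900, 150), (1100, 100, 1900, 150), (100, 200, 1050, 250)],
    cut_x=1050,
)
```

Writing two complete hOCR files would additionally require updating the page-level `bbox` and image references, which this sketch leaves out.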

zuphilip commented 6 years ago

A new release sounds good, but there is already one drafted; sorry, I forgot about it. Maybe we can do two new releases, 1.2.1 and 1.3.0?

Improving the script sounds fine. I also expect that after cutting a double page into two single pages, it might be better to run OCR on each of them again.

stweil commented 6 years ago

Let's start with 1.2.1, then create 1.3.0.

Running OCR again on the single pages is reasonable, but it can cost a lot of resources if many pages have to be processed, so separate hOCR files derived directly from the initial double pages can be desirable in certain situations.