Closed zuphilip closed 6 years ago
Done. Thank you, Philipp and Konstantin, for the contribution and the review.
Should we tag a new release based on master? 1.3.0?
The script could also be extended to create two new hOCR files, one for the left page and one for the right page.
New release sounds good, but there is already one drafted. Sorry, I forgot about this. Maybe we can do two new releases, 1.2.1 and 1.3.0?
Improving the script sounds fine. That said, I expect that after cutting a double page into two single pages, it might be better to run OCR on each of them again.
Let's start with 1.2.1, then create 1.3.0.
Running OCR again on the single pages is reasonable, but it can cost a lot of resources if many pages have to be processed, so separate hOCR files derived from the initial double pages may be desirable in certain situations.
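As a rough illustration of that idea, the following sketch assigns each bounding box to the left or right page and shifts the right-hand boxes so their coordinates start at zero again. This is only an assumption about how such an extension might work, not what the script currently does; `split_boxes` and the tuple layout `(x0, y0, x1, y1)` are hypothetical names chosen for this example.

```python
from typing import List, Tuple

# (x0, y0, x1, y1), the same order as in an hOCR "bbox" property
BBox = Tuple[int, int, int, int]


def split_boxes(bboxes: List[BBox], cut_x: int) -> Tuple[List[BBox], List[BBox]]:
    """Assign boxes to the left or right page of a cut at x = cut_x.

    A box goes to the side that contains its horizontal center.
    Boxes that cross the cut are clipped to their page, and boxes
    on the right page are shifted left by cut_x.
    """
    left: List[BBox] = []
    right: List[BBox] = []
    for x0, y0, x1, y1 in bboxes:
        center = (x0 + x1) / 2
        if center < cut_x:
            left.append((x0, y0, min(x1, cut_x), y1))
        else:
            right.append((max(x0, cut_x) - cut_x, y0, x1 - cut_x, y1))
    return left, right
```

Writing the two lists back into separate hOCR files would additionally require rewriting the `bbox` values in the `title` attributes of the copied elements, which is left out here.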
This cuts a page (horizontally) into two pages near the middle, such that most of the bounding boxes are separated cleanly, e.g. for cutting double pages or double columns.
For example, a double page is cut in the middle, producing a left page and a right page.
The whole computation is based on the bounding boxes and therefore needs the output of some OCR or layout segmentation process as its input. It might still be worthwhile to OCR the individual pages again afterwards to get better results (e.g. the skew might be more consistent within one page than across a double page).
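To make the bounding-box idea concrete, here is a minimal sketch of one way to pick a cut position: search a window around the page middle for the x coordinate that passes through the fewest boxes, preferring positions closer to the middle. This is an illustrative assumption, not necessarily the approach taken in this PR; `choose_cut` and `search_fraction` are hypothetical names.

```python
from typing import List, Tuple

# (x0, y0, x1, y1), the same order as in an hOCR "bbox" property
BBox = Tuple[int, int, int, int]


def choose_cut(bboxes: List[BBox], page_width: int,
               search_fraction: float = 0.2) -> int:
    """Return an x position near the page middle that crosses the fewest boxes.

    Only positions within a window of width page_width * search_fraction,
    centered on the middle, are considered. Ties are broken in favor of
    the position closest to the middle.
    """
    middle = page_width // 2
    radius = int(page_width * search_fraction / 2)

    def crossings(x: int) -> int:
        # A box is crossed if the cut lies strictly inside its x range.
        return sum(1 for x0, _, x1, _ in bboxes if x0 < x < x1)

    candidates = range(middle - radius, middle + radius + 1)
    return min(candidates, key=lambda x: (crossings(x), abs(x - middle)))
```

For instance, with a 2000 pixel wide double page whose left text block ends at x = 1030 and whose right block starts at x = 1050, the exact middle (1000) would cut through the left block, so the sketch moves the cut to 1030 instead.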