mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Trainable Reading Order #484

Closed rohanchn closed 1 year ago

rohanchn commented 1 year ago

Hi @mittagessen!

I am checking here for updates on trainable reading order. Has it been implemented? Is there some documentation on this that I can follow?

rohanchn commented 1 year ago

Okay, I checked the reading_order branch, and I think I will use that. Closing this for now.

mittagessen commented 1 year ago

A couple of things got in the way over the last few weeks, so I haven't had much time to work on it, but the training side works (ketos rotrain). The new XML parser that extracts the reading order for training needs to be rewritten to work with complex orders, though, and the inference side is completely missing for now. There's a half-finished decoder in lib/segmentation.py, but the model management around it and all the data pathways to get the RO from segmentation to serialization don't exist yet.

rohanchn commented 1 year ago

Thanks. I looked at the code but hit a brick wall as the bridge between segtrain and rotrain wasn't clear.

rohanchn commented 1 year ago

> rewritten to work with complex orders

I would appreciate this a lot, since I am working with some Bengali periodicals that have two facing pages scanned as a single image, like in this pdf [very heavy file]. Each page has two columns, so I treat each image as having four columns and label each column as a separate <Text Block> with <Main:column#number>. The model I have trained is good at inferring regions and baselines, and it also orders lines reasonably well as is, but when a column has an associated margin text region, the ordering almost always fails. I wonder if trainable reading order can address this.

I don't want to use scantailor to process these files manually, as each volume has over 1500 pages and each periodical has several volumes. I am also unable to write a rule to split each image in two, as the page boundary is not really consistent. I do crop out the white background by extracting the maximum area contour.
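For readers who want to see the cropping step spelled out, here is a minimal, dependency-free sketch. It is not the code used in this thread: instead of a contour routine from an image library, it finds the largest 4-connected region of non-white pixels (measured by pixel count, not contour area) and crops to its bounding box. The function names, the list-of-lists grayscale image, and the `white_threshold` value are all assumptions made for illustration.

```python
from collections import deque


def largest_component_bbox(gray, white_threshold=240):
    """Return (top, left, bottom, right) of the largest 4-connected region
    of pixels darker than white_threshold, or None if there is none.

    gray is a list of rows of 0-255 intensities (hypothetical input format).
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    best_size, best_box = 0, None

    for y in range(height):
        for x in range(width):
            if seen[y][x] or gray[y][x] >= white_threshold:
                continue
            # Breadth-first flood fill over one connected region.
            seen[y][x] = True
            queue = deque([(y, x)])
            size = 0
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                size += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and gray[ny][nx] < white_threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if size > best_size:
                best_size, best_box = size, (top, left, bottom, right)
    return best_box


def crop(gray, box):
    """Crop the image to an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in gray[top:bottom + 1]]


# Toy scan: white background, one off-white "page", one stray dark speck.
scan = [[255] * 6 for _ in range(5)]
for row in range(1, 4):
    for col in range(1, 5):
        scan[row][col] = 200
scan[0][5] = 0

page_box = largest_component_bbox(scan)
page = crop(scan, page_box)
```

On real scans an image library's contour functions would be far faster; the sketch only shows the idea of keeping the largest foreground region and discarding the surrounding white.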

From what I understand of the paper, though, the new implementation could potentially alleviate the trouble I have handling poetry texts.

mittagessen commented 1 year ago

The trainable RO should be able to deal with something like your example even in its current state. When I talk about complex orders, I mean deeply nested ones with unordered and ordered elements, elements on different levels (regions, lines, segments, characters), and repeating elements. As long as your order is encoded in the implicit order of elements in the XML, the current code should work well.
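To make "implicit order of elements in the XML" concrete, here is a small sketch that reads regions and lines in document order using only Python's standard library. It is not kraken's parser. The sample document is a stripped-down, PAGE-style file invented for illustration, and the element ids (`col1`, `margin1`, and so on) are hypothetical.

```python
import xml.etree.ElementTree as ET

# PAGE XML namespace; the sample content below is invented for illustration.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="spread.png">
    <TextRegion id="col1">
      <TextLine id="col1_l1"/>
      <TextLine id="col1_l2"/>
    </TextRegion>
    <TextRegion id="margin1">
      <TextLine id="margin1_l1"/>
    </TextRegion>
    <TextRegion id="col2">
      <TextLine id="col2_l1"/>
    </TextRegion>
  </Page>
</PcGts>"""


def implicit_order(xml_text):
    """Return (region_ids, line_ids) in the order they appear in the file.

    No explicit ReadingOrder element is consulted: the order is simply the
    document order of TextRegion elements and of TextLine elements inside them.
    """
    root = ET.fromstring(xml_text)
    region_ids, line_ids = [], []
    for region in root.iterfind(".//pc:TextRegion", NS):
        region_ids.append(region.get("id"))
        for line in region.iterfind("pc:TextLine", NS):
            line_ids.append(line.get("id"))
    return region_ids, line_ids


regions, lines = implicit_order(SAMPLE)
```

Under this reading, placing a margin region directly after the column it belongs to in the XML is what tells the training data that the margin text follows that column.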

rohanchn commented 1 year ago

It does work well, except in the cases I explained above. Plus, a few of the elements I have do repeat. Perhaps I need to label more examples. I'll report back if this doesn't work. Thank you!