lseg

By default, OCRopus 4 uses word segmentation and word recognition, not line segmentation.

The published training data is derived automatically using other OCR engines and contains many errors. These errors tend to go away after multiple rounds of EM (semi-supervised/self-supervised training).

For both word and line segmentation, there are three classes: geometric center (nucleus), periphery, and background. Each nucleus determines one segment (a word or a text line, depending on the mode). Only pixels in the periphery are assigned to a segment; background pixels are excluded. If a connected periphery contains only one nucleus, all of its pixels are assigned to the segment represented by that nucleus. If a connected periphery contains multiple nuclei, each pixel in the periphery is assigned to the nearest nucleus.
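The assignment rule above can be sketched roughly as follows. This is an illustrative pure-Python sketch, not the OCRopus 4 implementation: the class encoding (0 = background, 1 = periphery, 2 = nucleus), the helper names `label_components` and `assign_segments`, the use of 4-connectivity, squared Euclidean distance to the nearest nucleus pixel, and treating nucleus pixels as part of the connected periphery region are all assumptions made for the example.

```python
from collections import deque

BACKGROUND, PERIPHERY, NUCLEUS = 0, 1, 2  # assumed class encoding


def label_components(grid, keep):
    """Label 4-connected components of the cells for which keep(value) is true."""
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] or not keep(grid[y][x]):
                continue
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not labels[ny][nx] and keep(grid[ny][nx])):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def assign_segments(classes):
    """Return a map of segment ids (0 = unassigned) from a per-pixel class map."""
    h, w = len(classes), len(classes[0])
    # One id per nucleus, one id per connected non-background region.
    nuclei, _ = label_components(classes, lambda v: v == NUCLEUS)
    regions, _ = label_components(classes, lambda v: v != BACKGROUND)

    # Group nucleus pixels by the region that contains them.
    seeds = {}
    for y in range(h):
        for x in range(w):
            if nuclei[y][x]:
                seeds.setdefault(regions[y][x], []).append((y, x, nuclei[y][x]))

    segments = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if classes[y][x] == BACKGROUND:
                continue  # background never belongs to a segment
            candidates = seeds.get(regions[y][x])
            if not candidates:
                continue  # region without any nucleus stays unassigned
            # Nearest nucleus pixel within the same region; ties go to the lower id.
            _, segments[y][x] = min(
                ((sy - y) ** 2 + (sx - x) ** 2, label) for sy, sx, label in candidates
            )
    return segments


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 2, 1, 2, 1, 0],  # two nuclei sharing one connected periphery
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 2, 1, 0, 0, 0],  # one nucleus with its own periphery
]
segments = assign_segments(page)
for row in segments:
    print(row)
```

In the toy `page`, the upper region holds two nuclei, so its pixels are split between them by distance, while the lower region has a single nucleus and is assigned to it wholesale. The brute-force nearest-nucleus search keeps the sketch short; a distance transform would be the more practical choice for real page images.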

The algorithm works reasonably well with touching text lines. Given the training data we have, the segmentation isn't precise, but it's good enough for recognition. With better training data, the same framework could be used for pixel-exact segmentation.

The algorithm doesn't care about orientation; if you train it on vertical or curved text, it will segment that kind of text just fine. We have used it for scene text as well.

It seems to generalize quite well to different classes of documents; that mainly depends on available training data and augmentations used during training. By default, we train with UW3, G1000, Tobacco Corpus, and public documents from the Archive. There's a lot more to be done to optimize the training set composition and the augmentations.

ocropus-archive / ocropus4-old

lseg #7