robertknight / ocrs-models

PyTorch models for the ocrs OCR engine

Implement differentiable binarization and mask erosion #2

Closed robertknight closed 9 months ago

robertknight commented 2 years ago

Implement two changes, differentiable binarization (DB) and mask erosion, to make it easier to segment individual words and lines out of the text mask. With the initial U-Net model, adjacent words and lines were prone to running together, which made them difficult to separate. The plan is to feed the detection results into a model which processes text lines, so the separation that matters most is between adjacent lines.
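The description does not spell out how the mask erosion is performed. As a rough illustration only, the sketch below assumes a plain morphological erosion of a binary text mask, implemented as a min filter via max pooling on the negated mask. The function name erode_mask and the 3x3 kernel size are hypothetical and not taken from this repository.

```python
import torch
import torch.nn.functional as F


def erode_mask(mask: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Erode a binary mask of shape (N, 1, H, W).

    Hypothetical helper: a min filter expressed as max pooling on the
    negated mask. A pixel stays 1 only if every in-bounds pixel in its
    kernel_size x kernel_size neighbourhood is 1.
    """
    pad = kernel_size // 2
    return -F.max_pool2d(-mask, kernel_size=kernel_size, stride=1, padding=pad)


# A 3x3 block of text pixels inside a 5x5 mask.
mask = torch.zeros(1, 1, 5, 5)
mask[:, :, 1:4, 1:4] = 1.0

eroded = erode_mask(mask)
```

Shrinking each text region this way leaves a wider gap of background between neighbouring regions, which is the kind of separation between adjacent lines that the change is aiming for.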

Adding DB introduces parameters and computations into the model which are only needed at training time. To optimize inference speed, an eval_only option has been added to DetectionModel which controls whether the training-only parameters are created, and module.training checks have been added in the forward pass to skip unnecessary computation during inference.
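A minimal sketch of this pattern, assuming the usual DB formulation in which a learned threshold map T is combined with the probability map P as sigmoid(k * (P - T)) during training. The DetectionHead class, its layer layout, and the value k = 50 (the value used in the original DB paper) are assumptions for illustration, not the actual DetectionModel code from this PR.

```python
import torch
from torch import nn


class DetectionHead(nn.Module):
    """Hypothetical output head illustrating eval_only and training checks."""

    def __init__(self, in_channels: int, eval_only: bool = False, k: float = 50.0):
        super().__init__()
        self.k = k  # steepness of the approximate step function (assumed value)
        self.prob_head = nn.Conv2d(in_channels, 1, kernel_size=1)
        # The threshold branch only feeds the training loss, so its
        # parameters are not created at all when eval_only is set.
        self.thresh_head = (
            None if eval_only else nn.Conv2d(in_channels, 1, kernel_size=1)
        )

    def forward(self, features: torch.Tensor):
        prob = torch.sigmoid(self.prob_head(features))

        # At inference time only the probability map is needed.
        if not self.training or self.thresh_head is None:
            return prob

        thresh = torch.sigmoid(self.thresh_head(features))
        # Differentiable stand-in for the hard threshold (prob > thresh).
        approx_binary = torch.sigmoid(self.k * (prob - thresh))
        return prob, thresh, approx_binary


features = torch.randn(1, 4, 8, 8)

train_head = DetectionHead(in_channels=4)
train_head.train()
prob, thresh, approx_binary = train_head(features)

infer_head = DetectionHead(in_channels=4, eval_only=True)
infer_head.eval()
prob_only = infer_head(features)
```

With eval_only set, the head carries half as many parameters in this sketch, and the forward pass returns after the probability map without computing the threshold or approximate binary maps.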

Below are some comparison images of 7a22d8bfc431806736311b6b6a9624cf9d931695 (+erosion, +data augmentation, +higher res mask, -db) vs 0f673888c5e6e150dd556199a47e8d4d4b4cec23 (+erosion, +data augmentation, +higher res mask, +db). The main difference to note is that sequences of words are less often merged into a single region in the version with DB.

There is a downside to adding DB: the loss starts out much higher and takes more epochs to come down to a low value (around 0.10). It might be worth exploring ways to speed up convergence, or other ways to increase separation around the boundaries of text elements.

Without differentiable binarization:

With differentiable binarization:

Without differentiable binarization (2):

With differentiable binarization (2):

robertknight commented 9 months ago

Improving recognition and layout accuracy is currently the main focus, so this will be revisited later.