Implement two changes to make it easier to segment individual words and lines out of the text mask. With the initial U-Net model, adjacent words and lines were prone to running together, making them difficult to separate. The plan is to feed the detection results into a model which processes text lines, so the main separation that matters is between adjacent lines.
Use opencv-python to erode the masks for each word in the target mask. This helps create more clearly defined boundaries between adjacent text instances.
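For illustration, a minimal sketch of how per-word erosion could be applied when building the target mask. The helper name, kernel size, and iteration count are assumptions for the example, not the repo's actual values.

```python
import cv2
import numpy as np

def erode_word_masks(word_masks: list[np.ndarray], kernel_size: int = 3) -> np.ndarray:
    """Erode each word's binary mask, then merge them into a single target mask.

    Hypothetical example: shrinking every word region slightly keeps adjacent
    words from touching in the combined mask.
    """
    kernel = np.ones((kernel_size, kernel_size), dtype=np.uint8)
    target = np.zeros_like(word_masks[0])
    for word_mask in word_masks:
        eroded = cv2.erode(word_mask, kernel, iterations=1)
        target = np.maximum(target, eroded)
    return target
```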
Add differentiable binarization (https://arxiv.org/abs/1911.08947). Not everything from the paper is implemented, just the core elements: the separate probability and threshold heads, and the differentiable binarization function.
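The binarization step from the paper computes an approximate binary map from the probability map P and threshold map T as 1 / (1 + exp(-k(P - T))), with k = 50 in the paper. A sketch of that function (the function name is mine, not the repo's):

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    # sigmoid(k * (P - T)) is equivalent to 1 / (1 + exp(-k * (P - T)));
    # k controls how closely this approximates a hard threshold.
    return torch.sigmoid(k * (prob_map - thresh_map))
```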
Adding DB introduces parameters and computations into the model which are only needed at training time. To optimize inference speed, an eval_only option has been added to DetectionModel which controls whether training-only parameters are added, and module.training checks have been added in the forward pass to skip unnecessary computation during inference.
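An illustrative sketch of that pattern, not the actual DetectionModel internals: the threshold head is only constructed when eval_only is false, and self.training gates the extra work so inference only runs the probability head.

```python
import torch
from torch import nn

class DetectionHeadSketch(nn.Module):
    def __init__(self, in_channels: int, eval_only: bool = False):
        super().__init__()
        # Probability head is always needed; the threshold head is only
        # used to compute the DB binary map during training.
        self.prob_head = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.thresh_head = None if eval_only else nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features: torch.Tensor):
        prob = torch.sigmoid(self.prob_head(features))
        if self.training and self.thresh_head is not None:
            thresh = torch.sigmoid(self.thresh_head(features))
            binary = torch.sigmoid(50.0 * (prob - thresh))  # DB approximate binary map
            return prob, thresh, binary
        # At inference time only the probability map is produced.
        return prob
```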
Below are some comparison images of 7a22d8bfc431806736311b6b6a9624cf9d931695 (+erosion, +data augmentation, +higher res mask, -db) vs 0f673888c5e6e150dd556199a47e8d4d4b4cec23 (+erosion, +data augmentation, +higher res mask, +db). The main difference to note is that sequences of words are less often combined into the same region in the version with DB.
One downside of adding DB is that the loss starts out much higher and takes more epochs to come down to a low value (around 0.10). It may be worth exploring ways to speed up convergence, or other ways to increase separation around the boundaries of text elements.
Without differentiable binarization:
With differentiable binarization:
Without differentiable binarization (2):
With differentiable binarization (2):