open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0
4.27k stars, 743 forks

Multiple words output are connected without space between them #813

Closed: EMRAN-SALEH-CORSEARCH closed this issue 2 years ago

EMRAN-SALEH-CORSEARCH commented 2 years ago

Hi,

I use FCENet ('FCE_CTW_DCNv2') as the text detector and Show, Attend and Read ('SAR') as the text recognizer.

When I run inference, 'FCE_CTW_DCNv2' groups adjacent words into a single text area (polygon). When the text recognizer then reads that polygon, the output words come out joined together. For example, if the text is "Hello World!", the recognizer outputs "HelloWorld!", a single connected string with no white space. How can we overcome this issue to get the right output?

gaotongxiao commented 2 years ago

Pretrained recognizers in MMOCR are trained on single words with no spaces inside, so they cannot generate the space character at inference time. A workaround is to use a different detection model that tightly bounds individual words, and then apply our util https://mmocr.readthedocs.io/en/latest/api.html#mmocr.utils.stitch_boxes_into_lines as a post-processing step.
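To illustrate the idea behind that post-processing step, here is a minimal sketch that joins word-level results into lines separated by spaces. It is not MMOCR's `stitch_boxes_into_lines` implementation; the function name `join_words_into_lines`, the `text`/`bbox` keys, and the two thresholds are all assumptions made for this example.

```python
from typing import Dict, List


def join_words_into_lines(words: List[Dict],
                          max_x_gap: float = 20.0,
                          min_y_overlap: float = 0.5) -> List[str]:
    """Group word boxes lying on the same text line and join them with spaces.

    Each word is a dict with hypothetical keys:
      'text': the recognized word
      'bbox': (x_min, y_min, x_max, y_max) in pixels
    """
    lines: List[List[Dict]] = []
    # Visit words left to right so each line grows in reading order.
    for word in sorted(words, key=lambda w: w["bbox"][0]):
        x0, y0, x1, y1 = word["bbox"]
        for line in lines:
            _, ly0, lx1, ly1 = line[-1]["bbox"]
            overlap = min(y1, ly1) - max(y0, ly0)
            shortest = min(y1 - y0, ly1 - ly0)
            same_row = shortest > 0 and overlap / shortest >= min_y_overlap
            close_enough = x0 - lx1 <= max_x_gap
            if same_row and close_enough:
                line.append(word)
                break
        else:
            # No existing line accepted this word, so it starts a new one.
            lines.append([word])
    # Order the lines top to bottom by the first word's y_min.
    lines.sort(key=lambda line: line[0]["bbox"][1])
    return [" ".join(w["text"] for w in line) for line in lines]


words = [
    {"text": "World!", "bbox": (60, 10, 120, 30)},
    {"text": "Hello", "bbox": (0, 10, 50, 30)},
    {"text": "Bye", "bbox": (0, 50, 40, 70)},
]
print(join_words_into_lines(words))
```

This only works when the detector returns one box per word, which is why the workaround requires a detector that bounds single words tightly.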

xinke-wang commented 2 years ago

If you do not want to change the detection results or train another detector, another option is to use an off-the-shelf word segmentation tool such as 'wordsegment' as a post-processing step to split the recognition results (e.g. input 'HelloWorld!' -> output ['Hello', 'World', '!']).
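The splitting step can be sketched with a small dictionary-based segmenter. This is a simplified stand-in written for illustration, not the 'wordsegment' package itself (whose output format may differ); the function name `segment` and the tiny vocabulary are assumptions.

```python
from typing import List, Optional, Set


def segment(text: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split `text` into pieces whose lowercase form is in `vocab`.

    Returns None when no complete split exists. Assumes lowercasing does
    not change the string length (true for ASCII input).
    """
    lowered = text.lower()
    n = len(text)
    # last_start[i] is the start index of the final piece in a valid split
    # of text[:i], or None if text[:i] cannot be split.
    last_start: List[Optional[int]] = [None] * (n + 1)
    last_start[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            if last_start[start] is not None and lowered[start:end] in vocab:
                last_start[end] = start
                break
    if last_start[n] is None:
        return None
    # Walk back from the end to recover the pieces.
    pieces: List[str] = []
    end = n
    while end > 0:
        start = last_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


vocab = {"hello", "world", "!"}  # hypothetical toy vocabulary
print(segment("HelloWorld!", vocab))
print(segment("Xyzzy", vocab))
```

The second call returns `None` because the string cannot be built from the vocabulary, which is the limitation of this approach for text that is not made of dictionary words.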

EMRAN-SALEH-CORSEARCH commented 2 years ago

Yeah, I do not want to change the detector. 'FCE_CTW_DCNv2' gives me the best results.

So the alternative is to train another recognizer on text that includes white space, right? And I should represent the white space with a token that is not in the current dictionary.

Word segmentation techniques would not be the best option if the words are not from the English dictionary. Thank you!

gaotongxiao commented 2 years ago

Sounds good as long as you have sufficient data for recognition containing white spaces.
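A minimal sketch of the label handling discussed above: the space is mapped to a dedicated token that is appended to the character dictionary, so training labels can carry it and decoded predictions can restore it. This is not MMOCR's dictionary format or API; `SPACE_TOKEN`, `build_dictionary`, `encode`, and `decode` are hypothetical names used only for illustration.

```python
from typing import Dict, List

SPACE_TOKEN = "<SPC>"  # hypothetical placeholder for the space character


def build_dictionary(chars: str) -> List[str]:
    """Return the unique non-space characters in order, plus the space token."""
    tokens = [c for c in dict.fromkeys(chars) if c != " "]
    tokens.append(SPACE_TOKEN)
    return tokens


def encode(text: str, token_to_idx: Dict[str, int]) -> List[int]:
    """Convert a label string into class indices, mapping ' ' to SPACE_TOKEN."""
    return [token_to_idx[SPACE_TOKEN if ch == " " else ch] for ch in text]


def decode(indices: List[int], tokens: List[str]) -> str:
    """Convert class indices back to text, restoring spaces."""
    return "".join(" " if tokens[i] == SPACE_TOKEN else tokens[i]
                   for i in indices)


tokens = build_dictionary("HeloWrd!")
token_to_idx = {tok: i for i, tok in enumerate(tokens)}
ids = encode("Hello World!", token_to_idx)
print(decode(ids, tokens))
```

As noted in the reply above, this only helps if the recognition training data actually contains enough samples with white space.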