microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License
20.2k stars 2.55k forks

[TrOCR] How to run inference on multiline text image #628

Closed mariababich closed 2 years ago

mariababich commented 2 years ago

Hello!

I am wondering how to run TrOCR on a whole image with a lot of text. The tutorials show how the model works with single-line images. When I tried to run it on an image with a lot of text, it did not work. How can the inference be scaled?

Thanks in advance, Mariia.

wolfshow commented 2 years ago

@mariababich TrOCR is designed for single-line text recognition. You need to use a text detector to get the text lines.

NielsRogge commented 2 years ago

Yes, you can combine TrOCR with CRAFT for instance:
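A minimal sketch of such a pipeline, assuming `craft-text-detector` and `transformers` are installed. The model names, the reading-order helper, and the glue code are my own illustration, not code from either library:

```python
def read_multiline(image_path):
    """Detect text lines with CRAFT, then recognize each crop with TrOCR."""
    from craft_text_detector import Craft
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel
    from PIL import Image

    craft = Craft(output_dir=None, crop_type="box", cuda=False)
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

    image = Image.open(image_path).convert("RGB")
    boxes = craft.detect_text(image_path)["boxes"]
    lines = []
    for box in sort_reading_order(boxes):
        xs, ys = box[:, 0], box[:, 1]
        crop = image.crop((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
        pixel_values = processor(crop, return_tensors="pt").pixel_values
        ids = model.generate(pixel_values)
        lines.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
    craft.unload_craftnet_model()
    return "\n".join(lines)

def sort_reading_order(boxes):
    """Order detected quadrilaterals top-to-bottom, then left-to-right."""
    import numpy as np
    boxes = [np.asarray(b) for b in boxes]
    return sorted(boxes, key=lambda b: (b[:, 1].min(), b[:, 0].min()))
```

The ordering step matters: detectors return regions in no guaranteed order, so without it the stitched text can come out scrambled.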

nyck33 commented 1 year ago

@NielsRogge I just tried to use CRAFT, but it depends on torch < 1.0, which makes it impossible to use. So Bard recommended PaddleOCR. Please let me know what you think. My final goal is exactly this: OCR on multiline text, but my inputs are handwritten homework assignments for school kids.

NielsRogge commented 1 year ago

Hi @nyck33, you can try https://github.com/fcakyon/craft-text-detector, which is a packaged and more up-to-date version of CRAFT.

nyck33 commented 1 year ago

@NielsRogge thanks! It does look more up-to-date, but I was getting the model_urls error, so I referenced https://github.com/clovaai/CRAFT-pytorch/issues/191, tried downgrading torchvision to 0.13, and deleted those 2 lines. Now I'm getting:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 4
      1 craft = Craft(output_dir=output_dir, crop_type="poly", cuda=True)
      3 # apply craft text detection and export detected regions to output directory
----> 4 prediction_result = craft.detect_text(image_path)
      6 #unload models from ram/gpu
      7 craft.unload_craftnet_model()

File /mnt/d/chatgpt/ocr/craft-text-detector/craft_text_detector/__init__.py:131, in Craft.detect_text(self, image, image_path)
    128     image = image_path
    130 # perform prediction
--> 131 prediction_result = get_prediction(
    132     image=image,
    133     craft_net=self.craft_net,
    134     refine_net=self.refine_net,
    135     text_threshold=self.text_threshold,
    136     link_threshold=self.link_threshold,
    137     low_text=self.low_text,
    138     cuda=self.cuda,
    139     long_size=self.long_size,
    140 )
    142 # arange regions
    143 if self.crop_type == "box":
...
--> 415         polys = np.array(polys)
    416         for k in range(len(polys)):
    417             if polys[k] is not None:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (31,) + inhomogeneous part.

That was for the basic usage example in that repo; for the advanced one I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 24
     21 craft_net = load_craftnet_model(cuda=True)
     23 # perform prediction
---> 24 prediction_result = get_prediction(
     25     image=image,
     26     craft_net=craft_net,
     27     refine_net=refine_net,
     28     text_threshold=0.7,
     29     link_threshold=0.4,
     30     low_text=0.4,
     31     cuda=True,
     32     long_size=1280
     33 )
     35 # export detected text regions
     36 exported_file_paths = export_detected_regions(
     37     image=image,
     38     regions=prediction_result["boxes"],
     39     output_dir=output_dir,
     40     rectify=True
     41 )

File /mnt/d/chatgpt/ocr/craft-text-detector/craft_text_detector/predict.py:91, in get_prediction(image, craft_net, refine_net, text_threshold, link_threshold, low_text, cuda, long_size, poly)
     89 # coordinate adjustment
...
--> 415         polys = np.array(polys)
    416         for k in range(len(polys)):
    417             if polys[k] is not None:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (31,) + inhomogeneous part.
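For anyone hitting this: the ValueError comes from NumPy 1.24+, which turned implicit creation of ragged arrays from a deprecation warning into a hard error, and CRAFT's polygons can have different vertex counts. A minimal illustration of the failure mode and the `dtype=object` workaround one could try on the `np.array(polys)` call in `predict.py` (my assumption, not an official fix):

```python
import numpy as np

# CRAFT returns one polygon per detected region; curved or merged regions can
# have different vertex counts, so the list of polygons is "ragged".
polys = [np.zeros((4, 2)), np.zeros((6, 2))]

# On NumPy >= 1.24, np.array(polys) raises
# "ValueError: setting an array element with a sequence ... inhomogeneous shape".
# An explicit object dtype keeps the old list-of-arrays behaviour.
polys_arr = np.array(polys, dtype=object)
```

Pinning `numpy<1.24` would be the other obvious workaround.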
nyck33 commented 1 year ago

I'll note that I tried out a bunch of tools, and KerasOCR was so far the best at drawing bounding boxes around handwritten text images. I also tried Donut on Hugging Face, but the results were disappointing.

bit-scientist commented 1 year ago

Hi @nyck33, I am working on exactly the same project as you. Could you share your recent insights as to which handwritten text detector worked best for your images? I'd appreciate your help. Thank you!

nyck33 commented 1 year ago

You won't like my answer, but since it's part of an app, I went with Cloud Vision on GCP. ChatGPT wrote my code to make the API calls.
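For reference, a minimal sketch of such a call with the official `google-cloud-vision` client; credential setup and error handling are simplified, and this is my illustration, not the code from the app:

```python
def detect_handwriting(path):
    """OCR an image file with Google Cloud Vision's document text detection."""
    # Assumes google-cloud-vision is installed and
    # GOOGLE_APPLICATION_CREDENTIALS points at a service-account key.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())

    # document_text_detection is the variant recommended for dense or
    # handwritten text; text_detection targets sparse text in photos.
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text
```

Note that the API does detection and recognition in one call, so no separate line detector is needed.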


bit-scientist commented 1 year ago

Oh, I see, thanks @nyck33. Are you using Cloud Vision for text detection only, or for both detection and recognition? How is it doing in terms of CER?

anandhuh1234 commented 8 months ago

I've trained a YOLOv5 model specifically for detecting both handwritten and printed text. After that, I extract the detected handwritten lines from the image and forward them to TrOCR for recognition.
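A sketch of that detect-then-recognize flow, assuming a custom YOLOv5 checkpoint loaded through `torch.hub`; the weights filename is a placeholder and the glue code is my own illustration, not the commenter's:

```python
def recognize_handwritten_lines(image_path, weights="handwriting_yolov5.pt"):
    """Detect text lines with a custom YOLOv5 model, then OCR each with TrOCR."""
    # "handwriting_yolov5.pt" is a hypothetical custom-trained checkpoint.
    import torch
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    detector = torch.hub.load("ultralytics/yolov5", "custom", path=weights)
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

    image = Image.open(image_path).convert("RGB")
    # results.xyxy[0] rows are (x1, y1, x2, y2, confidence, class).
    detections = detector(image).xyxy[0]
    texts = []
    for x1, y1, x2, y2, conf, cls in detections.tolist():
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        pixel_values = processor(crop, return_tensors="pt").pixel_values
        ids = model.generate(pixel_values)
        texts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
    return texts
```

Filtering by the class column would let you route only the handwritten detections to TrOCR, as described above.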

myhub commented 8 months ago

I think that with some extra work TrOCR can also be used for multiline text images, based on my experiments in crnn_for_text_with_multiple_lines. To make TrOCR suitable for multiline text images, one needs to:

Multiline text also means you need far more training samples than for single-line text. The input image and output sequence will also be larger, which means you need many more GPUs to do the work.

In some situations text-line detection is hard, e.g. for curved text, so I think it is meaningful to train a multiline version of TrOCR that reduces the need for text-line detection.