@mariababich TrOCR is designed for single-line text recognition. You need to use a text detector to get the text lines first.
Yes, you can combine TrOCR with CRAFT for instance:
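A minimal sketch of such a pipeline, assuming the craft-text-detector package and the microsoft/trocr-base-handwritten checkpoint (the exported crop folder name is an assumption, so check output_dir after running):

```python
# Sketch: detect text regions with craft-text-detector, then recognize each crop with TrOCR.
import glob
from PIL import Image
from craft_text_detector import Craft
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# 1. Detect text regions and export each one as a cropped image.
craft = Craft(output_dir="crops", crop_type="box", cuda=False)
craft.detect_text("page.png")  # placeholder input image

# 2. Run TrOCR on every exported crop (each crop is roughly one text line).
#    The "page_crops" folder name is assumed; inspect output_dir to confirm.
for crop_path in sorted(glob.glob("crops/page_crops/crop_*.png")):
    pixel_values = processor(Image.open(crop_path).convert("RGB"), return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

craft.unload_craftnet_model()
craft.unload_refinenet_model()
```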
@NielsRogge I just tried to use CRAFT, but it seems to target torch < 1.0, which makes it unusable for me. Bard recommended PaddleOCR instead. Please let me know what you think. My final goal is exactly this: OCR on multiline text, where my inputs are handwritten homework assignments from school kids.
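For reference, PaddleOCR can be run as a detector only, so the crops could still be recognized with TrOCR. A rough sketch, assuming the paddleocr 2.x API and a placeholder file name:

```python
# Sketch: use PaddleOCR purely as a text detector (rec=False) and save each region.
import cv2
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")            # downloads detection weights on first use
image = cv2.imread("homework.png")    # placeholder path

# With rec=False the result is a list of quadrilaterals, one per detected text region.
boxes = ocr.ocr("homework.png", det=True, rec=False)[0]
for i, box in enumerate(boxes):
    xs = [int(p[0]) for p in box]
    ys = [int(p[1]) for p in box]
    crop = image[min(ys):max(ys), min(xs):max(xs)]
    cv2.imwrite(f"line_{i}.png", crop)
```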
Hi @nyck33 you can try https://github.com/fcakyon/craft-text-detector which is a packaged and more up-to-date version of CRAFT
@NielsRogge thanks! It does look more up-to-date, but I was getting the model_urls error, so I followed https://github.com/clovaai/CRAFT-pytorch/issues/191: I tried downgrading torchvision to 0.13 and deleting those two lines, and now I'm getting
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 4
1 craft = Craft(output_dir=output_dir, crop_type="poly", cuda=True)
3 # apply craft text detection and export detected regions to output directory
----> 4 prediction_result = craft.detect_text(image_path)
6 #unload models from ram/gpu
7 craft.unload_craftnet_model()
File /mnt/d/chatgpt/ocr/craft-text-detector/craft_text_detector/__init__.py:131, in Craft.detect_text(self, image, image_path)
128 image = image_path
130 # perform prediction
--> 131 prediction_result = get_prediction(
132 image=image,
133 craft_net=self.craft_net,
134 refine_net=self.refine_net,
135 text_threshold=self.text_threshold,
136 link_threshold=self.link_threshold,
137 low_text=self.low_text,
138 cuda=self.cuda,
139 long_size=self.long_size,
140 )
142 # arange regions
143 if self.crop_type == "box":
...
--> 415 polys = np.array(polys)
416 for k in range(len(polys)):
417 if polys[k] is not None:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (31,) + inhomogeneous part.
That was with the basic usage example from that repo; with the advanced usage example I get:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[6], line 24
21 craft_net = load_craftnet_model(cuda=True)
23 # perform prediction
---> 24 prediction_result = get_prediction(
25 image=image,
26 craft_net=craft_net,
27 refine_net=refine_net,
28 text_threshold=0.7,
29 link_threshold=0.4,
30 low_text=0.4,
31 cuda=True,
32 long_size=1280
33 )
35 # export detected text regions
36 exported_file_paths = export_detected_regions(
37 image=image,
38 regions=prediction_result["boxes"],
39 output_dir=output_dir,
40 rectify=True
41 )
File /mnt/d/chatgpt/ocr/craft-text-detector/craft_text_detector/predict.py:91, in get_prediction(image, craft_net, refine_net, text_threshold, link_threshold, low_text, cuda, long_size, poly)
89 # coordinate adjustment
...
--> 415 polys = np.array(polys)
416 for k in range(len(polys)):
417 if polys[k] is not None:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (31,) + inhomogeneous part.
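That ValueError looks like the NumPy >= 1.24 behaviour change rather than a CRAFT-specific bug: the detected polygons can have different numbers of points, and newer NumPy refuses to build an array from such a ragged list unless it is an object array. A possible workaround (an assumption, not a verified fix for this repo) is to pin numpy below 1.24, or to patch the np.array(polys) call the traceback points at so it passes dtype=object:

```python
import numpy as np

# Detected polygons can have different numbers of points, so the list is "ragged".
polys = [np.zeros((4, 2)), np.zeros((6, 2))]

# np.array(polys) raises exactly this ValueError on NumPy >= 1.24;
# adding dtype=object (the assumed patch for the line in the traceback) keeps it working.
polys = np.array(polys, dtype=object)
print(polys.shape)  # (2,)
```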
I'll note that I tried out a bunch of options, and keras-ocr was so far the best at drawing bounding boxes around handwritten text images. I also tried Donut on Hugging Face, but the results were disappointing.
Hi @nyck33, I am working on exactly the same kind of project as you. Could you share your recent insights on which handwritten text detector worked best for your images? I'd appreciate your help. Thank you!
You won't like my answer, but since it's part of an app, I went with Cloud Vision on GCP. ChatGPT wrote my code to make the API calls.
Oh, I see, thanks @nyck33. Are you using Cloud Vision for text detection only, or for both detection and recognition? How is it doing in terms of CER?
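For reference, Cloud Vision's document_text_detection endpoint returns detection and recognition together. A minimal sketch, assuming the google-cloud-vision Python client with credentials already configured and a placeholder file name:

```python
# Sketch: one Cloud Vision request returns both layout (blocks/paragraphs/words)
# and the recognized multiline text.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("homework.png", "rb") as f:        # placeholder file name
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)    # full recognized (multiline) text
for page in response.full_text_annotation.pages:
    for block in page.blocks:
        print(block.bounding_box)             # detection geometry per block
```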
I've trained a YOLOv5 model specifically for detecting both handwritten and printed text. After that, I extract the identified handwritten lines from the image and forward them to TrOCR for processing.
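A rough sketch of what that pipeline could look like, assuming a custom YOLOv5 checkpoint (best.pt) loaded via torch.hub and class index 0 meaning "handwritten"; both the checkpoint name and the class mapping are assumptions, not details from the original comment:

```python
# Sketch: detect handwritten line boxes with a custom YOLOv5 model, then recognize each crop with TrOCR.
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

detector = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")  # assumed checkpoint
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
trocr = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("homework.png").convert("RGB")
detections = detector(image).xyxy[0]  # columns: x1, y1, x2, y2, confidence, class

for x1, y1, x2, y2, conf, cls in detections.tolist():
    if int(cls) != 0:                 # skip printed-text boxes (assumed class mapping)
        continue
    line = image.crop((int(x1), int(y1), int(x2), int(y2)))
    pixel_values = processor(line, return_tensors="pt").pixel_values
    ids = trocr.generate(pixel_values)
    print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```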
I think that with some extra work TrOCR can also be used on multiline text images. Based on my experiments with crnn_for_text_with_multiple_lines, to make TrOCR suitable for multiline text images one needs to:
Multiline text also means you need far more training samples than single-line text does. The input images and output sequences will also be larger, which means you need considerably more GPU capacity to do the work.
In some situations text-line detection is hard, e.g. for curved text, so I think it is worthwhile to train a multiline version of TrOCR, which would reduce the need for text-line detection.
Hello!
I am wondering how to run TrOCR on a whole image containing a lot of text. The tutorials show how the model works on single-line images. When I tried to run it on an image with a lot of text, it did not work. How can the inference be scaled?
Thanks in advance, Mariia.