microsoft / OmniParser

A simple screen parsing tool towards pure vision based GUI agent
Creative Commons Attribution 4.0 International

easyocr is useless in demo, why use it? #5

Open MOSSV2 opened 1 month ago

nmstoker commented 1 month ago

Do you have an alternative recommendation?

aliencaocao commented 1 month ago

I agree that it is not the best OCR solution; it also often gets the text wrong for me. I recommend PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md

EDIT: Added a comparison using my own example.

EasyOCR result:

Text Box ID 0: DStA
Text Box ID 1: Notice
Text Box ID 2: Lecture
Text Box ID 3: Wargame
Text Box ID 4: Scoreboard
Text Box ID 5: Support
Text Box ID 6: aliencaocao
Text Box ID 7: Logout
Text Box ID 8: SCOREBOARD
Text Box ID 9: Find out how others are doing
Text Box ID 10: My Ranking
Text Box ID 11: aliencaocao
Text Box ID 12: Rookie441
Text Box ID 13: Cxo
Text Box ID 14: solved
Text Box ID 15: solved
Text Box ID 16: solved
Text Box ID 17: 6925 pts
Text Box ID 18: 6525 pts
Text Box ID 19: 6325 pts
Text Box ID 20: Rank
Text Box ID 21: Nickname
Text Box ID 22: Solved
Text Box ID 23: Points
Text Box ID 24: Updated
Text Box ID 25: Jordan
Text Box ID 26: 6125
Text Box ID 27: 2022-05-28 12.54.24
Text Box ID 28: 2022-05-27 00.38.23
Text Box ID 29: the_Itg
Text Box ID 30: 5785
Text Box ID 31: 05-27 17.33.57
Text Box ID 32: Korochi
Text Box ID 33: 5580
Text Box ID 34: 2022-05-27 12.48.46
Text Box ID 35: Ins1ght
Text Box ID 36: 2022-05-27 17.07:12
Text Box ID 37: Cryptsaria
Text Box ID 38: 2022-05-27 07.46.09
Text Box ID 39: 2022-05-27 14:13.34
Text Box ID 40: NOOTOOOT
Text Box ID 41: 5060
Text Box ID 42: 2022-05-28 10:18.19
Text Box ID 43: c457b49cb51
Text Box ID 44: 2022-05-27 00.48.05
Text Box ID 45: ItsKaiser
Text Box ID 46: 4620
Text Box ID 47: 2022-05-27 00.31:17
Text Box ID 48: Hot
Text Box ID 49: Olatr
Text Box ID 50: NDles
Text Box ID 51: Drscovery
Icon Box ID 52: the number 2.
Icon Box ID 53: the number 41.
Icon Box ID 54: the number 37.
Icon Box ID 55: the number 38.
Icon Box ID 56: the number 4258.
Icon Box ID 57: the number 3.
Icon Box ID 58: the number 45.
Icon Box ID 59: a dark theme.
Icon Box ID 60: a badge or award.
Icon Box ID 61: the word "Covid-19".
Icon Box ID 62: the number 49.
Icon Box ID 63: the number 49.
Icon Box ID 64: the number 6095.
Icon Box ID 65: the number 10.
Icon Box ID 66: the number 5562.
Icon Box ID 67: a loading or progress indicator.

PaddleOCR result:

Text Box ID 0: Notice
Text Box ID 1: Lecture
Text Box ID 2: Wargame
Text Box ID 3: Scoreboard
Text Box ID 4: Support
Text Box ID 5: O aliencaocao
Text Box ID 6: DSTA
Text Box ID 7: Logout
Text Box ID 8: Pefence Sy'agency
Text Box ID 9: SCOREBOARD
Text Box ID 10: Find out how others are doing.
Text Box ID 11: My Ranking
Text Box ID 12: aliencaocao
Text Box ID 13: Rookie441
Text Box ID 14: Cxo
Text Box ID 15:  55 solved
Text Box ID 16:  51 solved
Text Box ID 17:  50 solved
Text Box ID 18: 6925 pts
Text Box ID 19: 6525 pts
Text Box ID 20: 6325 pts
Text Box ID 21:  Rank
Text Box ID 22: Nickname
Text Box ID 23: Solved
Text Box ID 24: Points
Text Box ID 25: Updated
Text Box ID 26: 4
Text Box ID 27: Jordan
Text Box ID 28: 49
Text Box ID 29: 6125
Text Box ID 30: 2022-05-28 12:54:24
Text Box ID 31: 5
Text Box ID 32: covo
Text Box ID 33: 49
Text Box ID 34: 6095
Text Box ID 35: 2022-05-27 00:38:23
Text Box ID 36: 6
Text Box ID 37: the_Itg
Text Box ID 38: 49
Text Box ID 39: 5785
Text Box ID 40: 2022-05-27 17:33:57
Text Box ID 41: 7
Text Box ID 42: Korochi
Text Box ID 43: 45
Text Box ID 44: 5580
Text Box ID 45: 2022-05-27 12:48:46
Text Box ID 46: 8
Text Box ID 47: Ins1ght
Text Box ID 48: 45
Text Box ID 49: 5525
Text Box ID 50: 2022-05-27 17:07:12
Text Box ID 51: 9
Text Box ID 52: Cryptsaria
Text Box ID 53: 43
Text Box ID 54: 5425
Text Box ID 55: 2022-05-27 07:46:09
Text Box ID 56: 10
Text Box ID 57: dark
Text Box ID 58: 45
Text Box ID 59: 5325
Text Box ID 60: 2022-05-27 14:13:34
Text Box ID 61: 11
Text Box ID 62: NOOTOOOT
Text Box ID 63: 41
Text Box ID 64: 5060
Text Box ID 65: 2022-05-28 10:18:19
Text Box ID 66: 12
Text Box ID 67: c457b49cb51
Text Box ID 68: 38
Text Box ID 69: 4825
Text Box ID 70: 2022-05-27 00:48:05
Text Box ID 71: 13
Text Box ID 72: ItsKaiser
Text Box ID 73: 37
Text Box ID 74: 4620
Text Box ID 75: 2022-05-27 00:31:17
Icon Box ID 76: the number 2.
Icon Box ID 77: the number 3.
Icon Box ID 78: the logo of the Airports Authority of India.
Icon Box ID 79: a badge or award.
Icon Box ID 80: a loading or progress indicator.

EasyOCR clearly failed to detect some of the text boxes, so some were wrongly classified as icons by YOLO while others were left out entirely. This is especially true for the numbers on the webpage. PaddleOCR also runs on the CPU and is as fast as EasyOCR on my AMD 5800X. It also speeds things up considerably overall, because less text is misclassified as icons, which means fewer predict calls to the heavier image captioning model.
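To make the comparison above concrete, here is a minimal sketch that tallies text boxes and icon boxes from a listing in the format pasted in this thread. The function name `count_boxes` and the regular expression are illustrative assumptions, not part of OmniParser.

```python
import re
from collections import Counter

# Matches lines such as "Text Box ID 12: aliencaocao" or "Icon Box ID 52: the number 2."
BOX_LINE = re.compile(r"^(Text|Icon) Box ID (\d+):\s?(.*)$")


def count_boxes(listing: str) -> Counter:
    """Count how many text boxes and icon boxes appear in a pasted listing."""
    counts = Counter()
    for line in listing.splitlines():
        match = BOX_LINE.match(line.strip())
        if match:
            counts[match.group(1).lower()] += 1
    return counts


sample = """Text Box ID 0: Notice
Text Box ID 1: Lecture
Icon Box ID 2: the number 2.
"""
print(count_boxes(sample))
```

Applied to the two listings above, the EasyOCR run has 52 text boxes and 16 icon boxes, while the PaddleOCR run has 76 text boxes and 5 icon boxes. Each icon box goes through the captioning model, so the second run needs 5 caption calls instead of 16.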

nmstoker commented 1 month ago

Thanks @aliencaocao. It looks like integrating a different OCR into utils.py only needs modest changes, e.g. here:

https://github.com/microsoft/OmniParser/blob/d7708e830a2f16a40a8970211a9911fa776d4aa2/utils.py#L375

Is that correct and was it straightforward when you switched out EasyOCR for the demo above?

aliencaocao commented 1 month ago

Yes, I can make a PR if you wish. I already have it implemented in my fork: https://github.com/aliencaocao/OmniParser/commit/a3fe0e11c5b67a6ed8de916cdd77ddffba4d135f
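For readers wondering what such a swap involves, here is a rough sketch of the glue step; it is not the code from the fork linked above. It assumes the OCR engine returns detections as `[quad, (text, score)]` pairs, where `quad` is four `[x, y]` corner points (the layout PaddleOCR 2.x is commonly described as returning for one image), and that the rest of the pipeline wants axis-aligned `(x1, y1, x2, y2)` boxes. The function name `quads_to_xyxy` and the `min_score` cutoff are hypothetical.

```python
from typing import List, Sequence, Tuple

Quad = Sequence[Sequence[float]]          # four [x, y] corner points
Detection = Tuple[Quad, Tuple[str, float]]  # (quad, (text, confidence))
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def quads_to_xyxy(
    detections: Sequence[Detection], min_score: float = 0.5
) -> Tuple[List[str], List[Box]]:
    """Convert quad-style OCR detections into texts plus axis-aligned boxes.

    Detections below min_score are dropped (the cutoff value is an assumption).
    """
    texts: List[str] = []
    boxes: List[Box] = []
    for quad, (text, score) in detections:
        if score < min_score:
            continue
        xs = [point[0] for point in quad]
        ys = [point[1] for point in quad]
        texts.append(text.strip())
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return texts, boxes


# Synthetic detections for illustration only; real ones would come from the OCR engine.
fake_detections = [
    ([[10, 5], [60, 5], [60, 20], [10, 20]], (" 49", 0.98)),
    ([[0, 0], [4, 0], [4, 4], [0, 4]], ("noise", 0.20)),
]
print(quads_to_xyxy(fake_detections))
```

Keeping the conversion separate from the OCR call means the same downstream code can be fed by either engine, as long as each engine's output is first mapped to this shape.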

nmstoker commented 1 month ago

Thanks for sharing that!

I'll try it out.

Incidentally, I agree that switching to a better OCR seems best. But for anyone after a quick fix in the Gradio demo: I found that lowering text_threshold in easyocr_args from 0.9 to around 0.7 reduced a fair amount of missing or prematurely split text.

https://github.com/microsoft/OmniParser/blob/d7708e830a2f16a40a8970211a9911fa776d4aa2/gradio_demo.py#L68
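As a sketch of that quick fix: the helper below copies an argument dict and lowers `text_threshold`, then passes the result to EasyOCR's `readtext`. The helper names and the contents of `DEMO_EASYOCR_ARGS` are assumptions for illustration (only the 0.9 value comes from the comment above); `easyocr.Reader` and the `text_threshold` keyword of `readtext` are part of EasyOCR.

```python
from typing import Any, Dict

# Assumed starting point: the demo passes text_threshold=0.9 to EasyOCR.
DEMO_EASYOCR_ARGS: Dict[str, Any] = {"text_threshold": 0.9}


def with_text_threshold(args: Dict[str, Any], threshold: float) -> Dict[str, Any]:
    """Return a copy of args with text_threshold replaced; the input is not modified."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("text_threshold must be between 0 and 1")
    return {**args, "text_threshold": threshold}


def read_text(image_path: str, easyocr_args: Dict[str, Any]):
    """Run EasyOCR on one image with the given keyword arguments."""
    import easyocr  # imported lazily so the helper above works without EasyOCR installed

    reader = easyocr.Reader(["en"])
    return reader.readtext(image_path, **easyocr_args)


relaxed_args = with_text_threshold(DEMO_EASYOCR_ARGS, 0.7)
print(relaxed_args)
```

A lower threshold accepts less confident text regions, which is why it can recover spans that were cut short, but it will not help where the detector misses the text entirely.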

aliencaocao commented 1 month ago

Setting it to 0.7 or 0.5 did not improve things for this particular example. I still think PaddleOCR is clearly the better alternative.

nmstoker commented 1 month ago

Yes, I was agreeing with you 🙂

The quick fix isn't going to work for everything, but it did slightly improve several of my images where the full width of a text span wasn't being captured.

In your image above this happens with the timestamps: in your earlier version, item 31 is missing the 2022, and five entries under the Solved and Points columns were missed, whereas above only one is missed. However, it's still not handling the entries in the Rank column well enough.

yadong-lu commented 1 month ago

> Yes, I can make a PR if you wish, I already have it impl in my fork: aliencaocao@a3fe0e1

Feel free to make a PR

aliencaocao commented 1 month ago

> Feel free to make a PR

PR made