How to extract text from image

Larbo53 commented 2 years ago

Hello, I would like to extract the supplier's information (the top-left block: Bourgeois Frères, etc.; see the attached image) from the attached PDF. What is the solution? OS: macOS Big Sur 11.6, Python: 3.9, PyMuPDF: 1.19.0

Thank you.

Sincerely

BOURGEOISFacture 21053886.pdf


JorjMcKie commented 2 years ago

The first problem is that this is not an image.

>>> import fitz
>>> doc=fitz.open("BOURGEOISFacture.21053886.pdf")
>>> page=doc[0]
>>> from pprint import pprint
>>> # no standard images there:
>>> pprint(page.get_images())
[]
>>> # and no inline images either:
>>> pprint(page.get_image_info())
[]

And you will have noticed that what looks like text is not text either.

>>> print(page.get_text(sort=True))  # "sort" ensures text visible at the top indeed comes first
BENOIT CASTEL MENILMONTANT 
Boulangerie Pâtisserie 
150 rue de Ménilmontant 
75020 PARIS 
06 21 08 29 68
Tél. Client :
FACTURE
/
1
... (more data) ...

So the apparent text must be encoded as drawing primitives: a capital "B", for instance, drawn as a vertical line | followed by two small semicircles stacked on top of each other. The only way to get at it is therefore to OCR this area. Let's find a suitable sub-rectangle: left of "BENOIT CASTEL MENILMONTANT" and above the first "FACTURE":

>>> rl1 = page.search_for("BENOIT CASTEL MENILMONTANT")
>>> len(rl1)
3
>>> rl2 = page.search_for("FACTURE")
>>> len(rl2)
2
>>> # sort both rectangle lists to be sure: vertical, then horizontal
>>> rl1.sort(key=lambda r: (r.y1, r.x0))
>>> rl2.sort(key=lambda r: (r.y1, r.x0))
>>> # right border is the left edge of the first rl1 rect:
>>> rborder = rl1[0].x0
>>> # bottom is top coord of first rect of rl2:
>>> bottom = rl2[0].y0
>>> # define sub rect to OCR:
>>> clip = fitz.Rect(0,0,rborder,bottom)
>>> clip
Rect(0.0, 0.0, 339.3900146484375, 220.2386474609375)
>>> # make a pixmap of that rect:
>>> pix = page.get_pixmap(dpi=300,clip=clip)
>>> # make a new 1 page PDF with OCRed text
>>> pdfbytes = pix.pdfocr_tobytes()
>>> ocrpdf = fitz.open("pdf", pdfbytes)
>>> ocrpage = ocrpdf[0]
>>> print(ocrpage.get_text())
Bourgeois Fréres S.A.S au Capital de 330 000 Euro
77510 VERDELOT
TEL
: 01 64 04 81 04
FAX
: 01 64 04 81 43
SITE INTERNET
: WWW.MOULINS-BOURGEOIS.COM
N° SIRET 746 050 087 00012
R.C. MEAUXB 746 050 087
CODE APE: 1061A
N° de TVA INTRACOMMUNAUTAIRE
: FR15746050087

>>>

JorjMcKie commented 2 years ago

Of course, the OCRed text could also have been extracted together with its coordinates, e.g. using ocrpage.get_text("dict"). Those coordinates are relative to the ocrpage's dimensions. If required, translate them back to positions on the original page using the matrix mat = ocrpage.rect.torect(clip).
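
To make the coordinate translation concrete, here is a minimal pure-Python sketch of what such a rectangle-to-rectangle matrix does: it scales and shifts one rectangle onto another. It deliberately avoids PyMuPDF objects so that it runs anywhere; the helper names (rect_to_rect_matrix, transform_point, transform_rect), the OCR page size and the sample bbox are all hypothetical and only serve as an illustration.

```python
# Pure-Python sketch. Rectangles are plain (x0, y0, x1, y1) tuples,
# not fitz.Rect objects; all helper names here are hypothetical.

def rect_to_rect_matrix(src, dst):
    """Return (a, b, c, d, e, f) that maps rectangle src onto rectangle dst."""
    sx = (dst[2] - dst[0]) / (src[2] - src[0])  # horizontal scale factor
    sy = (dst[3] - dst[1]) / (src[3] - src[1])  # vertical scale factor
    # shift so that the top-left corner of src lands on the top-left of dst
    return (sx, 0.0, 0.0, sy, dst[0] - src[0] * sx, dst[1] - src[1] * sy)


def transform_point(x, y, m):
    """Apply matrix m to the point (x, y)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_rect(rect, m):
    """Apply matrix m to both corners of an axis-parallel rectangle."""
    x0, y0 = transform_point(rect[0], rect[1], m)
    x1, y1 = transform_point(rect[2], rect[3], m)
    return (x0, y0, x1, y1)


# clip rectangle on the original page (rounded values from the session above)
clip = (0.0, 0.0, 339.39, 220.24)
# ASSUMPTION: size of the OCR page; the real value is ocrpage.rect
ocr_rect = (0.0, 0.0, 1414.0, 918.0)
# ASSUMPTION: bbox of one text span as it might be reported on the OCR page
span_bbox = (100.0, 50.0, 600.0, 90.0)

mat = rect_to_rect_matrix(ocr_rect, clip)
print(transform_rect(span_bbox, mat))  # the span's position on the original page
```

With real PyMuPDF objects the same idea should reduce to something like fitz.Rect(span["bbox"]) * mat, with mat taken from ocrpage.rect.torect(clip) as described above; treat that one-liner as an untested assumption and check it against your installed version.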

Larbo53 commented 2 years ago

Thanks a lot

Which package must be imported for this to work?

Sincerely



pymupdf / PyMuPDF-Utilities

How to extract text from image #32