miguelgarces123 opened 5 years ago
There are two problems here. One is that Tesseract OCR doesn't really understand tables and thinks of text on the same line as being related. The second problem is that the PDF specification does not have much in the way of markup – it doesn't know that there's tabular data as with HTML. It "sees" a drawing with text and line art.
It so happens I'm working with PDF tables a lot right now. If you're trying to extract data from tables to Excel you can try Tabula (fast, less accurate) or Camelot (very slow but more accurate). It can help to run ocrmypdf to do OCR first and you'll definitely want to deskew.
Generally tabular OCR requires some awareness of tables in the OCR engine. An interesting idea would be to do table segmentation assisted by Tabula or Camelot and then do OCR within each table cell. That would solve the fairly major issue of OCR mistaking table borders for text, but would struggle when text overlaps table borders.
ABBYY FineReader also does okay as a commercial option.
@jbarlow83 Thank you very much for your quick response.
I currently use Camelot to extract the cells right after running this project's OCR. But exactly what you describe happens: since the OCR reads horizontally, it mixes the contents of the cells.
I like your idea, but I have no experience with OpenCV or similar packages. Could you suggest a way to detect the cells of a table so that I can run Tesseract on each position?
Camelot already has a bag of tricks to find a table grid in an image (for framed tables): https://camelot-py.readthedocs.io/en/master/user/how-it-works.html#lattice
And it returns the coordinates of table cells:

```python
tlist = camelot.read_pdf(...)
cells = tlist[0].cells
print(cells)
```
Units are probably in PDF points (1/72").
Then an image of the page could be cropped to the size of each cell and sent for OCR, and the results grafted back to the PDF. That last step would look a lot like the procedure in `ocrmypdf._graft`. That would render text one cell at a time. Then run Camelot again and hope the extraction is cleaner. That might get better results, except in files where text crosses over cell boundaries.
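A minimal sketch of the per-cell OCR step might look like this (it assumes the cell's bounding box has already been converted to image pixels; that conversion is covered further down in the thread):

```python
import pytesseract
from PIL import Image

def ocr_cell(page_image: Image.Image, bbox_px) -> str:
    """OCR one table cell cropped from a rasterized page image.

    bbox_px is (left, top, right, bottom) in pixels, a hypothetical
    conversion of Camelot's cell coordinates.
    """
    cell = page_image.crop(bbox_px)
    # --psm 7 treats the crop as a single text line; adjust for multi-line cells
    return pytesseract.image_to_string(cell, config='--psm 7').strip()
```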
@jbarlow83 Perfect, you are a genius.
Excuse my ignorance, but how could I reassemble the PDF once I have managed to extract the cleanest text from the image?
It might work something like this:

- `pikepdf.Page(pdf.pages[0]).as_form_xobject()`
- `copy_foreign` to transfer the Form XObject to the destination PDF
- `cm` operator to set the coordinate system to the cell location on the destination page, and then draw.
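A rough, untested sketch of those three steps, assuming a recent pikepdf (the filenames, cell origin, and resource name here are placeholders, not a definitive recipe):

```python
import pikepdf

src = pikepdf.open('cell_ocr.pdf')   # hypothetical per-cell OCR result
dst = pikepdf.open('original.pdf')

# Turn the OCR'd page into a Form XObject and copy it into the target Pdf
xobj = dst.copy_foreign(pikepdf.Page(src.pages[0]).as_form_xobject())

dstpage = dst.pages[0]
# Register the XObject as a named resource on the destination page
name = dstpage.add_resource(xobj, pikepdf.Name.XObject)

# cm translates the coordinate system to the cell's lower-left corner
# (hypothetical values, in PDF points), then Do draws the Form XObject
x, y = 100, 500
dstpage.contents_add(f'q 1 0 0 1 {x} {y} cm {name} Do Q'.encode())

dst.save('merged.pdf')
```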
@jbarlow83 Thanks for the reply.
So I could do the following:
1. Transform the PDF with your package, OCRmyPDF.
2. Then go through the pages that have tables, convert them to images, and extract the positions of the cells.
3. Then use OpenCV to crop each grid cell and pass it through Tesseract.
4. Replace the result in the PDF with the help of the pikepdf package. (Here I am unsure how to replace existing content at a given position within the PDF.)
Do you think that approach is correct? Thanks for your time.
Yes, that's the idea. I stress that it is only an idea, and I don't know for sure whether it will yield improvements.
Here's an easier way that would be suitable for testing whether the idea has merit. Instead of cropping, flood-fill the page image to white everywhere except the current cell, run Tesseract in text-only PDF mode, then use `qpdf --underlay` to stack all of the individual "cell PDFs". That will be inefficient, but it is suitable for testing the idea.
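Something like this could drive that test (the filenames and bounding box are placeholders; `-c textonly_pdf=1` asks Tesseract for a text-only PDF):

```python
import subprocess
import pytesseract
from PIL import Image

def cell_to_text_pdf(page_image, bbox_px, out_path):
    """White out everything except one cell, then OCR to a text-only PDF."""
    masked = Image.new('RGB', page_image.size, 'white')
    masked.paste(page_image.crop(bbox_px), bbox_px[:2])
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        masked, extension='pdf', config='-c textonly_pdf=1')
    with open(out_path, 'wb') as f:
        f.write(pdf_bytes)

# Stack one cell PDF under the original page (repeat per cell)
subprocess.run(['qpdf', 'page.pdf', '--underlay', 'cell0.pdf', '--', 'out.pdf'],
               check=True)
```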
@jbarlow83 I already managed to get the coordinates of the cells from Camelot, but I don't know how to proceed with extracting the cells from the image to pass them through Tesseract. Do you have to resize the image to match the Camelot coordinates? Could you help me?
I think Camelot uses PDF units called points, which are 1/72 of an inch.
If you render the PDF page as an image, you can specify the dpi with Ghostscript (`gs -r 300 ...`).
Then you know the ratio: 300 pixels/inch ÷ 72 points/inch ≈ 4.17 pixels per point. So multiply your Camelot units by 4.17 to convert them to pixels.
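For example (this assumes Camelot reports PDF-style bottom-left-origin coordinates, which matches the workaround later in this thread; the y axis has to be flipped for image coordinates):

```python
DPI = 300
SCALE = DPI / 72  # ≈ 4.17 pixels per PDF point at 300 dpi

def cell_bbox_to_pixels(cell, image_height_px):
    """Map a Camelot cell bbox (points, bottom-left origin) to image
    pixel coordinates (top-left origin), assuming a DPI-dpi raster."""
    left = int(cell.x1 * SCALE)
    right = int(cell.x2 * SCALE)
    top = image_height_px - int(cell.y2 * SCALE)     # flip the y axis
    bottom = image_height_px - int(cell.y1 * SCALE)
    return left, top, right, bottom
```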
@jbarlow83 Hello!
I already managed to extract the clean text from each cell. I had some challenges with Camelot; if anyone runs into problems at this step, I can explain how I did it.
At the moment I am at the step where I must modify the PDF to replace the content of each cell with the corrected text. Could you shed some light on how I could do this with the pikepdf package?
That's where I was suggesting you could do it with `qpdf --underlay`.
For each cell: white-out the full page image except for the cell, send that image to Tesseract, and then "qpdf underlay" the result onto the output PDF.
(If you happen to be doing this for a commercial project and you have a budget for external support we could discuss a contract. Send an email to jim@purplerock.ca if you want to explore having me implement something for you. I'm happy to continue sketching a general idea here.)
I get it. I did it with cv2: after obtaining the coordinates of each cell with Camelot, I take the image that I already have in memory and extract the cell like this: `text = tesseract.image_to_string(image[y2:y1, x1:x2])`
I was thinking of editing the PDF and replacing the text at those coordinates. From what I have seen in some documentation, I need to know the ID of the object in order to modify it. Do you know how I could look it up? Or could a tool like ReportLab work? I tried using `drawText`.
Thank you for your interest. In a few hours I will send you what we are doing, so you can see it and tell me what you can imagine. This project is for internal use by the company, not for external consumption, so we are looking for something with a one-time payment.
Any ideas on how to modify the text at a specific position within a PDF?
You need a content stream parser that tracks the graphics state enough to determine when and where text was drawn. You can exploit the behavior of the Tesseract PDF generator to specifically look for text it generates, because it will select a font named "GlyphLess" and always render UTF-16BE encoded strings. It is easier to generate a file in the HOCR format, edit that, and convert to PDF, if the intent is to edit/postprocess OCR text.
It's very difficult to do fully general text editing, and not always possible because of subsetted fonts and the exotic/ancient font formats supported by PDF that are still sometimes generated.
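For a taste of what the content-stream side looks like, here is a minimal sketch that lists the text-showing operators using pikepdf's content stream parser (it does not track the graphics state, which a real editor would need to do):

```python
import pikepdf

pdf = pikepdf.open('ocr.pdf')  # hypothetical Tesseract output

for operands, operator in pikepdf.parse_content_stream(pdf.pages[0]):
    # Tj and TJ are the text-showing operators; a real editor would also
    # follow Tm/Td/cm to know *where* each string is drawn
    if str(operator) in ('Tj', 'TJ'):
        print(operator, operands)
```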
Hi @jbarlow83, how could you modify the text in an hOCR file? Do you know of any tools?
hocr-tools: https://github.com/tmbdev/hocr-tools
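As a sketch of the editing step: hOCR is just HTML, so any HTML parser works. Here BeautifulSoup is an assumption, and the word being corrected is hypothetical:

```python
from bs4 import BeautifulSoup

with open('page.hocr', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# hOCR marks each recognized word as <span class="ocrx_word"> with its
# bounding box in the title attribute; rewrite the text, keep the boxes
for word in soup.find_all('span', class_='ocrx_word'):
    if word.get_text(strip=True) == 'O0':   # hypothetical fix-up
        word.string = '00'

with open('page_fixed.hocr', 'w', encoding='utf-8') as f:
    f.write(str(soup))
```

hocr-tools (linked above) includes hocr-pdf, which can rebuild a searchable PDF from the corrected hOCR plus the page images.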
Are you still working on this? I have lots of tables inside PDFs and I want to OCR them. Adobe manages to OCR them, but it is quite slow. I also want to do it automatically.
Maybe this is helpful: https://stackoverflow.com/questions/59370642/how-to-extract-text-from-table-in-image
@thethinker990 I don't have anything open source.
Was the link not helpful? There has to be a way.
You must preprocess the image to remove the table lines and dots before throwing it into OCR. Here's an approach using OpenCV:

1. Load the image, convert to grayscale, and apply Otsu's threshold
2. Remove horizontal lines
3. Remove vertical lines
4. Dilate to connect text, and remove dots using contour area filtering
5. Bitwise-and to reconstruct the image
6. OCR
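A sketch of that pipeline (the kernel sizes and the contour-area threshold are guesses that need tuning per document):

```python
import cv2
import pytesseract

# 1. Load, grayscale, Otsu's threshold (text becomes white on black)
image = cv2.imread('table.png')                      # hypothetical input
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255,
                       cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# 2-3. Remove horizontal then vertical lines with long, thin kernels
for ksize in ((40, 1), (1, 40)):
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, ksize)
    lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)
    thresh = cv2.subtract(thresh, lines)

# 4. Dilate to connect characters, then drop tiny specks by contour area
dilate = cv2.dilate(thresh,
                    cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)),
                    iterations=1)
contours, _ = cv2.findContours(dilate, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    if cv2.contourArea(c) < 10:                      # threshold to tune
        cv2.drawContours(thresh, [c], -1, 0, -1)

# 5. Bitwise-and to reconstruct the image; paint removed areas white
result = cv2.bitwise_and(image, image, mask=thresh)
result[thresh == 0] = (255, 255, 255)

# 6. OCR
print(pytesseract.image_to_string(result, config='--psm 6'))
```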
And then map the letters back to the OCR layer. As a first step, it would already be helpful just to get the full text onto the right page.
Using the advice above, I made this workaround:
```python
import camelot
import pandas as pd
import pytesseract
from PIL import Image


def ocr_form(filename):
    tables = camelot.read_pdf(filename, line_scale=80, dpi=300)
    image = tables[0]._image[0]
    table_dfs = []
    for table in tables:
        table_df = []
        for row in table.cells:
            row_df = []
            for cell in row:
                bbox = [cell.x1, cell.y1, cell.x2, cell.y2]
                # I'm not sure how to properly calculate this coeff;
                # I just guessed and checked (it is close to 300/72 ≈ 4.17)
                bbox = [int(4.165 * coord) for coord in bbox]
                # Cut off a few pixels to avoid tesseract detecting form
                # lines; also flip the y axis from PDF (bottom-left origin)
                # to image (top-left origin) coordinates
                bbox = [bbox[0] + 5, image.shape[0] - bbox[1] - 5,
                        bbox[2] - 4, image.shape[0] - bbox[3] + 5]
                cell_image = image[bbox[3]:bbox[1], bbox[0]:bbox[2]]
                pil_image = Image.fromarray(cell_image.astype('uint8'), 'RGB')
                text = pytesseract.image_to_string(pil_image, config='--psm 6')
                row_df.append(text.strip())
            table_df.append(row_df)
        table_dfs.append(table_df)
    return [pd.DataFrame(table_df) for table_df in table_dfs]
```
The project does not work well with tables. Could something be done?