ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Pdf error with tables #446

Open · miguelgarces123 opened this issue 5 years ago

miguelgarces123 commented 5 years ago

The project does not work well with tables, could something be done?

(image attached)

jbarlow83 commented 5 years ago

There are two problems here. One is that Tesseract OCR doesn't really understand tables and thinks of text on the same line as being related. The second problem is that the PDF specification does not have much in the way of markup – it doesn't know that there's tabular data as with HTML. It "sees" a drawing with text and line art.

It so happens I'm working with PDF tables a lot right now. If you're trying to extract data from tables to Excel you can try Tabula (fast, less accurate) or Camelot (very slow but more accurate). It can help to run ocrmypdf to do OCR first and you'll definitely want to deskew.

Generally tabular OCR requires some awareness of tables in the OCR engine. An interesting idea would be to do table segmentation assisted by Tabula or Camelot and then do OCR within each table cell. That would solve the fairly major issue of OCR mistaking table borders for text, but would struggle when text overlaps table borders.

Abbyy FineReader also does okay as a commercial option.

miguelgarces123 commented 5 years ago

@jbarlow83 Thank you very much for your quick response.

I currently use Camelot to extract the cells right after running OCR with this project. But exactly what you say happens: since the OCR reads horizontally, it mixes the contents of the cells.

I like your idea, but I have no experience with OpenCV or similar packages. Could you suggest a way to detect the cells of a table so I can send each region to Tesseract?

jbarlow83 commented 5 years ago

Camelot already has a bag of tricks to find a table grid in an image (for framed tables): https://camelot-py.readthedocs.io/en/master/user/how-it-works.html#lattice

And it returns the coordinates of table cells...

tlist = camelot.read_pdf(...)
cells = tlist[0].cells
print(cells)

Units are probably in PDF points (1/72").

Then an image of the page could be cropped to the size of the cell and sent for OCR, and the results grafted back to the PDF. That last step would look a lot like the procedure in ocrmypdf._graft. That would render text one cell at a time. Then run Camelot again and hope the extraction is cleaner. That might get better results except in files where text crosses over cell boundaries.

miguelgarces123 commented 5 years ago

@jbarlow83 Perfect. You are a genius.

Excuse my ignorance: how could I reassemble the PDF once I have managed to extract the cleanest text from the image?

jbarlow83 commented 5 years ago

It might work something like this: crop the page image to each cell, run Tesseract on the crop, and graft the recognized text back onto the PDF page, much like the procedure in ocrmypdf._graft.

miguelgarces123 commented 5 years ago

@jbarlow83 Thanks for the reply.

I mean I could do the following:

1. Transform the PDF with your package, OCRmyPDF.
2. Then go through the pages that have tables, convert them into images, and extract the positions of the cells.
3. Then use OpenCV to crop the grid and pass each cell through Tesseract.
4. Replace the result in the PDF with the help of the pikepdf package. (Here I have a doubt: how do I replace existing content at a position within the PDF?)

Do you think that approach is correct? Thanks for your time.

jbarlow83 commented 5 years ago

Yes, that's the idea. I stress that it is just an idea, and I don't know for sure whether it will yield improvements.

Here's an easier way that would be suitable for testing if the idea has merit. Instead of cropping, flood fill the page image to white everywhere except the current cell, run Tesseract in text only PDF mode, then use qpdf --underlay to stack all of the individual "cell PDFs". That will be inefficient but suitable for testing the idea.
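To make the stacking step concrete, here is a minimal sketch that only builds the qpdf invocations, assuming the `--underlay` syntax `qpdf in.pdf --underlay layer.pdf -- out.pdf`; the helper names and intermediate file names are mine, not part of any library:

```python
def underlay_command(base_pdf, cell_pdf, out_pdf):
    """Build the qpdf invocation that stacks a one-cell text-only PDF
    under the current base PDF (per qpdf's --underlay option)."""
    return ["qpdf", base_pdf, "--underlay", cell_pdf, "--", out_pdf]

def stack_cell_pdfs(page_pdf, cell_pdfs):
    """Yield the sequence of qpdf commands that fold each cell PDF into
    the page, one underlay at a time. Inefficient, but fine for testing
    whether per-cell OCR actually helps."""
    current = page_pdf
    for i, cell_pdf in enumerate(cell_pdfs):
        out = f"stacked_{i}.pdf"
        yield underlay_command(current, cell_pdf, out)
        current = out

# Each cell PDF would come from Tesseract's text-only PDF mode on a
# whited-out page image, e.g.:
#   tesseract cell_0.png cell_0 -c textonly_pdf=1 pdf
# and each command could then be executed with subprocess.run(cmd, check=True).
```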

miguelgarces123 commented 5 years ago

@jbarlow83 I already have the coordinates of the cells from Camelot, but I don't know how to proceed with extracting the cells from the image to pass them through Tesseract. Do you have to resize the image to match the Camelot coordinates? Could you help me?

jbarlow83 commented 5 years ago

I think Camelot uses PDF units called points, which are 1/72 of an inch.

If you render the PDF page as an image you can specify the dpi with Ghostscript (gs -r 300...). Then you know the ratio: 300 pixels/inch ÷ 72 points/inch ≈ 4.17 pixels/pt. So multiply your Camelot units by about 4.17 to convert to pixels.
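A sketch of that conversion, including the y-axis flip needed because PDF coordinates grow up from the bottom while image rows grow down from the top; the function name is mine, not part of Camelot:

```python
def cell_to_pixel_box(cell_bbox, page_height_pt, dpi=300):
    """Convert a Camelot cell bbox (x1, y1, x2, y2) in PDF points
    (origin bottom-left) to a PIL-style crop box (left, upper, right,
    lower) in pixels (origin top-left)."""
    scale = dpi / 72.0  # PDF points are 1/72 inch
    x1, y1, x2, y2 = cell_bbox
    left = round(x1 * scale)
    right = round(x2 * scale)
    # Flip the y axis: PDF measures up from the bottom, images down from the top.
    upper = round((page_height_pt - y2) * scale)
    lower = round((page_height_pt - y1) * scale)
    return (left, upper, right, lower)

# Example: a 1-inch-square cell one inch from the bottom-left corner of
# a US Letter page (792 pt tall), rendered at 300 dpi:
box = cell_to_pixel_box((72, 72, 144, 144), page_height_pt=792, dpi=300)
# box == (300, 2700, 600, 3000)
```

The resulting box can be passed straight to `PIL.Image.crop` and the crop handed to Tesseract.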

miguelgarces123 commented 5 years ago

@jbarlow83 Hello!

I already managed to extract the clean text from each cell. I had some challenges with Camelot; if anyone has a problem with this step, I can explain how I did it.

At this moment I am at the step where I must modify the PDF to replace the content of each cell with the correct text. Could you shed some light on how I could do this with the pikepdf package?

jbarlow83 commented 5 years ago

That's where I was suggesting you could do it with qpdf --underlay.

For each cell: white-out the full page image except for the cell, send that image to Tesseract, and then "qpdf underlay" the result onto the output PDF.
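The white-out step amounts to painting four rectangles around the cell. A small sketch, assuming pixel coordinates with the image origin at the top-left; with Pillow, each returned rectangle could be filled via `ImageDraw.Draw(img).rectangle(r, fill="white")`:

```python
def whiteout_regions(page_w, page_h, cell_box):
    """Return the rectangles (left, upper, right, lower) that should be
    painted white so that only `cell_box` remains visible.
    All values are in pixels, image origin at the top-left."""
    left, upper, right, lower = cell_box
    return [
        (0, 0, page_w, upper),         # strip above the cell
        (0, lower, page_w, page_h),    # strip below the cell
        (0, upper, left, lower),       # strip left of the cell
        (right, upper, page_w, lower), # strip right of the cell
    ]
```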

(If you happen to be doing this for a commercial project and you have a budget for external support we could discuss a contract. Send an email to jim@purplerock.ca if you want to explore having me implement something for you. I'm happy to continue sketching a general idea here.)

miguelgarces123 commented 5 years ago

I get it. I did it with cv2: after obtaining the coordinates of each cell with Camelot, I take the image that I already have in memory and extract the cell this way: text = tesseract.image_to_string(image[y2:y1, x1:x2])

I was thinking of editing the PDF and replacing the text at those coordinates. As I have seen in some documentation, I must know the ID of the object to modify it; do you know how I could find it? Or could a tool like ReportLab work? I tried using drawText.

Thank you for your interest; in a few hours I will send you what we have done so you can see it. This project is for internal use at the company, not for external consumption, so we are looking for something with a one-time payment.

miguelgarces123 commented 5 years ago

Any ideas on how to modify the text at a specific position within a PDF?

jbarlow83 commented 5 years ago

You need a content stream parser that tracks the graphics state well enough to determine when and where text was drawn. You can exploit the behavior of the Tesseract PDF generator to look specifically for text it generates, because it will select a font named "GlyphLess" and always render UTF-16BE encoded strings. If the intent is to edit or postprocess OCR text, it is easier to generate a file in the HOCR format, edit that, and convert it to PDF.

It's very difficult to do fully general text editing, and not always possible because of subsetted fonts and the exotic/ancient font formats supported by PDF that are still sometimes generated.
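As a toy illustration of the hOCR route: hOCR is just HTML with positioned spans, so corrected text can be written back into the ocrx_word elements before converting to PDF. This stdlib-only sketch operates on a hand-made fragment (a real hOCR file would come from `tesseract input output hocr`):

```python
import xml.etree.ElementTree as ET

# A hand-made hOCR fragment with two deliberately misrecognized words.
HOCR = """<div class='ocr_page'>
  <span class='ocrx_word' title='bbox 10 10 50 30'>Teh</span>
  <span class='ocrx_word' title='bbox 60 10 120 30'>tabel</span>
</div>"""

def fix_words(hocr_xml, corrections):
    """Replace the text of each ocrx_word span using a corrections map,
    leaving the bounding boxes untouched."""
    root = ET.fromstring(hocr_xml)
    for span in root.iter("span"):
        if span.get("class") == "ocrx_word" and span.text in corrections:
            span.text = corrections[span.text]
    return ET.tostring(root, encoding="unicode")

fixed = fix_words(HOCR, {"Teh": "The", "tabel": "table"})
```

A tool such as hocr-pdf from the hocr-tools project can then burn the edited hOCR back into a searchable PDF.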

miguelgarces123 commented 5 years ago

Hi @jbarlow83, how could I modify the text in an hOCR file? Do you know any tools?

jbarlow83 commented 5 years ago

hocr-tools: https://github.com/tmbdev/hocr-tools

thethinker990 commented 3 years ago

Are you still working on this? I have lots of tables inside PDFs and I want to OCR them. Adobe is quite slow but manages to OCR them. I also want to do it automatically.

Maybe this is helpful: https://stackoverflow.com/questions/59370642/how-to-extract-text-from-table-in-image

jbarlow83 commented 3 years ago

@thethinker990 I don't have anything open source.

thethinker990 commented 3 years ago

Was the link not helpful? There has to be a way.

You must preprocess the image to remove the table lines and dots before throwing it into OCR. Here's an approach using OpenCV.

1. Load the image, convert to grayscale, and apply Otsu's threshold
2. Remove horizontal lines
3. Remove vertical lines
4. Dilate to connect text and remove dots using contour area filtering
5. Bitwise-and to reconstruct the image
6. OCR

And then map the letters back to the OCR layer. As a first step, it would be helpful to just get the full text onto the right page.
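To make the line-removal idea concrete without an OpenCV dependency, here is a toy stdlib-only version of the "remove horizontal lines" step on a 0/1 image; a real implementation would use cv2.getStructuringElement with a wide horizontal kernel and cv2.morphologyEx instead:

```python
def remove_horizontal_lines(img, min_run=5):
    """Zero out any horizontal run of ink (1s) longer than min_run pixels.
    `img` is a list of rows of 0/1 ints: long runs are table rules,
    short runs are probably glyph strokes and are kept."""
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x] == 1:
                start = x
                while x < len(row) and row[x] == 1:
                    x += 1
                if x - start >= min_run:
                    for i in range(start, x):
                        out[y][i] = 0
            else:
                x += 1
    return out

page = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # a table rule: removed
    [0, 1, 1, 0, 0, 1, 0, 0],  # text strokes: kept
]
cleaned = remove_horizontal_lines(page, min_run=5)
# cleaned[0] == [0]*8; cleaned[1] is unchanged
```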

agiera commented 1 year ago

Using the advice above, I made this workaround.

import camelot
import PIL
import pandas as pd
import pytesseract

def ocr_form(filename):
    tables = camelot.read_pdf(filename, line_scale=80, dpi=300)
    image = tables[0]._image[0]

    table_dfs = []
    for table in tables:
        table_df = []
        for row in table.cells:
            row_df = []
            for cell in row:
                bbox = [cell.x1, cell.y1, cell.x2, cell.y2]
                # The scale factor should be dpi / 72 (300/72 ≈ 4.167);
                # 4.165 was found by guess-and-check and works here
                bbox = [int(4.165*coord) for coord in bbox]
                # Cut off a few pixels to avoid tesseract detecting form lines
                bbox = [bbox[0]+5, image.shape[0] - bbox[1] - 5, bbox[2]-4, image.shape[0] - bbox[3] + 5]

                cell_image = image[bbox[3]:bbox[1], bbox[0]:bbox[2]]
                pil_image = PIL.Image.fromarray(cell_image.astype('uint8'), 'RGB')

                text = pytesseract.image_to_string(pil_image, config='--psm 6')
                row_df.append(text.strip())
            table_df.append(row_df)
        table_dfs.append(table_df)
    return [pd.DataFrame(table_df) for table_df in table_dfs]