microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License

Running model on PDFs and Generating tokens for words.json from PDF #147

Open Nikhilsonawane07 opened 9 months ago

Nikhilsonawane07 commented 9 months ago

I am generating tokens for the table detection model from a PDF using the following script.

[screenshot: token-generation script]

Then I convert the PDF to images. However, I get an error while running the inference pipeline.

[screenshot: error traceback]

Please help with this issue. Also, if anyone can share their script for running the model on PDFs, that would be great! Thanks in advance!

aostiles commented 9 months ago

I ran into a similar problem and fixed it. In my case, it was because Rect() expects four params: Rect(x0, y0, x1, y1), where (x0, y0) is the top-left corner and (x1, y1) is the bottom-right (in image coordinates, y grows downward).

I'm using easyocr. To represent a bounding box, it returns a list of four corner points, so I had to grab the two relevant corners.

I also ran into an issue serializing int64s. Here's my OCR code:

import json
import easyocr

reader = easyocr.Reader(['en'])
result = reader.readtext('path/to/image.jpg')
words = []

for word in result:
    # easyocr returns [bbox, text, confidence] per word; bbox is a list of
    # four corner points ordered top-left, top-right, bottom-right, bottom-left.
    bbox_raw = word[0]
    # Keep only the top-left and bottom-right corners: [x0, y0, x1, y1].
    bbox = [bbox_raw[0][0], bbox_raw[0][1], bbox_raw[2][0], bbox_raw[2][1]]
    text = word[1]
    words.append({"text": text, "bbox": bbox})

with open("path/to/image_words.json", "w") as file:
    # default=int handles numpy int64 coordinates, which json can't serialize.
    json.dump(words, file, default=int)

I ran as follows:

python inference.py --image_dir ../path/to/img/ --words_dir ../path/to/words/ --out_dir ../results --structure_config_path ./structure_config.json --structure_model_path ./TATR-v1.1-Pub-msft.pth --mode extract --csv --detection_config_path ./detection_config.json --detection_model_path ./pubtables1m_detection_detr_r18.pth --visualize

linkstatic12 commented 8 months ago

I would advise using pdftools, which is available in R and can also be called from Python. pdftools is much more accurate when it comes to PDF manipulation.

Nikhilsonawane07 commented 8 months ago

Hey thanks for the reply, but I am looking to read text from pdf only not from images
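
For reading from the PDF's own text layer (no OCR), a minimal sketch that converts PyMuPDF word tuples `(x0, y0, x1, y1, text, block, line, word)` into the records `inference.py` expects in words.json; the file paths are placeholders:

```python
import json

def words_to_records(word_tuples):
    """Convert PyMuPDF word tuples into {"text", "bbox"} records
    for a words.json file. bbox is [x0, y0, x1, y1]."""
    return [
        {"text": w[4], "bbox": [float(w[0]), float(w[1]), float(w[2]), float(w[3])]}
        for w in word_tuples
    ]

# Usage (requires `pip install PyMuPDF`):
# import fitz
# doc = fitz.open("path/to/file.pdf")
# for i, page in enumerate(doc):
#     records = words_to_records(page.get_text("words"))
#     with open(f"page_{i}_words.json", "w") as f:
#         json.dump(records, f)
```

Note this only works for born-digital PDFs; scanned PDFs have no text layer and still need OCR.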

linkstatic12 commented 8 months ago

You can convert the PDF pages to images.
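
One way to do the page-to-image conversion, sketched with PyMuPDF (the DPI value and file paths are placeholders; PyMuPDF renders at 72 DPI unless scaled):

```python
def dpi_to_zoom(dpi, base_dpi=72.0):
    """PyMuPDF renders pages at 72 DPI by default; scale by dpi / 72."""
    return dpi / base_dpi

# Usage (requires `pip install PyMuPDF`):
# import fitz
# doc = fitz.open("path/to/file.pdf")
# zoom = dpi_to_zoom(144)  # pick a DPI high enough for your tables
# for i, page in enumerate(doc):
#     pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
#     pix.save(f"page_{i}.png")
```

The resulting PNGs can then be passed to `inference.py` via `--image_dir` as in the command above.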