microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License

Running model on PDFs and Generating tokens for words.json from PDF #147

Open Nikhilsonawane07 opened 9 months ago

Nikhilsonawane07 commented 9 months ago

I am generating tokens for the table detection model from a PDF using the following script.

[screenshot: token-generation script]

Then I convert the PDF to images. However, I get an error while running the inference pipeline.

[screenshot: error traceback]

Please help with this issue. Also, if anyone can share their script for running the model on PDFs, that would be great! Thanks in advance!

aostiles commented 9 months ago

I ran into a similar problem and fixed it. In my case, it was because Rect() expects four params: Rect(x0, y0, x1, y1), where (x0, y0) is the top-left corner and (x1, y1) is the bottom-right (in image coordinates, y grows downward).

I'm using easyocr. To represent a bounding box, it returns a list of four corner points, so I had to grab the two relevant corners.

I also ran into an issue serializing int64s. Here's my OCR code:

import json
import easyocr

reader = easyocr.Reader(['en'])
result = reader.readtext('path/to/image.jpg')
words = []

for word in result:
    # easyocr returns [bbox, text, confidence] per word; bbox is a list of
    # four corner points ordered top-left, top-right, bottom-right, bottom-left.
    bbox_raw = word[0]
    # Keep only the top-left and bottom-right corners: [x0, y0, x1, y1].
    bbox = [bbox_raw[0][0], bbox_raw[0][1], bbox_raw[2][0], bbox_raw[2][1]]
    text = word[1]
    words.append({"text": text, "bbox": bbox})

with open("path/to/image_words.json", "w") as file:
    # default=int handles numpy int64 coordinates, which json can't serialize.
    json.dump(words, file, default=int)

I ran as follows:

python inference.py --image_dir ../path/to/img/ --words_dir ../path/to/words/ --out_dir ../results --structure_config_path ./structure_config.json --structure_model_path ./TATR-v1.1-Pub-msft.pth --mode extract --csv --detection_config_path ./detection_config.json --detection_model_path ./pubtables1m_detection_detr_r18.pth --visualize

linkstatic12 commented 8 months ago

I would advise using pdftools, which is available in R and can also be called from Python. pdftools is much more accurate when it comes to PDF manipulation.

Nikhilsonawane07 commented 8 months ago

Hey thanks for the reply, but I am looking to read text from pdf only not from images
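
For reading from the PDF's own text layer (no OCR), a minimal sketch that converts PyMuPDF word tuples `(x0, y0, x1, y1, text, block, line, word)` into the records `inference.py` expects in words.json; the file paths are placeholders:

```python
import json

def words_to_records(word_tuples):
    """Convert PyMuPDF word tuples into {"text", "bbox"} records
    for a words.json file. bbox is [x0, y0, x1, y1]."""
    return [
        {"text": w[4], "bbox": [float(w[0]), float(w[1]), float(w[2]), float(w[3])]}
        for w in word_tuples
    ]

# Usage (requires `pip install PyMuPDF`):
# import fitz
# doc = fitz.open("path/to/file.pdf")
# for i, page in enumerate(doc):
#     records = words_to_records(page.get_text("words"))
#     with open(f"page_{i}_words.json", "w") as f:
#         json.dump(records, f)
```

Note this only works for born-digital PDFs; scanned PDFs have no text layer and still need OCR.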

linkstatic12 commented 8 months ago

You can convert the PDF pages to images.
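
One way to do the page-to-image conversion, sketched with PyMuPDF (the DPI value and file paths are placeholders; PyMuPDF renders at 72 DPI unless scaled):

```python
def dpi_to_zoom(dpi, base_dpi=72.0):
    """PyMuPDF renders pages at 72 DPI by default; scale by dpi / 72."""
    return dpi / base_dpi

# Usage (requires `pip install PyMuPDF`):
# import fitz
# doc = fitz.open("path/to/file.pdf")
# zoom = dpi_to_zoom(144)  # pick a DPI high enough for your tables
# for i, page in enumerate(doc):
#     pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
#     pix.save(f"page_{i}.png")
```

The resulting PNGs can then be passed to `inference.py` via `--image_dir` as in the command above.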