huridocs / pdf-text-extraction

This project aims to extract text from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool, this project automates the process of text extraction from PDF files.
Apache License 2.0

Can this extract text from image-only PDFs? #1

Closed: vincent-czi closed this issue 2 days ago

vincent-czi commented 3 days ago

I have a PDF that is composed only of photos of text: can pdf-text-extraction pull text from that?

Specifically, it's a 5-page PDF that consists purely of photos of book pages. It's meant to be representative of a worst-case scan of a document: 5 smartphone photos of 5 pages, stitched together sequentially into a PDF (by opening the images together in Mac's Preview and then printing to PDF). When I run it through the pdf-text-extraction service, I don't get any text. Is the service able to perform OCR on the document as part of pulling out the text, or is it more about performing segmentation and then using those segments to pull out embedded text strings in an intelligent way?

I'm running this on an older Mac laptop (2019), so I'm using the no-GPU mode. Here's what I'm running:

make start_no_gpu
curl -X POST -F 'file=@/path/to/my_bad_scan.pdf' localhost:5080/text

But the only output I get is the string below. The whole run takes a little under 2 minutes.

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"

So it appears that it's not managing to extract any text info from the images, even though they are images of a book and readable to human eyes / OCR. Is there a setting I'm missing, or is this intentional and the pdf-text-extraction tool isn't actually meant to do OCR? Thanks very much!
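
A response made up only of newlines, like the one quoted in this comment, can be detected programmatically before deciding that a PDF needs OCR. The following is a minimal sketch, assuming the `/text` endpoint returns either a JSON-encoded string or plain text; the function name `extracted_text_is_empty` is hypothetical and not part of pdf-text-extraction.

```python
import json


def extracted_text_is_empty(response_body: str) -> bool:
    """Return True when the response body contains no visible characters.

    Assumption: the body is either a JSON-encoded string (quoted, with
    escaped newlines) or raw text. Both cases are handled.
    """
    try:
        decoded = json.loads(response_body)
    except json.JSONDecodeError:
        # Not valid JSON, so treat the body as raw text.
        decoded = response_body
    if not isinstance(decoded, str):
        # JSON numbers, lists, etc. count as non-empty content.
        decoded = str(decoded)
    return decoded.strip() == ""
```

If this returns True for a PDF that is clearly readable to a human, the file most likely has no embedded text layer and would need OCR first.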

ali6parmak commented 2 days ago

Hi, unfortunately our service does not support OCR, at least for now. That may change in the future. Thanks for your interest!

txau commented 2 days ago

@vincent-czi OCR needs to be done as a separate step, before segmenting the PDF. We use two different approaches for segmentation: one uses computer vision (slower and more resource-intensive, but more accurate), and the other uses the text layer of the documents (faster and cheaper, but less accurate).

You may be interested in our repo that wraps Tesseract OCR: https://github.com/huridocs/pdf_ocr_service
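
The workflow described in this comment (OCR as a separate step, then extraction) could be sketched as follows. This is only an illustration: `extract` and `ocr` are hypothetical callables supplied by the caller, for example one wrapping the `/text` endpoint and one wrapping an OCR tool. Neither is an actual API of pdf-text-extraction or pdf_ocr_service. The sketch runs OCR only when the first extraction comes back empty, which is a design choice made here and not something the maintainers prescribe.

```python
from typing import Callable


def extract_text_with_ocr_fallback(
    pdf_bytes: bytes,
    extract: Callable[[bytes], str],
    ocr: Callable[[bytes], bytes],
) -> str:
    """Extract text, running OCR first only if the PDF yields no text.

    extract: hypothetical callable, PDF bytes -> extracted text.
    ocr:     hypothetical callable, PDF bytes -> PDF bytes with a text layer.
    """
    text = extract(pdf_bytes)
    if text.strip():
        # The PDF already has a usable text layer.
        return text
    # Only whitespace came back: assume an image-only PDF, OCR it,
    # then run extraction again on the OCRed file.
    return extract(ocr(pdf_bytes))
```

For a PDF known in advance to be image-only, calling `ocr` directly and then `extract` avoids the wasted first pass.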

vincent-czi commented 2 days ago

Thank you! And thanks for the link to the OCR service, I wasn't aware of it previously!