Add Support for PDF Input with OCR Capabilities

shreyashankar commented 2 months ago

Objective

Extend DocETL to support PDF files as input, including the ability to extract text using OCR for scanned documents or images within PDFs.

Background

Currently, DocETL supports structured data formats like JSON. Adding PDF support with OCR will greatly expand our ability to process a wider range of document types, including scanned documents and image-based PDFs.

Tasks

Research and evaluate OCR libraries and tools suitable for integration with DocETL.
Implement PDF parsing and text extraction for standard PDFs.
Integrate OCR capabilities for handling scanned documents and images within PDFs.
Update the Dataset class to handle PDF inputs.
Modify the YAML configuration to support PDF dataset specifications.
Add appropriate unit tests and update documentation.
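
One way to combine the standard-PDF and OCR tasks above is to try the embedded text layer first and fall back to OCR only for pages that yield little or no text. The following is a minimal sketch, not an agreed design: the `min_chars` threshold and the caller-supplied `ocr_page` callback are assumptions made for illustration and are not part of DocETL.

```python
from typing import Callable, List, Optional


def needs_ocr(page_text: Optional[str], min_chars: int = 20) -> bool:
    """Heuristic: treat a page as scanned if its text layer is (nearly) empty.

    min_chars is an assumed threshold, not a tuned value.
    """
    return len((page_text or "").strip()) < min_chars


def extract_with_fallback(
    page_texts: List[Optional[str]],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> List[str]:
    """Keep the text layer where it exists; call ocr_page(index) otherwise."""
    result = []
    for index, text in enumerate(page_texts):
        if needs_ocr(text, min_chars):
            # Hypothetical callback: whatever OCR backend ends up being chosen.
            result.append(ocr_page(index))
        else:
            result.append(text)
    return result
```

The callback keeps the heuristic independent of any particular OCR tool, so the choice of backend can be made later.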

Discussion Points

What are the current state-of-the-art OCR techniques we should consider? I (Shreya) am not too familiar; I've only used Azure Document Intelligence and it was good enough for me. There's some wrangling one has to do to convert the Azure output to a readable format, which is kind of annoying.
Should we commit to specific OCR tools, or design a flexible interface to support multiple options?
How can we balance accuracy, speed, and ease of use in our OCR implementation?
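
To make the "flexible interface" option concrete, here is a rough sketch of what a pluggable OCR backend could look like. All names (`OCRBackend`, `register_backend`, `get_backend`, `EchoBackend`) are hypothetical and exist only for this illustration; nothing here is existing DocETL code.

```python
from typing import Dict, List, Protocol


class OCRBackend(Protocol):
    """Anything that can turn a PDF path into per-page text."""

    def extract_pages(self, pdf_path: str) -> List[str]:
        ...


# Registry of backends keyed by a short name that a config could refer to.
_BACKENDS: Dict[str, OCRBackend] = {}


def register_backend(name: str, backend: OCRBackend) -> None:
    _BACKENDS[name] = backend


def get_backend(name: str) -> OCRBackend:
    try:
        return _BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS)) or "none"
        raise ValueError(f"Unknown OCR backend {name!r}; registered: {known}") from None


class EchoBackend:
    """Placeholder backend for illustration; performs no real OCR."""

    def extract_pages(self, pdf_path: str) -> List[str]:
        return [f"(no OCR performed for {pdf_path})"]
```

With this shape, committing to one tool first and adding others later are not mutually exclusive: a single backend can ship initially, and further ones can be registered without changing callers.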

Next Steps

If you're interested in tackling this issue or have insights to share, please reach out via Discord or comment on this issue. We'd love to discuss approach, tool selection, and implementation details further.

staru09 commented 1 month ago

Greetings, I want to work on this issue.

shreyashankar commented 1 month ago

That would be great! I think a starting point is to create a dataset class actually, like #2 (but we don't have to support all data file types), and then support folder & pdf inputs. Any thoughts on OCR APIs?
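
As a sketch of the "folder & pdf inputs" part, a dataset loader could first resolve its source into a list of PDF paths. The helper name `collect_pdf_paths` is hypothetical and the recursive search is an assumption; the actual dataset class may want different behavior.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(source: str) -> List[str]:
    """Return PDF paths for a single file or, recursively, for a folder."""
    path = Path(source)
    if path.is_dir():
        # Match the extension case-insensitively so "scan.PDF" is included.
        return sorted(
            str(p) for p in path.rglob("*")
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    if path.is_file() and path.suffix.lower() == ".pdf":
        return [str(path)]
    raise ValueError(f"{source!r} is neither a PDF file nor a folder")
```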

staru09 commented 1 month ago

I haven't used APIs for OCR, but of the various tools I have used or come across, paddle_ocr has been the best. We can deploy it as an API and then we will be able to use it.

rodion-m commented 1 month ago

Also, take a look at https://github.com/nlmatics/llmsherpa. They recently opened up their server, and it has built-in OCR support.

shreyashankar commented 1 month ago

PaddleOCR seems well-maintained, even if it may or may not be SOTA. It seems like they already provide a Docker image? Or we can create one if we need to.

Then in the DocETL config, maybe a user can specify the uri to their Paddle deployment, and we can query that while loading/processing data. LMK what you think!
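
A rough sketch of that idea, written as a Python dict standing in for the YAML entry. Everything here is an assumption for illustration: the `ocr` block and its keys are not an existing DocETL schema, the URI is a placeholder, and the request and response formats are invented; a real PaddleOCR deployment defines its own.

```python
import json
import urllib.request
from typing import Any, Dict

# Hypothetical dataset entry; the "ocr" block and its keys are assumptions,
# not an existing DocETL config schema.
EXAMPLE_DATASET_CONFIG: Dict[str, Any] = {
    "type": "file",
    "path": "reports/",
    "ocr": {"uri": "http://localhost:8866/ocr", "timeout_seconds": 60},
}


def build_ocr_request(config: Dict[str, Any], pdf_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request to the user's OCR deployment.

    The endpoint and raw-bytes payload are placeholders; the real service
    decides what it accepts.
    """
    uri = config["ocr"]["uri"]
    return urllib.request.Request(
        uri,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def parse_ocr_response(body: bytes) -> str:
    """Assume the service replies with JSON like {"text": "..."} (hypothetical)."""
    return json.loads(body.decode("utf-8"))["text"]
```

Sending the request would then be a call to `urllib.request.urlopen(request, timeout=...)` while loading the dataset, with the timeout taken from the same config block.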

staru09 commented 1 month ago

PaddleOCR does provide a Docker image that can be used. I'll try to run it on my local PC in a day or two. Some other cool OCR tools that may be SOTA, or at least close to it, are:

surya_ocr: its benchmarks are pretty good, but inference is very slow without a GPU.
general ocr theory (https://github.com/Ucas-HaoranWei/GOT-OCR2.0): it also has a Docker image and uses LLaVA (Qwen).
dt_ocr (https://arxiv.org/pdf/2308.15996v1): I found this to be the best on Papers with Code, but its implementation is incomplete and unofficial.

shreyashankar commented 1 month ago

Perfect. Thank you!

rajib76 commented 1 month ago

For PDF, one technique you can try is to convert it into Markdown using LlamaParse or Azure Doc AI and then process it (assuming that LLMs will understand Markdown better).

shreyashankar commented 1 month ago

A good place to add an OCR parsing tool is here: https://github.com/ucbepic/docetl/blob/fb900c10cbfec80c69272f965ff6294701eb6ede/docetl/parsing_tools.py

Assume you get access to the path of the PDF. The parsing function should return a list of strings, where each string represents a document. Probably in the PDF case this list will only have one string. It's a list because other parsing functions (e.g., for an Excel file) might actually return multiple documents.
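
The contract described above (path in, list of document strings out) can be sketched as follows. The names `ParsingTool`, `run_parsing_tool`, `whole_file`, and `one_document_per_paragraph` are hypothetical and use plain text files only to illustrate the one-document and many-documents cases; they are not taken from `parsing_tools.py`.

```python
from typing import Callable, List

# A parsing tool takes a file path and returns one string per document.
ParsingTool = Callable[[str], List[str]]


def run_parsing_tool(tool: ParsingTool, path: str) -> List[str]:
    """Call a parsing tool and check that it honours the list-of-strings contract."""
    documents = tool(path)
    if not isinstance(documents, list) or not all(isinstance(d, str) for d in documents):
        name = getattr(tool, "__name__", repr(tool))
        raise TypeError(f"{name} must return a list of strings")
    return documents


def whole_file(path: str) -> List[str]:
    """Single-document case: the whole file is one string."""
    with open(path, "r", encoding="utf-8") as f:
        return [f.read()]


def one_document_per_paragraph(path: str) -> List[str]:
    """Multi-document case: each blank-line-separated block is its own document."""
    with open(path, "r", encoding="utf-8") as f:
        blocks = f.read().split("\n\n")
    return [block.strip() for block in blocks if block.strip()]
```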

staru09 commented 1 month ago

import PyPDF2


def pdf_to_string(input_path: str, output_path: str) -> None:
    """
    Extract text from a PDF file and save it as a text file.

    Args:
        input_path (str): Path to the input PDF file.
        output_path (str): Path where the output text will be saved.

    Returns:
        None. The extracted text is written to output_path.
    """
    with open(input_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Image-only pages may yield no text; fall back to an empty string.
        pdf_text = [page.extract_text() or "" for page in reader.pages]

    full_text = "\n".join(pdf_text)

    with open(output_path, "w", encoding="utf-8") as output_file:
        output_file.write(full_text)

    print(f"Text extracted and saved to {output_path}")


if __name__ == "__main__":
    input_pdf = ""   # path to the input PDF
    output_txt = ""  # path for the output text file
    pdf_to_string(input_pdf, output_txt)

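I gave this PDF file (https://drive.google.com/file/d/13nRnGPie63guMCz9Nlcs_W4hDVmctM70/view?usp=sharing) as input and got an output like this (https://drive.google.com/file/d/1cC_boVaxrksxP9tECeAkkTJa_kDRuUir/view?usp=sharing).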

Kindly see if anything else is to be added to this and suggest how I can improve this further.
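
One possible improvement, given the earlier request that a parsing function return a list of strings, is to return the text instead of writing a file. This is only a sketch under that assumption: `join_pages` is a hypothetical helper, and PyPDF2 is imported lazily so the helper can be exercised without it. It does not address layout.

```python
from typing import Iterable, List, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> List[str]:
    """Join per-page text into one document, tolerating pages with no text."""
    return ["\n".join(text or "" for text in page_texts)]


def pdf_to_string(input_path: str) -> List[str]:
    """Return the PDF's text as a one-element list instead of writing a file."""
    import PyPDF2  # imported lazily so join_pages works without PyPDF2 installed

    with open(input_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # join_pages consumes the generator before the file is closed.
        return join_pages(page.extract_text() for page in reader.pages)
```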

rajib76 commented 1 month ago

Hi Shreya,

With PDFs we need a layout parser, and I think PyPDF2 does not do that. We can connect and discuss if that is OK with you. PDF and Excel are difficult formats.

Thanks,
Rajib


shreyashankar commented 1 month ago

Yeah, I'm realizing layout matters after talking to many people. Discussion is happening on Discord: https://discord.com/channels/1285485891095236608/1290751112936296449

phirsch commented 1 month ago

FWIW, when I looked into PDF parsing a while ago, NougatOCR (paper) and Marker stood out as powerful alternatives if you can live with their compute requirements and licenses. Probably still worth taking a close look.
