Closed shreyashankar closed 1 month ago
Greetings, I want to work on this issue.
That would be great! I think a starting point is to create a dataset class actually, like #2 (but we don't have to support all data file types), and then support folder & pdf inputs. Any thoughts on OCR APIs?
I haven't used APIs for OCR, but of the various tools I have used or come across, paddle_ocr has been the best. We can deploy it as an API and then use it.
Also, take a look at https://github.com/nlmatics/llmsherpa. They recently opened their server, and it has built-in OCR support.
PaddleOCR seems well-maintained, even if it may or may not be SOTA. It seems like they already provide a Docker image? Or we can create one if we need to.
Then in the DocETL config, maybe a user can specify the uri to their Paddle deployment, and we can query that while loading/processing data. LMK what you think!
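To make the config idea concrete, here is a minimal sketch of querying a user-supplied OCR endpoint while loading data. Everything named here is an assumption for illustration: the `ocr_url` config key, the `ocr_document` helper, and the `{"text": ...}` JSON response shape are hypothetical and do not reflect PaddleOCR's actual serving API or DocETL's real config schema.

```python
import json
import urllib.request
from typing import Callable, Dict, List


def default_post(url: str, payload: bytes) -> bytes:
    """POST raw bytes to the endpoint and return the raw response body."""
    req = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read()


def ocr_document(
    config: Dict[str, str],
    file_bytes: bytes,
    post: Callable[[str, bytes], bytes] = default_post,
) -> List[str]:
    """Send a file to the OCR deployment named in the config.

    Assumes a hypothetical "ocr_url" key and a JSON reply of the form
    {"text": "..."}; a real deployment will likely differ.
    """
    url = config.get("ocr_url")
    if not url:
        raise ValueError("config must set 'ocr_url' to use OCR parsing")
    response = json.loads(post(url, file_bytes))
    # Return a list so the result matches a "list of documents" contract.
    return [response["text"]]
```

The `post` argument is injectable so the loader can be exercised without a running OCR server.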
PaddleOCR does provide a Docker image that can be used. I'll try to run it on my local PC in a day or two. Some other cool OCRs that may be SOTA, or at least close to SOTA, are
Perfect. Thank you!
For PDFs, one technique you can try is to convert the file into markdown using llamaparse or Azure Doc AI and then process it (assuming that LLMs will understand markdown better).
A good place to add an OCR parsing tool is here: https://github.com/ucbepic/docetl/blob/fb900c10cbfec80c69272f965ff6294701eb6ede/docetl/parsing_tools.py
Assume you get access to the path of the PDF. The parsing function should return a list of strings, where each string represents a document. In the PDF case this list will probably have only one string. It's a list because other parsing functions (e.g., for Excel files) might actually return multiple documents.
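A rough sketch of that contract, under a few assumptions: the function names are illustrative rather than what `parsing_tools.py` actually defines, and a CSV parser from the standard library stands in for the Excel case so the multi-document behaviour can be shown without extra dependencies.

```python
import csv
from typing import List


def pdf_to_string(path: str) -> List[str]:
    """Return the text of one PDF as a single-document list."""
    # Third-party dependency, imported lazily so the CSV helper works without it.
    import PyPDF2

    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        # extract_text() can come back empty for scanned pages, hence the guard.
        pages = [page.extract_text() or "" for page in reader.pages]
    return ["\n".join(pages)]


def csv_to_strings(path: str) -> List[str]:
    """Return one document per data row (stand-in for a multi-document parser)."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [", ".join(f"{key}: {value}" for key, value in row.items()) for row in rows]
```

Both functions share the same signature, path in and list of strings out, so a loader could treat them interchangeably.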
import PyPDF2


def pdf_to_string(input_path: str, output_path: str) -> None:
    """
    Extract text from a PDF file and save it as a text file.

    Args:
        input_path (str): Path to the input PDF file.
        output_path (str): Path where the output text will be saved.

    Returns:
        None. The extracted text is written to output_path.
    """
    with open(input_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Guard against pages with no extractable text (e.g., scanned images).
        pdf_text = [page.extract_text() or "" for page in reader.pages]
    full_text = "\n".join(pdf_text)
    with open(output_path, "w", encoding="utf-8") as output_file:
        output_file.write(full_text)
    print(f"Text extracted and saved to {output_path}")


if __name__ == "__main__":
    input_pdf = ""
    output_txt = ""
    pdf_to_string(input_pdf, output_txt)
I gave this pdf file as input and got an output like this
Kindly see if anything else is to be added to this and suggest how I can improve this further.
Hi Shreya, with PDFs we need a layout parser, and I think PyPDF2 does not do that. We can connect and discuss if that is OK with you. PDF and Excel are difficult formats.
Thanks, Rajib
On Tue, Oct 1, 2024 at 11:23 AM Aru Sharma wrote:
I gave this pdf https://drive.google.com/file/d/13nRnGPie63guMCz9Nlcs_W4hDVmctM70/view?usp=sharing file as input and got an output like this https://drive.google.com/file/d/1cC_boVaxrksxP9tECeAkkTJa_kDRuUir/view?usp=sharing
Yeah, I'm realizing layout matters after talking to many people. Discussion is happening on Discord: https://discord.com/channels/1285485891095236608/1290751112936296449
Objective
Extend DocETL to support PDF files as input, including the ability to extract text using OCR for scanned documents or images within PDFs.
Background
Currently, DocETL supports structured data formats like JSON. Adding PDF support with OCR will greatly expand our ability to process a wider range of document types, including scanned documents and image-based PDFs.
Tasks
Dataset class to handle PDF inputs.
Discussion Points
Next Steps
If you're interested in tackling this issue or have insights to share, please reach out via Discord or comment on this issue. We'd love to discuss approach, tool selection, and implementation details further.