🍩 A pdf reading function

raynardj commented 3 years ago

I have working code for this, but it only works if pdfminer is installed.

I do not intend to put pdfminer into requirements, as it's not related to most of the other tasks.

from pathlib import Path
from shutil import which
from subprocess import check_output
import logging

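def raw_pdf2txt(pdf_file: Path) -> str:
    """
    Capture the pdf2txt.py command line output and return it as a utf8 string
    """
    # Fail early with a clear message if the pdfminer script is not on PATH
    if which("pdf2txt.py") is None:
        raise FileNotFoundError("pdf2txt.py not found, please install pdfminer first")
    return check_output(["pdf2txt.py", str(pdf_file)]).decode("utf-8")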

def shrink_lines(text: str) -> str:
    """
    Clean up the messy line breaks,
    because the original output is too sparse,
    with too many line breaks
    """
    return (
        text.replace("\n\n", "<line_break>")  # protect paragraph breaks
        .replace("- \n", "")  # rejoin words hyphenated across lines
        .replace("-\n", "")
        .replace("\n", " ")  # unwrap the remaining single line breaks
        .replace("<line_break>", "\n\n")  # restore paragraph breaks
    )

def convert_PDF(pdf_file: Path) -> str:
    """
    Convert pdf to text
    pdf_file, path of pdf file
    return a utf8 string
    """
    return shrink_lines(
        raw_pdf2txt(pdf_file)
    )
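
To make the effect of shrink_lines concrete, here is a small worked example. The sample string is made up for illustration and only imitates the kind of hard-wrapped, hyphenated output a PDF extractor tends to produce; real pdf2txt.py output may look different.

```python
def shrink_lines(text: str) -> str:
    """Same cleanup chain as in the comment above, repeated so this runs on its own."""
    return (
        text.replace("\n\n", "<line_break>")  # protect paragraph breaks
        .replace("- \n", "")  # rejoin words hyphenated across lines
        .replace("-\n", "")
        .replace("\n", " ")  # unwrap the remaining single line breaks
        .replace("<line_break>", "\n\n")  # restore paragraph breaks
    )


# Hypothetical raw text: one hyphenated word per paragraph, hard-wrapped lines
sample = "Intro-\nduction to PDF\nparsing.\n\nSecond para-\ngraph."
cleaned = shrink_lines(sample)
print(repr(cleaned))
# prints 'Introduction to PDF parsing.\n\nSecond paragraph.'
```

One caveat of this approach: a word that is legitimately hyphenated at a line end (for example "high-\nlevel") is also joined into "highlevel", since the chain cannot tell the two cases apart.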

jfthuong commented 3 years ago

We could declare pdfminer as an optional dependency (an "extra"), so that it is only installed with pip install unpackai[PDF], as described here: https://setuptools.pypa.io/en/latest/userguide/dependency_management.html#optional-dependencies
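
As a rough sketch of what that could look like in a setup.py, assuming the extra is named PDF and the PyPI package is pdfminer.six (the "all" extra is an extra assumption added purely for illustration, not something proposed in this thread):

```python
# setup.py fragment (sketch): optional dependencies grouped by extra name
EXTRAS_REQUIRE = {
    # Installed only with: pip install unpackai[PDF]
    "PDF": ["pdfminer.six"],
}

# Hypothetical convenience extra that pulls in every optional dependency
EXTRAS_REQUIRE["all"] = sorted(
    {dep for deps in EXTRAS_REQUIRE.values() for dep in deps}
)

# In the real setup.py the dict would be handed to setuptools via the
# extras_require keyword of setuptools.setup(), next to install_requires.
print(EXTRAS_REQUIRE)
```

With this in place, a plain pip install unpackai would skip pdfminer, and only users who ask for the PDF extra would get it.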

raynardj commented 3 years ago

https://setuptools.pypa.io/en/latest/userguide/dependency_management.html#optional-dependencies

s*** I totally didn't know about this, thanks!!!!

ah... that's what the brackets are about

jfthuong commented 2 years ago

Your code is actually running a command and assumes that it is installed.

I would recommend using the high-level functions: https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html#extract-text

I can re-write the code accordingly.
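
A minimal sketch of such a rewrite, assuming pdfminer.six is the installed package and using its extract_text high-level function. The name pdf_to_text and the wording of the error message are placeholders, not the final unpackai API. The import happens inside the function so that the rest of the library still loads when the optional dependency is missing.

```python
from pathlib import Path


def pdf_to_text(pdf_file: Path) -> str:
    """Extract the text of a PDF with pdfminer's high-level API (sketch)."""
    try:
        # Imported lazily: pdfminer is an optional dependency
        from pdfminer.high_level import extract_text
    except ImportError as error:
        raise ImportError(
            "pdfminer is required to read PDF files, "
            "try: pip install unpackai[PDF]"
        ) from error
    # str() keeps this working with versions that expect a plain path string
    return extract_text(str(pdf_file))
```

Compared with calling pdf2txt.py through subprocess, this does not depend on the script being on PATH, and a missing dependency surfaces as an ImportError with an install hint. The line cleanup from shrink_lines could still be applied to the returned string.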

unpackAI / unpackai

🍩 A pdf reading function #60