jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Unrecognized Font #335

Closed tscrosb closed 3 years ago

tscrosb commented 3 years ago

What are you trying to do?

I am using pdfplumber to look for 12-digit strings in a PDF. My code worked when the font was Helvetica, but it stopped working when I changed the font to STSong-Light.

What code are you using to do it?

    import glob
    import re

    import pdfplumber

    for filepath in glob.iglob(r"C:\Users\thomascrosbie\Desktop\ALL ANALYSIS\ANALYSIS_6*.pdf"):
        print(filepath)
        pdf_file = filepath
        excel_output = set()
        with pdfplumber.open(pdf_file) as pdf:
            pages = pdf.pages
            for i, pg in enumerate(pages):
                tbl = pages[i].extract_text()
                # look for account number
                p = re.compile(r"(\d{12})")
                result = p.findall(tbl)
                if result:
                    excel_output.add(result[0])
                else:
                    excel_output.add('0')

PDF file

image

Expected behavior

An Excel file containing the 12-digit string.

Actual behavior

The 12-digit string is not detected.

Environment

samkit-jain commented 3 years ago

Appears to be a duplicate of #332. If the text isn't recognised when you run pdfminer's pdf2txt as described in https://pdfminersix.readthedocs.io/en/latest/tutorial/commandline.html#pdf2txt-py, I would recommend raising an issue over at https://github.com/pdfminer/pdfminer.six/issues. If it is recognised, then please reopen this issue and share the PDF as well.
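
The same check can also be run from Python rather than the command line. The following is only a rough sketch, not part of the original report: the helper names (`find_account_numbers`, `count_unmapped_glyphs`, `check_with_pdfminer`) are hypothetical, and it assumes that glyphs pdfminer.six cannot map to Unicode show up in the output as `(cid:N)` tokens. It calls `pdfminer.high_level.extract_text` directly, so pdfplumber is taken out of the picture.

```python
import re

# Same pattern as in the report: any run of 12 consecutive digits.
ACCOUNT_RE = re.compile(r"\d{12}")
# Assumption: unmapped glyphs appear as "(cid:N)" placeholders in the text.
CID_RE = re.compile(r"\(cid:\d+\)")


def find_account_numbers(text):
    """Return every 12-digit run in text; an empty list if text is None."""
    return ACCOUNT_RE.findall(text or "")


def count_unmapped_glyphs(text):
    """Count "(cid:N)" placeholders, a hint that the font was not decoded."""
    return len(CID_RE.findall(text or ""))


def check_with_pdfminer(pdf_path):
    """Extract text with pdfminer.six alone and summarise what it found."""
    # Imported lazily so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    return find_account_numbers(text), count_unmapped_glyphs(text)
```

If `check_with_pdfminer` returns no account numbers and a non-zero placeholder count for the affected file, the problem most likely sits in pdfminer.six's font handling and belongs in that project's tracker. If it does find the numbers, the difference is on the pdfplumber side.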