sfneal / PyPDF3

A utility to read and write PDFs with Python
https://pythonhosted.org/PyPDF2/

Umlaut handling #13

Open WolfgangFahl opened 2 years ago

WolfgangFahl commented 2 years ago

2021_10_20_15_51_33.pdf has "ALDI SÜD" as copyable text in it (tested with Preview on macOS). When reading it with PyPDF3 using:

from PyPDF3 import PdfFileReader

def getPDFText(self):
    '''
    get my PDF Text
    '''
    pdfText = None
    if self.scannedFile.lower().endswith("pdf"):
        # open in binary mode; the with block closes the file after reading
        with open(self.scannedFile, 'rb') as pdf_file:
            read_pdf = PdfFileReader(pdf_file)
            number_of_pages = read_pdf.getNumPages()
            # extract the text of each page and join the pages with newlines
            pages = [read_pdf.getPage(pageNo).extractText()
                     for pageNo in range(number_of_pages)]
            pdfText = "\n".join(pages)
    return pdfText

I get 'ALDI SƒD' instead. How can this be fixed?

WolfgangFahl commented 2 years ago

See also https://stackoverflow.com/q/64459824/1497139
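
One possible explanation for the 'Ü' to 'ƒ' swap, stated as an assumption and not a confirmed diagnosis: byte 0x86 is 'Ü' in MacRoman, and the same byte is the florin 'ƒ' in PDFDocEncoding, so the text may be stored with a MacRoman-encoded font but decoded as PDFDocEncoding during extraction. The sketch below shows a post-processing workaround under that assumption. The helper `remap_macroman` and the table `PDFDOC_HIGH` are hypothetical names, not part of PyPDF3, and the table lists only the one character seen here.

```python
# Assumption: the extracted text came from MacRoman bytes that were
# decoded as PDFDocEncoding. Only the character observed in this issue
# is listed; a real table would need every affected code point.
PDFDOC_HIGH = {"\u0192": 0x86}  # florin 'ƒ' is byte 0x86 in PDFDocEncoding


def remap_macroman(text, table=PDFDOC_HIGH):
    """Re-decode the characters listed in table as MacRoman bytes.

    Characters that are not in the table are returned unchanged.
    """
    return "".join(
        bytes([table[ch]]).decode("mac_roman") if ch in table else ch
        for ch in text
    )


print(remap_macroman("ALDI S\u0192D"))  # prints: ALDI SÜD
```

If the assumption holds, the same remapping could be applied to the string returned by getPDFText. It would also rewrite any genuine 'ƒ' in the text, so it is only safe for documents where that character is not expected.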