[BUG] Failures when extracting text from pdf

ryankilroy commented 4 months ago

Description

When I attempt to extract the text from a PDF with certain embedded fonts, some runes are missing from the result. The fonts don't throw errors on the first page (which still has missing runes), but when I attempt to extract the fonts from later pages in the PDF, I get "Can't convert font object, invalid type" errors.

Expected Behavior

I expect to be able to extract usable text from the pdf

Actual Behavior

Extracting text from the pdf results in missing runes

Steps to reproduce the behavior:

1. Construct a PdfReader from the attached pdf
2. Get the first page from the reader and construct an extractor with it
3. Output the result of ExtractText()

If you instead run pdftotext <file.pdf> - against it, the text is fully readable

Attachments

    // Imports used: fmt, os/exec, strings,
    // github.com/unidoc/unipdf/v3/extractor, github.com/unidoc/unipdf/v3/model
    pdfReader, _, err := model.NewPdfReaderFromFile(filePath, nil)
    if err != nil {
        return nil, fmt.Errorf("failed to create pdf reader: %w", err)
    }

    // Extract and print the text of every page with UniPDF.
    pageCount := len(pdfReader.PageList)
    var pages []string
    for i := range pageCount {
        pageNum := i + 1
        page, err := pdfReader.GetPage(pageNum)
        if err != nil {
            return nil, fmt.Errorf("failed to get page %d: %w", pageNum, err)
        }

        ex, err := extractor.New(page)
        if err != nil {
            return nil, fmt.Errorf("failed to create extractor: %w", err)
        }

        text, err := ex.ExtractText()
        if err != nil {
            return nil, fmt.Errorf("failed to extract text: %w", err)
        }
        pages = append(pages, text)
        fmt.Println("=== UNIPDF TEXT ===")
        fmt.Println(text)
    }

    // Run pdftotext on the same file for comparison, writing to stdout ("-").
    pdfToTextArgs := []string{
        filePath, "-",
    }
    cmd := exec.Command("pdftotext", pdfToTextArgs...)

    b, err := cmd.Output()
    if err != nil {
        return nil, fmt.Errorf("failed to run pdftotext: %w", err)
    }

    // Turn any literal backslash-n sequences into real newlines.
    content := strings.ReplaceAll(string(b), `\n`, "\n")
    fmt.Println("=== PDFTOTEXT TEXT ===")
    fmt.Println(content)

=== UNIPDF TEXT ===
Convertibe hoder
...etc

=== PDFTOTEXT ===
Convertible holder
...etc

Sample PDF.pdf

Examples

There are more missing runes in other areas of the actual pdf, but I couldn't replicate them with the anonymized data. Here are some examples: the UniPDF output first, followed by the expected text.

Issue date Aug   
Board approva date Aug   
Origina principa $ USD

Issue date
Board approval date

Aug. 4, 2023
Aug. 24, 2023

Value
Original principal

$1,000.00 (USD)

github-actions[bot] commented 4 months ago

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

kellemNegasi commented 4 months ago

Hi @ryankilroy, thank you for reporting this issue. We were able to reproduce it using the sample code and sample file you provided, and we are currently investigating the cause. We will post an update as soon as we identify the source of the issue and a fix.

kellemNegasi commented 3 months ago

Hi @ryankilroy, after some investigation, we found that the issue lies in the ToUnicode map provided in the document. It contains an invalid code point for the character code that represents the missing letter (l). Other tools were able to extract the correct character because they fell back to the Replacement Text data provided as part of the marked content. Our extractor doesn't currently implement this feature, so it took the invalid code point (which, incidentally, is in the Private Use Area of Unicode) and extracted it as valid text. We plan to add this feature and will post an update on this ticket when it is released.
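A quick way to check extracted text for this symptom is to scan it for Private Use Area code points. The sketch below is not part of UniPDF; it uses only the Go standard library, and the names `puaHit` and `findPrivateUse` are made up for illustration. The sample string uses U+E000 as a hypothetical stand-in for the unmapped "l", since the actual code point in the document is not stated in this thread.

```go
package main

import (
	"fmt"
	"unicode"
)

// puaHit records one private-use rune found in a string.
// (Hypothetical helper type, not a UniPDF API.)
type puaHit struct {
	Offset int  // byte offset of the rune within the string
	Rune   rune // the private-use code point itself
}

// findPrivateUse returns every rune in s that belongs to Unicode general
// category Co (private use), together with its byte offset.
func findPrivateUse(s string) []puaHit {
	var hits []puaHit
	for i, r := range s {
		if unicode.Is(unicode.Co, r) {
			hits = append(hits, puaHit{Offset: i, Rune: r})
		}
	}
	return hits
}

func main() {
	// Assumed example: U+E000 stands in for the letter the ToUnicode map
	// failed to map; the real document may use a different code point.
	text := "Convertib\uE000e ho\uE000der"
	for _, h := range findPrivateUse(text) {
		fmt.Printf("private-use rune %U at byte offset %d\n", h.Rune, h.Offset)
	}
}
```

Running this over the text returned by ExtractText() would flag pages where a ToUnicode mapping may be unreliable, although private-use code points can also be legitimate in some documents.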

Regarding your second issue, the font extraction failure: there are no fonts on pages 3 and beyond because those pages are scanned. The error message is not informative enough to convey this, so we will update it as well.

kellemNegasi commented 3 months ago

Hi @ryankilroy, this issue is fixed in the new release (v3.60.0), which can be found here: https://github.com/unidoc/unipdf/releases/tag/v3.60.0. Closing this ticket as fixed.
