unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

[BUG] Extracted text from table is reversed when text is styled with an underline #557

Open lavens opened 3 weeks ago

lavens commented 3 weeks ago

Description

When I extract text from a PDF containing a table whose content is underlined, the lines of text within each cell come out in reverse order. Once the underline formatting is removed, the text is extracted in the expected order.

Expected Behaviour

Text extracted from a table keeps its original order regardless of the formatting applied.

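Actual Behaviour

With underline formatting, the lines within each table cell are extracted in reverse order (compare the two outputs below).

Steps to reproduce the behaviour: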

  1. Construct a PdfReader from the attached PDF
  2. Get the first page from the reader and construct an extractor with it
  3. Output the result of ExtractText()

Attachments

Without Underline Formatting.pdf
With Underline Formatting.pdf

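    // Imports needed: "fmt", "github.com/unidoc/unipdf/v3/extractor",
    // "github.com/unidoc/unipdf/v3/model" (v3 module path assumed).
    func extractFirstPageText(filePath string) (string, error) {
        pdfReader, f, err := model.NewPdfReaderFromFile(filePath, nil)
        if err != nil {
            return "", fmt.Errorf("failed to create pdf reader: %w", err)
        }
        defer f.Close()

        // Page numbers are 1-based, so this is the first page.
        page, err := pdfReader.GetPage(1)
        if err != nil {
            return "", fmt.Errorf("failed to get page: %w", err)
        }

        ex, err := extractor.New(page)
        if err != nil {
            return "", fmt.Errorf("failed to create extractor: %w", err)
        }

        pageText, err := ex.ExtractText()
        if err != nil {
            return "", fmt.Errorf("failed to extract text: %w", err)
        }
        fmt.Printf("Extracted text: %s\n", pageText)
        return pageText, nil
    }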

Output with underline formatting:

EXHIBIT A

SCHEDULE OF PURCHASERS

Initial Closing Date: March 30, 2023

Purchaser Investment Amount Preferred Shares
No. of Series Seed
demo+inf@tenkeylabs.com
Investor Fund 1, L.P. $2,000,000 571,428
demo+inf@tenkeylabs.com
Investor Fund 2, L.P. $1,000,000 285,714
demo+inf@tenkeylabs.com
Investor Fund 2-B, L.P. $200,000 57,142
demo+pi1@tenkeylabs.com
Private Investor 1 $75,000 21,428
demo+pi21@tenkeylabs.com
Private Investor 2 $100,000 28,571
demo+pi3@tenkeylabs.com
Private Investor 3 $95,000 27,142
demo+pi4@tenkeylabs.com
Private Investor 4 $95,000 27,142
demo+pi5@tenkeylabs.com
Private Investor 5 $200,000 57,142
demo+pi6@tenkeylabs.com
Private Investor 6 $50,000 14,285
demo+pi7@tenkeylabs.com
Private Investor 7 $75,000 21,428
demo+pi8@tenkeylabs.com
Private Investor 8 $110,000 31,428

TOTALS: $4,000,000 1,142,850

Output without underline formatting:

EXHIBIT A

SCHEDULE OF PURCHASERS

Initial Closing Date: March 30, 2023

Purchaser Investment Amount No. of Series Seed Preferred
Shares
Investor Fund 1, L.P.
demo+inf@tenkeylabs.com $2,000,000 571,428
Investor Fund 2, L.P.
demo+inf@tenkeylabs.com $1,000,000 285,714
Investor Fund 2-B, L.P.
demo+inf@tenkeylabs.com $200,000 57,142
Private Investor 1
demo+pi1@tenkeylabs.com $75,000 21,428
Private Investor 2
demo+pi21@tenkeylabs.com $100,000 28,571
Private Investor 3
demo+pi3@tenkeylabs.com $95,000 27,142
Private Investor 4
demo+pi4@tenkeylabs.com $95,000 27,142
Private Investor 5
demo+pi5@tenkeylabs.com $200,000 57,142
Private Investor 6
demo+pi6@tenkeylabs.com $50,000 14,285
Private Investor 7
demo+pi7@tenkeylabs.com $75,000 21,428
Private Investor 8
demo+pi8@tenkeylabs.com $110,000 31,428

TOTALS: $4,000,000 1,142,850
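
To make the difference between the two outputs easier to check, here is a minimal standard-library sketch (not part of unipdf). It compares two extracted strings and reports whether they hold the same words in a different order. The helper names `sameTokens` and `firstDiffLine` are hypothetical, and the sample strings are one row copied from each output above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sameTokens reports whether a and b contain the same whitespace-separated
// tokens with the same multiplicities, ignoring their order.
func sameTokens(a, b string) bool {
	ta, tb := strings.Fields(a), strings.Fields(b)
	if len(ta) != len(tb) {
		return false
	}
	sort.Strings(ta)
	sort.Strings(tb)
	for i := range ta {
		if ta[i] != tb[i] {
			return false
		}
	}
	return true
}

// firstDiffLine returns the 1-based index of the first line at which a and b
// differ, or 0 if every line matches.
func firstDiffLine(a, b string) int {
	la, lb := strings.Split(a, "\n"), strings.Split(b, "\n")
	for i := 0; i < len(la) || i < len(lb); i++ {
		if i >= len(la) || i >= len(lb) || la[i] != lb[i] {
			return i + 1
		}
	}
	return 0
}

func main() {
	// One table row from each output above (hypothetical variable names).
	underlined := "demo+inf@tenkeylabs.com\nInvestor Fund 1, L.P. $2,000,000 571,428"
	plain := "Investor Fund 1, L.P.\ndemo+inf@tenkeylabs.com $2,000,000 571,428"

	fmt.Println(sameTokens(underlined, plain))    // true: same words, different order
	fmt.Println(firstDiffLine(underlined, plain)) // 1: the rows diverge on the first line
}
```

Running the same comparison on the full `ExtractText()` results for both attachments should show whether any content is lost or whether only the ordering changes.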

github-actions[bot] commented 3 weeks ago

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

kellemNegasi commented 1 week ago

Hi @lavens, thank you for reporting this issue and for including the sample files and code. We were able to reproduce it and have created a ticket to investigate. We will post an update as soon as we figure this out.