unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

[BUG] Text extraction for lists does not work as expected #508

Open traitman opened 1 year ago

traitman commented 1 year ago

Description

Text that sits on a single line of a list in the PDF does not stay on a single line after extraction to text.

For example:

6. Miu to kekkon shitai desu.

becomes

6.

Miu to kekkon shitai desu.

Expected Behavior

Text extraction for lists should keep each list marker on the same line as its item text. The result for the first list should be:

6. Miu to kekkon shitai desu.

7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun) 
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu. 

9. O, o-jō-san o kudasai.

10. TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?

11. Soretomo uchi no kaisha ga hoshii no ka?

12. TENŌ MIU: Papa!

The second list should result in:

ENGLISH

1. ŌZORA HARUYA: (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.
2. TENZO RAIZO: Huh?
3. ŌZORAHARUYA: ah, um...sir, please give me your daughter.
4. TENDOU RAIZO: Huh?
5. ŌZORA HARUYA: I...I want to be with Miu forever.
6. I want to marry Miu. 
7. TENDO RAIZO: I? Miu? (harrumph)
8. ŌZORA HARUYA: I...I'm sorry. I want to marry Miu. 

Actual Behavior

List extraction is buggy: list markers are separated from the item text they belong to.

For example (image):

The first list is extracted as:

6.

Miu to kekkon shitai desu.

7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun) 
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu. 

9.

10.

11.

12.

O, o-jō-san o kudasai.

TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?

Soretomo uchi no kaisha ga hoshii no ka?

TENŌ MIU: Papa!

The second list is extracted as:

ENGLISH

1. ŌZORA HARUYA: 
2. TENZO RAIZO: 
3. ŌZORAHARUYA: 
4. TENDOU RAIZO: 
5. ŌZORA HARUYA: 
6. I want to marry Miu. 
7. TENDO RAIZO: 
8. ŌZORA HARUYA:  (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.

Huh?

ah, um...sir, please give me your daughter.

Huh?

I...I want to be with Miu forever.

I? Miu? (harrumph)

I...I'm sorry. I want to marry Miu.
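
Until the extractor itself handles this, one possible stopgap is to post-process the extracted string. The sketch below is not part of unipdf and does not fix the underlying ordering problem; `rejoinListMarkers` is a hypothetical helper that uses only the standard library. It assumes that an orphaned marker is a line containing nothing but a number and a period, and that the text lines following a run of such markers appear in the same order as the markers, as in the first list above. It does not handle the second list, where the speaker name stays with the marker and only the dialogue is displaced.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// loneMarker matches a line that holds only a list number such as "6." or "12.".
var loneMarker = regexp.MustCompile(`^\d+\.$`)

// rejoinListMarkers is a hypothetical post-processing helper (not a unipdf API).
// It pairs each orphaned list marker with the next non-empty text line, in order.
// Blank lines between a pending marker and its text are dropped.
func rejoinListMarkers(text string) string {
	var out []string
	var pending []string // markers still waiting for their item text

	for _, line := range strings.Split(text, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		case trimmed == "":
			// Keep blank lines only when no marker is waiting for text.
			if len(pending) == 0 {
				out = append(out, line)
			}
		case loneMarker.MatchString(trimmed):
			pending = append(pending, trimmed)
		case len(pending) > 0:
			// Attach the oldest waiting marker to this text line.
			out = append(out, pending[0]+" "+trimmed)
			pending = pending[1:]
		default:
			out = append(out, line)
		}
	}

	// Markers that never found any text are emitted unchanged.
	out = append(out, pending...)
	return strings.Join(out, "\n")
}

func main() {
	extracted := "6.\n\nMiu to kekkon shitai desu.\n\n9.\n\n10.\n\n" +
		"O, o-jō-san o kudasai.\n\nTENDŌ RAIZŌ: Kimi wa..."
	fmt.Println(rejoinListMarkers(extracted))
}
```

In the reproduction program, this would be applied to the string returned by `pt.Text()` before printing. Because the pairing is purely positional, it can mislabel lines if the extractor emits the displaced text in a different order than the markers.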

Attachments

Include a self-contained reproducible code snippet and PDF file that demonstrates the issue.

B_S4L4_p4_github.pdf

package main

import (
    "fmt"
    "os"

    "github.com/unidoc/unipdf/v3/extractor"
    "github.com/unidoc/unipdf/v3/model"
)

func main() {
    if err := outputPdfText(os.Args[1]); err != nil {
        panic(err)
    }
}

// outputPdfText prints out contents of PDF file to stdout.
func outputPdfText(inputPath string) error {
    f, err := os.Open(inputPath)
    if err != nil {
        return err
    }

    defer f.Close()

    pdfReader, err := model.NewPdfReader(f)
    if err != nil {
        return err
    }

    numPages, err := pdfReader.GetNumPages()
    if err != nil {
        return err
    }

    fmt.Printf("--------------------\n")
    fmt.Printf("PDF to text extraction:\n")
    fmt.Printf("--------------------\n")
    for i := 0; i < numPages; i++ {
        pageNum := i + 1

        page, err := pdfReader.GetPage(pageNum)
        if err != nil {
            return err
        }

        ex, err := extractor.New(page)
        if err != nil {
            return err
        }

        pt, _, _, err := ex.ExtractPageText()
        if err != nil {
            return err
        }

        text := pt.Text()
        // text, err := ex.ExtractText()
        // if err != nil {
        //  return err
        // }

        fmt.Println("------------------------------")
        fmt.Printf("Page %d:\n", pageNum)
        fmt.Printf("\"%s\"\n", text)
        fmt.Println("------------------------------")
    }

    return nil
}

github-actions[bot] commented 1 year ago

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

traitman commented 1 year ago

related: https://github.com/unidoc/unipdf/issues/38