unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

[BUG] textTable.growTable index out of range crash #394

Closed Kleissner closed 3 years ago

Kleissner commented 4 years ago

Description

panic: runtime error: index out of range [2] with length 2

goroutine 16 [running]:
github.com/unidoc/unipdf/extractor.(*textTable).growTable.func2(0xc00f8a0f30, 0x2, 0x2)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text_table.go:137 +0xe4
github.com/unidoc/unipdf/extractor.(*textTable).growTable(0xc00f8d8540)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text_table.go:150 +0x1df
github.com/unidoc/unipdf/extractor.paraList.findTables(0xc00de50580, 0x2c, 0x2c, 0xc126585410, 0xbb2230, 0xc01e44d5b8)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text_table.go:69 +0x17f
github.com/unidoc/unipdf/extractor.paraList.extractTables(0xc00de50580, 0x2c, 0x2c, 0xc00de50580, 0x2c, 0x40)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text_table.go:41 +0x76
github.com/unidoc/unipdf/extractor.makeTextPage(0xc0076a4800, 0x4c1, 0x500, 0x0, 0x0, 0x407b75eb851eb852, 0x4084d122d0e56042, 0x500, 0xd39040, 0xc01f137e70)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text_page.go:71 +0x406
github.com/unidoc/unipdf/extractor.(*PageText).computeViews(0xc0055a4080)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:942 +0x244
github.com/unidoc/unipdf/extractor.(*Extractor).ExtractPageText(0xc01e017260, 0xc002a3f710, 0xc01e44d8d8, 0x40e8c4, 0xdcbe80, 0xc002a3f710)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:57 +0x187
github.com/unidoc/unipdf/extractor.(*Extractor).ExtractTextWithStats(0xc01e017260, 0xc01e017260, 0x0, 0x0, 0x0, 0x0, 0x0)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:42 +0x47
github.com/unidoc/unipdf/extractor.(*Extractor).ExtractText(...)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:35

The underlying code has two sub-functions, growDown and growRight, which index into a slice without a bounds check.

https://github.com/unidoc/unipdf/blob/60bb6967add62f2ebf414668d82043fc7115ad0e/extractor/text_table.go#L126-L140

Here is the file that triggers the crash when attempting to extract its text: 024d74b8-6f7a-4052-a602-1b4f2ba2ddcd.pdf

Expected Behavior

The functions growDown and growRight should check the bounds of the slice and return early if the index is out of range.
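
The kind of guard meant here can be sketched as follows. This is a minimal, self-contained illustration and not the unipdf code: the `cell` type, the `cellAt` helper, and the `canGrowDown` function are hypothetical names, and the grid layout is an assumption chosen only to show the bounds check.

```go
package main

import "fmt"

// cell is a hypothetical stand-in for a table cell; it is not a unipdf type.
type cell struct{ text string }

// cellAt returns the cell at (row, col). It reports false instead of
// panicking when either index falls outside the grid.
func cellAt(grid [][]*cell, row, col int) (*cell, bool) {
	if row < 0 || row >= len(grid) {
		return nil, false
	}
	if col < 0 || col >= len(grid[row]) {
		return nil, false
	}
	return grid[row][col], true
}

// canGrowDown reports whether the row below `bottom` has a non-nil cell in
// each of the first `width` columns. An out-of-range row or column simply
// means the table cannot grow.
func canGrowDown(grid [][]*cell, bottom, width int) bool {
	for col := 0; col < width; col++ {
		c, ok := cellAt(grid, bottom+1, col)
		if !ok || c == nil {
			return false
		}
	}
	return true
}

func main() {
	grid := [][]*cell{
		{{text: "a"}, {text: "b"}},
		{{text: "c"}, {text: "d"}},
	}
	fmt.Println(canGrowDown(grid, 0, 2)) // true: row 1 exists
	fmt.Println(canGrowDown(grid, 1, 2)) // false: row 2 is out of range, no panic
}
```

Routing every access through one checked helper keeps the grow functions themselves free of repeated index arithmetic.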

Kleissner commented 3 years ago

Just uploaded the PDF file (see above original post) that triggers the crash upon text extraction. The same crash was observed with at least 34 other PDF files.
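
When running over a large batch of files, one possible stopgap is to convert the panic into an error so that a single file does not abort the whole run. The sketch below uses only the standard library; the `safely` helper is a hypothetical name, and the closure stands in for the extraction call rather than calling unipdf.

```go
package main

import "fmt"

// safely runs fn and converts any panic raised inside it into an error.
// It is a hypothetical helper, not part of unipdf.
func safely(fn func() (string, error)) (text string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return fn()
}

func main() {
	// Stand-in for the extraction call: indexes one past the end of a slice,
	// the same class of failure as in the stack trace above.
	_, err := safely(func() (string, error) {
		s := []int{1, 2}
		i := 2
		_ = s[i]
		return "", nil
	})
	if err != nil {
		fmt.Println("skipping file:", err)
	}
}
```

This only contains the crash; it does not fix the missing bounds check.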

Here is the code to reproduce it (command line: .\PDF_Extract.exe 024d74b8-6f7a-4052-a602-1b4f2ba2ddcd.pdf):

package main

import (
    "fmt"
    "os"

    "github.com/unidoc/unipdf/common/license"
    "github.com/unidoc/unipdf/extractor"
    pdf "github.com/unidoc/unipdf/model"
)

const unidocLicenseKey string = `
INSERT_LICENSE_HERE
`
const unidocLicenseCustomerName string = "INSERT_LICENSE_HERE"

func init() {
    // load the unidoc license (v3)
    license.SetLicenseKey(unidocLicenseKey, unidocLicenseCustomerName)
}

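func main() {
    // Require the input path so that a missing argument does not itself panic.
    if len(os.Args) < 2 {
        fmt.Printf("Usage: %s input.pdf\n", os.Args[0])
        os.Exit(1)
    }
    inputPath := os.Args[1]

    err := outputPdfText(inputPath)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        os.Exit(1)
    }
}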

// outputPdfText prints out contents of PDF file to stdout.
func outputPdfText(inputPath string) error {
    f, err := os.Open(inputPath)
    if err != nil {
        return err
    }

    defer f.Close()

    pdfReader, err := pdf.NewPdfReader(f)
    if err != nil {
        return err
    }

    numPages, err := pdfReader.GetNumPages()
    if err != nil {
        return err
    }

    fmt.Printf("--------------------\n")
    fmt.Printf("PDF to text extraction:\n")
    fmt.Printf("--------------------\n")
    for i := 0; i < numPages; i++ {
        pageNum := i + 1

        page, err := pdfReader.GetPage(pageNum)
        if err != nil {
            return err
        }

        ex, err := extractor.New(page)
        if err != nil {
            return err
        }

        text, err := ex.ExtractText()
        if err != nil {
            return err
        }

        fmt.Println("------------------------------")
        fmt.Printf("Page %d:\n", pageNum)
        fmt.Printf("\"%s\"\n", text)
        fmt.Println("------------------------------")
    }

    return nil
}

Kleissner commented 3 years ago

Fixed in commit 16b59e59af4d8a2f2d2bd58af7612aa62c0bedda