I just uploaded the PDF file (see the original post above) that triggers the crash during text extraction. The same crash was observed with at least 34 other PDF files.
Here is the code to test (command line: .\PDF_Extract.exe 024d74b8-6f7a-4052-a602-1b4f2ba2ddcd.pdf):
package main

import (
	"fmt"
	"os"

	"github.com/unidoc/unipdf/common/license"
	"github.com/unidoc/unipdf/extractor"
	pdf "github.com/unidoc/unipdf/model"
)

const unidocLicenseKey string = `
INSERT_LICENSE_HERE
`
const unidocLicenseCustomerName string = "INSERT_LICENSE_HERE"

func init() {
	// load the unidoc license (v3)
	license.SetLicenseKey(unidocLicenseKey, unidocLicenseCustomerName)
}

func main() {
	inputPath := os.Args[1]
	err := outputPdfText(inputPath)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		os.Exit(1)
	}
}

// outputPdfText prints out contents of PDF file to stdout.
func outputPdfText(inputPath string) error {
	f, err := os.Open(inputPath)
	if err != nil {
		return err
	}
	defer f.Close()

	pdfReader, err := pdf.NewPdfReader(f)
	if err != nil {
		return err
	}

	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		return err
	}

	fmt.Printf("--------------------\n")
	fmt.Printf("PDF to text extraction:\n")
	fmt.Printf("--------------------\n")
	for i := 0; i < numPages; i++ {
		pageNum := i + 1

		page, err := pdfReader.GetPage(pageNum)
		if err != nil {
			return err
		}

		ex, err := extractor.New(page)
		if err != nil {
			return err
		}

		text, err := ex.ExtractText()
		if err != nil {
			return err
		}

		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", pageNum)
		fmt.Printf("\"%s\"\n", text)
		fmt.Println("------------------------------")
	}

	return nil
}
Fixed in commit 16b59e59af4d8a2f2d2bd58af7612aa62c0bedda
Description
The underlying code has two sub-functions, growDown and growRight, which access a slice without a boundary check: https://github.com/unidoc/unipdf/blob/60bb6967add62f2ebf414668d82043fc7115ad0e/extractor/text_table.go#L126-L140
Here is the file that triggers the crash when attempting to extract its text: 024d74b8-6f7a-4052-a602-1b4f2ba2ddcd.pdf
Expected Behavior
The functions growDown and growRight should check the bounds of the slice and return early if the index is out of range.
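To illustrate the kind of check I mean, here is a minimal standalone sketch. It is not the unipdf code: the cell type, the grid layout, and the signatures of growRight and growDown below are assumptions made up for the example, and only the function names are borrowed from text_table.go. The point is just that every index is compared against len() before the slice is accessed, so the functions return early instead of panicking.

```go
package main

import "fmt"

// cell is a hypothetical stand-in for a table cell; it is NOT a unipdf type.
type cell struct{ filled bool }

// growRight counts the consecutive filled cells to the right of column x.
// Every index is checked against len(row) before the slice is accessed.
func growRight(row []cell, x int) int {
	if x < 0 || x >= len(row) {
		return 0 // start index out of range: return instead of panicking
	}
	n := 0
	for i := x + 1; i < len(row) && row[i].filled; i++ {
		n++
	}
	return n
}

// growDown counts the consecutive filled cells below (x, y) in a grid
// whose rows may have different lengths.
func growDown(grid [][]cell, x, y int) int {
	if y < 0 || y >= len(grid) {
		return 0 // start row out of range
	}
	n := 0
	for j := y + 1; j < len(grid); j++ {
		// A shorter row may not have column x at all, so check first.
		if x < 0 || x >= len(grid[j]) || !grid[j][x].filled {
			break
		}
		n++
	}
	return n
}

func main() {
	// A ragged grid: each row is shorter than the one above it.
	grid := [][]cell{
		{{true}, {true}, {true}},
		{{true}, {true}},
		{{true}},
	}
	fmt.Println(growRight(grid[0], 0)) // 2
	fmt.Println(growDown(grid, 0, 0))  // 2
	fmt.Println(growDown(grid, 2, 0))  // 0, row 1 has no column 2
	fmt.Println(growRight(grid[0], 5)) // 0, start index out of range
}
```

With a ragged grid like the one in main, an unchecked grid[j][x] would panic with "index out of range" on the shorter rows, which matches the type of crash reported here.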