unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

[BUG] Unipdf is messing up the unicode while merging Searchable PDF documents and applying PDF/A-1a #550

Closed sagar-kalburgi-ripcord closed 2 months ago

sagar-kalburgi-ripcord commented 3 months ago

Description

When I merge two searchable PDF documents (produced by an OCR engine) using unipdf, I also specify that the PDF/A-1a standard should be applied before the merged result is written to an output PDF file. When I open the output file, the Unicode text is corrupted: I cannot search for any word that appears in it, and when I copy some of its text and paste it into a notepad, I get this unrecognizable text:

􀀀ı􀀀ˇ􀀀􀀀􀀀   􀀀􀀀  􀀀 􀀀 􀀀 􀀀􀀀􀀀   􀀀􀀀 ˇ􀀀􀀀  􀀀􀀀 􀀀􀀀 􀀀 ı􀀀 􀀀 􀀀􀀀ˇ􀀀􀀀ı 􀀀   􀀀􀀀  􀀀 􀀀􀀀􀀀   􀀀􀀀  􀀀􀀀  􀀀􀀀􀀀􀀀 

 􀀀􀀀  􀀀􀀀􀀀􀀀    􀀀􀀀􀀀􀀀   ı􀀀 􀀀 􀀀 􀀀 􀀀 􀀀 􀀀 􀀀 􀀀  􀀀 􀀀 􀀀􀀀􀀀   􀀀􀀀  􀀀 􀀀􀀀􀀀   􀀀􀀀  􀀀􀀀 

Expected Behavior

When I open the output PDF file and search for a word that appears in it, the word should be found. And when I copy the text of the output PDF file and paste it into a notepad, the pasted text should match the text in the document exactly.
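The check described above can be roughly automated. The following is a minimal sketch, not part of unipdf: `privateUseRatio` is a hypothetical helper that measures how much of a string consists of private-use code points or U+FFFD, which is what the pasted sample appears to be made of. How the text is obtained from the PDF (copy and paste, or a text extractor) is outside the sketch.

```go
package main

import (
	"fmt"
	"unicode"
)

// privateUseRatio is a hypothetical helper (not a unipdf API). It returns the
// fraction of non-whitespace runes in s that are private-use code points
// (Unicode category Co) or the replacement character U+FFFD.
func privateUseRatio(s string) float64 {
	var total, bad int
	for _, r := range s {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		if unicode.Is(unicode.Co, r) || r == unicode.ReplacementChar {
			bad++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(bad) / float64(total)
}

func main() {
	// Assumed stand-in for text copied from a broken output file:
	// U+100000 is a private-use code point.
	garbled := "\U00100000\u0131\U00100000\u02C7\U00100000"
	clean := "Invoice total"

	fmt.Printf("garbled: %.2f\n", privateUseRatio(garbled)) // prints "garbled: 0.60"
	fmt.Printf("clean:   %.2f\n", privateUseRatio(clean))   // prints "clean:   0.00"
}
```

A ratio near zero suggests the text layer survived; a high ratio suggests the character mapping was lost. Any pass/fail threshold is a judgment call for the documents at hand.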

Actual Behavior

Steps to reproduce the behavior: run the sample program below with a valid UniDoc license and the attached PDF files.

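package main

import (
    "fmt"
    "log"
    "os"

    "github.com/unidoc/unipdf/v3/model"
    "github.com/unidoc/unipdf/v3/model/pdfa"
)

func main() {
    // A valid UniDoc license must be loaded before calling into unipdf (omitted here).

    // Open both inputs; the returned file handles stay open while pages are read.
    reader1, f1, err := model.NewPdfReaderFromFile("Page 1 of OCRed image0065.pdf", nil)
    if err != nil {
        log.Fatalf("Error opening first input: %v", err)
    }
    defer f1.Close()

    reader2, f2, err := model.NewPdfReaderFromFile("Page 1 of OCRed image0069.pdf", nil)
    if err != nil {
        log.Fatalf("Error opening second input: %v", err)
    }
    defer f2.Close()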

    writer := model.NewPdfWriter()

    for _, reader := range []*model.PdfReader{reader1, reader2} {
        numPages, _ := reader.GetNumPages()
        for i := 1; i <= numPages; i++ {
            page, err := reader.GetPage(i)
            if err != nil {
                log.Fatalf("Error getting page %d: %v", i, err)
            }

            err = writer.AddPage(page)
            if err != nil {
                log.Fatalf("Error adding page %d: %v", i, err)
            }
        }
    }

    writer.ApplyStandard(model.StandardApplier(pdfa.NewProfile1A(pdfa.DefaultProfile1Options())))
    err = writer.WriteToFile("merged.pdf")
    if err != nil {
        log.Fatalf("Error writing merged file: %v", err)
    }

    err = checkCompliancePdfA1a("merged.pdf")
    if err != nil {
        panic(err)
    }

    fmt.Printf("The document is compliant with the standard PDF/A-1a\n")

}

func checkCompliancePdfA1a(fileName string) error {
    // Open up the file with given name.
    f, err := os.Open(fileName)
    if err != nil {
        return err
    }
    defer f.Close()

    // Prepare compliant document reader.
    r, err := model.NewCompliancePdfReader(f)
    if err != nil {
        return err
    }

    // Define to which standard we want to check document compliance.
    profile1A := pdfa.NewProfile1A(pdfa.DefaultProfile1Options())

    // Verify the standard.
    if err = profile1A.ValidateStandard(r); err != nil {
        return err
    }

    return nil
}

Attachments

2 PDF files have been attached

sagar-kalburgi-ripcord commented 3 months ago

Also, if it helps: the two documents I attached to this ticket claim to be PDF/A-1a compliant, but when I validate either of them with unipdf's PDF/A-1a validation function, validation fails with some errors. And when I enforce the PDF/A-1a standard on the merged document, the Unicode text gets corrupted and I cannot search for any text.

sagar-kalburgi-ripcord commented 3 months ago

Page 1 of OCRed image0069.pdf Delta.pdf

Please find the two PDF files I used for the sample program

anovik commented 2 months ago

Hello @sagar-kalburgi-ripcord, we have fixed your issue. The fix has been merged into the development branch of the unipdf source repository (I believe you have access to it and can test it).

It will also be included in the next release of UniPDF; we will let you know when it is out.

sagar-kalburgi-ripcord commented 2 months ago

Hi @anovik,

Thanks! Sure I will test it. Any idea when the next release of UniPDF is going to be?

anovik commented 2 months ago

@sagar-kalburgi-ripcord It is planned for the end of April.

sagar-kalburgi-ripcord commented 2 months ago

@anovik I tested the fix from your development branch and it looks good! I'm afraid the end of April is too late for us. This issue is critically impacting Production, and we are losing revenue every day it remains open. Would it be possible for you to provide a hotfix release as soon as possible?

anovik commented 2 months ago

@sagar-kalburgi-ripcord We completely understand the urgency of your situation and are prioritizing the release of a hotfix to address this issue as quickly as possible.

We'll keep you updated on the progress and notify you as soon as the release is completed.

sagar-kalburgi-ripcord commented 2 months ago

ok thanks!

anovik commented 2 months ago

@sagar-kalburgi-ripcord The new release of UniPDF is available at https://github.com/unidoc/unipdf/releases/tag/v3.57.0 and it includes the fix for this issue.

Closing the current ticket, feel free to re-open it in case of any problems.