jung-kurt / gofpdf

A PDF document generator with high level support for text, drawing and images
http://godoc.org/github.com/jung-kurt/gofpdf
MIT License

Generated pdf can't be validated by verapdf when using AddUTF8FontFromBytes #304

Closed sify21 closed 4 years ago

sify21 commented 4 years ago

verapdf is a PDF/A validation tool. When using the default font Arial, verapdf can validate the PDF, with some warnings.

But when using font NotoSansSC, which is added by AddUTF8FontFromBytes, the pdf can't be validated. verapdf reports an error:

Sep 20, 2019 3:50:21 PM org.verapdf.processor.ProcessorImpl validate
WARNING: Exception caught when validating item
org.verapdf.core.ValidationException: Caught unexpected runtime exception during validation
    at org.verapdf.pdfa.validation.validators.BaseValidator.validate(BaseValidator.java:95)
    at org.verapdf.processor.ProcessorImpl.validate(ProcessorImpl.java:219)
    at org.verapdf.processor.ProcessorImpl.process(ProcessorImpl.java:120)
    at org.verapdf.processor.BatchFileProcessor.processItem(BatchFileProcessor.java:98)
    at org.verapdf.processor.BatchFileProcessor.processList(BatchFileProcessor.java:74)
    at org.verapdf.processor.AbstractBatchProcessor.process(AbstractBatchProcessor.java:102)
    at org.verapdf.gui.ValidateWorker.doInBackground(ValidateWorker.java:118)
    at org.verapdf.gui.ValidateWorker.doInBackground(ValidateWorker.java:53)
    at javax.swing.SwingWorker$1.call(SwingWorker.java:295)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at javax.swing.SwingWorker.run(SwingWorker.java:334)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Error while parsing object : 5 0
    at org.verapdf.cos.COSDocument.getObject(COSDocument.java:226)
    at org.verapdf.cos.COSIndirect.getDirect(COSIndirect.java:368)
    at org.verapdf.gf.model.visitor.cos.pb.GFCosVisitor.visitFromIndirect(GFCosVisitor.java:137)
    at org.verapdf.gf.model.impl.cos.GFCosObject.getFromValue(GFCosObject.java:70)
    at org.verapdf.gf.model.impl.cos.GFCosDict.getValues(GFCosDict.java:115)
    at org.verapdf.gf.model.impl.cos.GFCosDict.getLinkedObjects(GFCosDict.java:86)
    at org.verapdf.pdfa.validation.validators.BaseValidator.addAllLinkedObjects(BaseValidator.java:199)
    at org.verapdf.pdfa.validation.validators.BaseValidator.checkNext(BaseValidator.java:166)
    at org.verapdf.pdfa.validation.validators.BaseValidator.validate(BaseValidator.java:117)
    at org.verapdf.pdfa.validation.validators.BaseValidator.validate(BaseValidator.java:93)
    ... 13 more
Caused by: java.io.IOException: PDFParser::GetDictionary()invalid pdf dictionary
    at org.verapdf.parser.COSParser.getDictionary(COSParser.java:250)
    at org.verapdf.parser.COSParser.nextObject(COSParser.java:179)
    at org.verapdf.parser.PDFParser.getObject(PDFParser.java:282)
    at org.verapdf.io.Reader.getObject(Reader.java:122)
    at org.verapdf.io.Reader.getObject(Reader.java:95)
    at org.verapdf.cos.COSDocument.getObject(COSDocument.java:217)
    ... 22 more

Does invalid pdf dictionary mean that the pdf format is broken?

jung-kurt commented 4 years ago

Thanks for the report, @sify21. I am glad to know about the verapdf tool for validating PDF/A compliance. This is an enhancement that was brought up in #144.

The problem related to AddUTF8FontFromBytes appears to be two-fold. Clearly something is going wrong with the UTF-8 implementation in gofpdf. Additionally, the problem causes verapdf to crash, so this should be brought to the attention of the authors at verapdf.org.

"Does invalid pdf dictionary mean that the pdf format is broken?"

My guess is that the UTF-8 feature in gofpdf generates some faulty output that isn't severe enough to cause problems with PDF readers but does not conform to the standard. Any help tracking this down will be greatly appreciated.

THausherr commented 4 years ago

The error is this: "/BaseFont /utf8noto sans sc" and "/BaseFont /utf8noto sans scB". The spaces are wrong.

sify21 commented 4 years ago

After removing whitespace from the familyStr argument of AddUTF8FontFromBytes, the generated PDF can be parsed by verapdf. But I don't know whether verapdf or gofpdf should change its behavior. Does the PDF format specify that font names must not contain whitespace? @THausherr

sify21 commented 4 years ago

OK, I verified that if familyStr contains whitespace, the generated PDF can't be viewed properly on Mac. So maybe the function comment should carry a note about this, or the function body should remove the whitespace. @jung-kurt
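As a caller-side workaround on a gofpdf version without a fix, the whitespace can be stripped from the family name before it is registered. The following is only a minimal sketch: stripSpaces is a hypothetical helper, not part of gofpdf, and "Noto Sans SC" is just an example family name.

```go
package main

import (
	"fmt"
	"strings"
)

// stripSpaces is a hypothetical helper (not part of gofpdf). It removes
// every run of whitespace from a font family name so that the name can be
// written into the PDF as a plain name token.
func stripSpaces(family string) string {
	// strings.Fields splits on any Unicode whitespace and drops empty
	// pieces; joining with "" glues the remaining parts together.
	return strings.Join(strings.Fields(family), "")
}

func main() {
	fmt.Println(stripSpaces("Noto Sans SC")) // prints "NotoSansSC"
}
```

The same sanitized string would have to be passed both to AddUTF8FontFromBytes and to SetFont, since the font is looked up by family name.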

sify21 commented 4 years ago

Citing from section 5.3 of this article:

The names in PDF documents are represented by a sequence of ASCII characters in the range 0x21 – 0x7E. The exception are the characters: %, (, ), <, >, [, ], {, }, / and #, which must be preceded by a slash. An alternative representation of the characters is with their hexadecimal equivalent, preceded by the character ‘#’. There is a limitation of the length of the name element, which may be only 127 bytes long.

When writing a name a slash must be used to introduce a name; the slash is not part of the name, but is a prefix indicating that what follows is a sequence of characters representing the name. If we want to use whitespace or any other special character as part of the name it must be encoded with 2-digit hexadecimal notation.
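The hexadecimal notation mentioned in the last sentence can be sketched in Go roughly as follows. This is an illustration under assumptions, not gofpdf code: escapePDFName is a hypothetical helper, and it writes whitespace, the delimiter characters, '#' itself, and any byte outside 0x21 to 0x7E as '#' plus two hex digits, which is my reading of the name rules in the PDF specification.

```go
package main

import (
	"fmt"
	"strings"
)

// escapePDFName is a hypothetical helper (not part of gofpdf). It returns
// the name with every byte that cannot appear literally in a PDF name
// replaced by '#' followed by its two-digit hexadecimal code.
func escapePDFName(name string) string {
	// Delimiter characters plus '#', which introduces an escape.
	const special = "()<>[]{}/%#"
	var sb strings.Builder
	for i := 0; i < len(name); i++ {
		b := name[i]
		if b < 0x21 || b > 0x7E || strings.IndexByte(special, b) >= 0 {
			fmt.Fprintf(&sb, "#%02X", b)
		} else {
			sb.WriteByte(b)
		}
	}
	return sb.String()
}

func main() {
	// The leading slash introduces the name and is not part of it.
	fmt.Println("/" + escapePDFName("utf8noto sans sc")) // prints "/utf8noto#20sans#20sc"
}
```

With this encoding, the BaseFont value reported as broken earlier in the thread would be written with "#20" in place of each space.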

jung-kurt commented 4 years ago

@sify21, @THausherr: could you test my fix for this? In branch issue_304, I escape spaces in the font family string by replacing them with "#20".

Create a directory named "issue_304" and place the following files in it:

main.go

package main

import (
  "fmt"

  "github.com/jung-kurt/gofpdf/v2"
)

func main() {

  pdf := gofpdf.New("L", "mm", "A4", "")
  pdf.AddUTF8Font("a b", "", "../NotoSansSC-Regular.ttf")
  pdf.SetFont("a b", "", 16)
  pdf.AddPage()
  pdf.Write(20, "Hello, 世界")
  err := pdf.OutputFileAndClose("issue_304.pdf")
  if err != nil {
    fmt.Printf("%s\n", err)
  }
}

go.mod

module issue_304

go 1.13

require github.com/jung-kurt/gofpdf/v2 v2.13.1-0.20190924173108-f7e9373a76e7

go.sum

github.com/jung-kurt/gofpdf/v2 v2.13.1-0.20190924173108-f7e9373a76e7 h1:77NlLaLGz5bzEhvHefYSafiJoKT0yxEbK1qJLuPbrsw=
github.com/jung-kurt/gofpdf/v2 v2.13.1-0.20190924173108-f7e9373a76e7/go.mod h1:RF/RGAP0AS4rd9fVZ6gb7Lbw6178P/AdAxMRW8Kn/Vk=

You will need to adjust the location of the font in the call to AddUTF8Font(). Then run go build -v followed by ./issue_304.

sify21 commented 4 years ago

@jung-kurt The result looks OK to me, and verapdf doesn't crash when validating.

THausherr commented 4 years ago

@jung-kurt I can't test your fix because I don't use Go, sorry. If you attach the result here, I can have a look at the resulting PDF with PDFDebugger to see if PDFBox has any complaints. (I did so with the first file, and the log messages told me the exact "bad" offsets.)

But from the screenshot this looks good.

sify21 commented 4 years ago

@THausherr This is the pdf file I generated locally. issue_304.pdf

THausherr commented 4 years ago

works fine 👍

jung-kurt commented 4 years ago

Thanks, @sify21 and @THausherr -- I will merge changes.

jung-kurt commented 4 years ago

Merged into master (cae7d4739e815a170819d84c5361b05306b2f019) and v2 (f7e9373a76e736ecc3aff31810376ec0486af131). Thanks for your help.