Closed sify21 closed 4 years ago
Thanks for the report, @sify21. I am glad to know about the verapdf tool for validating PDF/A compliance. This is an enhancement that was brought up in #144.
The problem related to `AddUTF8FontFromBytes` appears to be two-fold. Clearly something is going wrong with the UTF-8 implementation in gofpdf. Additionally, the problem causes verapdf to crash, so this should be brought to the attention of the authors at verapdf.org.
> Does `invalid pdf dictionary` mean that the pdf format is broken?
My guess is that the UTF-8 feature in gofpdf generates some faulty output that isn't severe enough to cause problems with PDF readers but does not conform to the standard. Any help tracking this down will be greatly appreciated.
The error is this: "/BaseFont /utf8noto sans sc" and "/BaseFont /utf8noto sans scB". The spaces are wrong.
After removing white spaces from the `familyStr` variable of `AddUTF8FontFromBytes`, the generated pdf can be parsed by verapdf. But I don't know whether verapdf or gofpdf should change its behavior. Does the pdf format specify that font names shouldn't contain white spaces? @THausherr
OK, I verified that if `familyStr` contains whitespace, the generated pdf can't be viewed properly on Mac. So maybe there should be a note in the function comment, or some code in the function body to remove whitespace. @jung-kurt
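As a caller-side workaround for the whitespace problem described above, the family name could be cleaned before it is passed to `AddUTF8FontFromBytes`. This is only a minimal sketch: `stripSpaces` is a hypothetical helper and not part of gofpdf, and it assumes that dropping whitespace entirely is acceptable for the family name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripSpaces is a hypothetical helper (not part of gofpdf) that removes
// every Unicode whitespace character from a font family name.
func stripSpaces(family string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, family)
}

func main() {
	// The cleaned name would be used for both the font registration
	// and the later SetFont call, so the two stay in sync.
	fmt.Println(stripSpaces("noto sans sc"))
}
```

The same cleaned string has to be used everywhere the family is referenced, otherwise the font lookup will not match.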
Citing from section 5.3 of this article:

> The names in PDF documents are represented by a sequence of ASCII characters in the range 0x21 – 0x7E. The exception are the characters: %, (, ), <, >, [, ], {, }, / and #, which must be preceded by a slash. An alternative representation of the characters is with their hexadecimal equivalent, preceded by the character ‘#’. There is a limitation of the length of the name element, which may be only 127 bytes long.
>
> When writing a name a slash must be used to introduce a name; the slash is not part of the name, but is a prefix indicating that what follows is a sequence of characters representing the name. If we want to use whitespace or any other special character as part of the name it must be encoded with 2-digit hexadecimal notation.
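The encoding rule quoted above can be sketched in Go roughly as follows. `escapePDFName` is a hypothetical helper, not a gofpdf function; it assumes the hexadecimal `#xx` form is used for every byte outside 0x21 to 0x7E as well as for the listed special characters, and it does not enforce the 127-byte length limit.

```go
package main

import (
	"fmt"
	"strings"
)

// escapePDFName is a hypothetical helper (not part of gofpdf). It writes
// every byte that may not appear literally in a PDF name as '#' followed
// by two hexadecimal digits. The leading slash is not added here.
func escapePDFName(name string) string {
	var b strings.Builder
	for i := 0; i < len(name); i++ { // iterate over bytes, not runes
		c := name[i]
		special := strings.IndexByte("%()<>[]{}/#", c) >= 0
		if c < 0x21 || c > 0x7E || special {
			fmt.Fprintf(&b, "#%02X", c)
		} else {
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	// The slash introduces the name and is written separately.
	fmt.Println("/BaseFont /" + escapePDFName("utf8noto sans sc"))
}
```

With this encoding a space becomes `#20`, so the family name no longer splits the `/BaseFont` entry into several tokens.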
@sify21, @THausherr: could you test my fix for this? In branch issue_304, I escape spaces in the font family string by replacing them with "#20".
Create a directory named "issue_304" and place the following files in it:
main.go

```go
package main

import (
	"fmt"

	"github.com/jung-kurt/gofpdf/v2"
)

func main() {
	pdf := gofpdf.New("L", "mm", "A4", "")
	// The family name contains a space on purpose to exercise the fix.
	pdf.AddUTF8Font("a b", "", "../NotoSansSC-Regular.ttf")
	pdf.SetFont("a b", "", 16)
	pdf.AddPage()
	pdf.Write(20, "Hello, 世界")
	err := pdf.OutputFileAndClose("issue_304.pdf")
	if err != nil {
		// Printf (not Print) is needed for the format verb to apply.
		fmt.Printf("%s\n", err)
	}
}
```
go.mod

```
module issue_304

go 1.13

require github.com/jung-kurt/gofpdf/v2 v2.13.1-0.20190924173108-f7e9373a76e7
```

go.sum

```
github.com/jung-kurt/gofpdf/v2 v2.13.1-0.20190924173108-f7e9373a76e7 h1:77NlLaLGz5bzEhvHefYSafiJoKT0yxEbK1qJLuPbrsw=
github.com/jung-kurt/gofpdf/v2 v2.13.1-0.20190924173108-f7e9373a76e7/go.mod h1:RF/RGAP0AS4rd9fVZ6gb7Lbw6178P/AdAxMRW8Kn/Vk=
```
You will need to adjust the location of the font in the call to `AddUTF8Font()`. Then issue `go build -v` and `./issue_304`.
@jung-kurt The result is ok to me.
And verapdf doesn't crash on validating.
@jung-kurt I can't test your fix because I don't use Go, sorry. If you attach it here, I can have a look at the resulting PDF with PDFDebugger to see if PDFBox has any complaints. (I did so with the first file, and the log messages told me the exact "bad" offsets.)
But from the screenshot this looks good.
@THausherr This is the pdf file I generated locally. issue_304.pdf
works fine 👍
Thanks, @sify21 and @THausherr -- I will merge changes.
Merged into master (cae7d4739e815a170819d84c5361b05306b2f019) and v2 (f7e9373a76e736ecc3aff31810376ec0486af131). Thanks for your help.
verapdf is a PDF/A validation tool. When using the default font Arial, verapdf can validate the pdf with some warnings.

![Untitled](https://user-images.githubusercontent.com/11829223/65323030-e900ad80-dbda-11e9-8db9-be447c3447d7.png)
But when using the font NotoSansSC, which is added by `AddUTF8FontFromBytes`, the pdf can't be validated. verapdf reports an error:

Does `invalid pdf dictionary` mean that the pdf format is broken?