jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

Support font subsetting to reduce size of pdf #103

Open Yang-Xijie opened 2 years ago

Yang-Xijie commented 2 years ago

Describe the bug

I want to add Chinese and Japanese text to a PDF. The Chinese and Japanese characters (は哈) render correctly, but the output PDF is too large (14 MB).

I read the examples document and found chapter 8.6.2, Composite fonts. I only want to embed the characters that are actually used, that is, extract the glyphs for those characters from the font and package just those in the PDF file. How can I achieve this with borb? Is there a configuration option for it?

To Reproduce

Steps to reproduce the behaviour:

Download Microsoft Yahei.ttf at https://github.com/dolbydu/font/blob/master/unicode/Microsoft%20Yahei.ttf

from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.pdf import PDF
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
import time

from pathlib import Path

def print_current_time():
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))

if __name__ == "__main__":

    print_current_time()

    font_path = Path(__file__).parent / "font" / "Microsoft Yahei.ttf"
    custom_font = TrueTypeFont.true_type_font_from_file(font_path)

    print_current_time()

    doc = Document()
    page = Page()
    doc.append_page(page)
    layout = SingleColumnLayout(page)
    layout.add(Paragraph("はははは哈哈", font=custom_font))

    print_current_time()

    timestamp = time.strftime("%Y_%m_%d_%H_%M_%S", time.localtime())
    pdf_name = timestamp + ".pdf"
    pdf_path = Path(__file__).parent / "pdf" / pdf_name
    with open(pdf_path, "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)

    print_current_time()

2022-05-27 21:19:11
2022-05-27 21:19:26
2022-05-27 21:19:27
2022-05-27 21:20:02

[ 288]  .
├── [  97]  README.md
├── [ 128]  font
│   ├── [ 21M]  Microsoft Yahei.ttf
│   └── [ 74M]  PingFang.ttc
├── [1.3K]  main.py
└── [  96]  pdf
    └── [ 14M]  2022_05_27_20_49_11.pdf

Expected behaviour

The size of PDF file should be less than 1MB.


jorisschellekens commented 2 years ago

In order to reduce the size of the pdf, borb would need to perform font subsetting.

This is when a pdf contains a special "made up" font that contains only those characters that are actually used in the document.

So for instance, if you created a pdf containing only the text "Hello World" you would find a font inside the pdf that only contains the characters H, e, l, o, W, r and d.
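
To make the idea concrete, here is a minimal sketch (not borb code) that collects the distinct characters a subset font would have to cover for a given text. The helper name `characters_to_keep` is made up for illustration, and skipping whitespace is a simplification so the result matches the list above; a real subset would also need a glyph for the space if the text uses one.

```python
def characters_to_keep(text: str) -> list:
    """Return the distinct non-whitespace characters of `text`, in first-seen order."""
    # dict.fromkeys de-duplicates while preserving insertion order.
    return [ch for ch in dict.fromkeys(text) if not ch.isspace()]


print(characters_to_keep("Hello World"))  # ['H', 'e', 'l', 'o', 'W', 'r', 'd']
```

For the text in the original report, "はははは哈哈", the same helper returns just two characters, which is why a subset of a 21 MB CJK font can be so small.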

Font subsetting is currently not supported in borb.

Kind regards, Joris Schellekens

Yang-Xijie commented 2 years ago

Thanks for your reply!

Font subsetting is such an important feature for languages with large character sets. Hope that borb will support it soon.

orklann commented 2 years ago

@jorisschellekens Since you already use fonttools, subsetting TrueType fonts with it is simple; see this example:

https://github.com/orklann/caprice/blob/main/caprice/font/truetype/font.py#L89

For non-Latin TrueType fonts, subsetting is an important feature, since fonts in this category are usually very large.
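
As a rough sketch of what subsetting with fonttools can look like: this is neither the linked caprice code nor borb's implementation, just an illustration built on `fontTools.subset.Subsetter` and `fontTools.ttLib.TTFont`. The function names `codepoints_for` and `subset_font_file` are hypothetical, and default `Options()` are assumed to be good enough.

```python
from pathlib import Path


def codepoints_for(text: str) -> set:
    """Unicode code points that the subset font must still cover."""
    return {ord(ch) for ch in text}


def subset_font_file(src: Path, dst: Path, text: str) -> None:
    """Write a copy of the font at `src` that keeps only the glyphs needed for `text`."""
    # Imported here so codepoints_for() is usable even without fonttools installed.
    from fontTools.subset import Options, Subsetter
    from fontTools.ttLib import TTFont

    font = TTFont(str(src))
    subsetter = Subsetter(options=Options())
    # Tell the subsetter which code points to keep; everything else is dropped.
    subsetter.populate(unicodes=codepoints_for(text))
    subsetter.subset(font)
    font.save(str(dst))
```

A call such as `subset_font_file(Path("font/Microsoft Yahei.ttf"), Path("font/subset.ttf"), "はははは哈哈")` would then write the reduced font (the output path is an assumption). Embedding that font in a PDF correctly is a separate step that this sketch does not cover.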

jorisschellekens commented 2 years ago

I think I may have found a way to do this.

Both of these files were created with borb, one of them contains a subset Font, and the other does not. It's going to need more tests, and running all the existing tests. But I think this may just work :-)

output_without_subsetting.pdf output_with_subsetting.pdf

jorisschellekens commented 2 years ago

:heavy_check_mark: According to the PDF validator I use (vera pdf), my output is a valid PDF. :heavy_check_mark: The code has been documented, :heavy_check_mark: a test has been added to verify both the subset and not-subset document.

Next I want to try it with your particular font and code, and see whether the results still hold. If that turns out to be the case, this feature will be included in the next release.

Kind regards, Joris Schellekens

jorisschellekens commented 2 years ago

Turns out I already had a test using Simhei.ttf. Same results.

I'm also going to attach the subset version of that PDF to this ticket, so you can verify for yourself. output_001.pdf

That means this feature will be included in the next release :mega:

Kind regards, Joris Schellekens

Yang-Xijie commented 2 years ago

> I think I may have found a way to do this.
>
> Both of these files were created with borb, one of them contains a subset Font, and the other does not. It's going to need more tests, and running all the existing tests. But I think this may just work :-)
>
> output_without_subsetting.pdf output_with_subsetting.pdf

These two PDFs look different in Preview (the default PDF viewer) on macOS 12.4.

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

It might not be the expected behaviour.

Yang-Xijie commented 2 years ago

> Turns out I already had a test using Simhei.ttf. Same results.
>
>   • The font file is roughly 10 MB.
>   • Without font subsetting, the PDF (containing "你好世界") is 5.5 MB.
>   • With font subsetting, the PDF is 3.2 KB.
>
> I'm also going to attach the subset version of that PDF to this ticket, so you can verify for yourself. output_001.pdf
>
> That means this feature will be included in the next release 📣
>
> Kind regards, Joris Schellekens

The attached PDF appears blank when opened in Preview (the default PDF viewer) on macOS 12.4.

image

However, you said that you added "你好世界" in this PDF. It might not be the expected behavior.

jorisschellekens commented 2 years ago

That is definitely not the expected behaviour.

It's using a substitute font (so it's claiming that it can't find the font file inside the PDF)

Can you open it in Adobe?
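
One quick way to check whether a PDF declares an embedded font program at all is to look for the `/FontFile`, `/FontFile2` or `/FontFile3` keys that the PDF specification uses in font descriptors. The sketch below is a crude byte-level scan, not a PDF parser: the function name `embedded_font_keys` is made up, and it assumes the font descriptor is stored uncompressed (keys inside compressed object streams would be missed).

```python
import re
from typing import Set


def embedded_font_keys(pdf_bytes: bytes) -> Set[str]:
    """Return which /FontFile* keys occur in the raw bytes of a PDF."""
    # /FontFile = Type 1, /FontFile2 = TrueType, /FontFile3 = CFF/OpenType and others.
    # The lookahead stops longer names from being counted as a match.
    found = re.findall(rb"/FontFile[23]?(?![0-9A-Za-z])", pdf_bytes)
    return {key.decode("ascii") for key in found}


sample = b"<< /Type /FontDescriptor /FontName /ABCDEF+SimHei /FontFile2 12 0 R >>"
print(embedded_font_keys(sample))  # {'/FontFile2'}
```

An empty result for a real file (read with `open(path, "rb").read()`) would be consistent with the viewer falling back to a substitute font, though a key being present does not prove that the embedded font data is usable.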

Yang-Xijie commented 2 years ago

Chrome 103.0.5060.114 (Official Build) (x86_64) on macOS 12.4

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

output_001.pdf

image

Yang-Xijie commented 2 years ago

It seems that some requirements of the PDF standard are not satisfied.

Yang-Xijie commented 2 years ago

Adobe Acrobat Reader DC Version 2022.001.20142 on macOS 12.4

Architecture: x86_64 Processor: Intel Build: 22.1.20142.0 AGM: 4.30.117 CoolType: 6.2.1 JP2K: 2.0.6.50420

output_without_subsetting.pdf

image

output_with_subsetting.pdf

image

output_001.pdf

blank

Yang-Xijie commented 2 years ago

It is weird: I received your comment by email, but I cannot find it on GitHub.

image

macOS 12.4 Preview.app & Chrome.app & Safari.app

image

jorisschellekens commented 2 years ago

After having discussed this issue with another PDF expert, it seems like the actual subsetting of the font (rather than the dictionaries in the PDF) is going awry.

Sadly, that makes this problem a bit trickier. Currently I use fonttools to do the subsetting. And I'd prefer to keep most of that functionality delegated to an external library.
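
A first sanity check on a subset font, sketched here under assumptions, is to confirm that its character map still covers the text it was made for. `missing_characters` and `load_cmap` are hypothetical helper names; `load_cmap` relies on fonttools' `TTFont.getBestCmap()`. This only inspects the font's cmap table, so it cannot tell whether the glyph IDs referenced from the PDF still match the subset font.

```python
def missing_characters(cmap: dict, text: str) -> list:
    """Characters of `text` with no entry in a code point -> glyph name mapping."""
    return [ch for ch in dict.fromkeys(text) if ord(ch) not in cmap]


def load_cmap(font_path: str) -> dict:
    """Read the best Unicode cmap of a font file (requires fonttools)."""
    from fontTools.ttLib import TTFont

    # getBestCmap() returns None when the font has no Unicode cmap subtable.
    return TTFont(font_path).getBestCmap() or {}


# Hand-written mapping standing in for a subset that lost two characters.
example_cmap = {0x4F60: "uni4F60", 0x597D: "uni597D"}
print(missing_characters(example_cmap, "你好世界"))  # ['世', '界']
```

With a real file, `missing_characters(load_cmap("subset.ttf"), "你好世界")` should return an empty list if the subset kept everything the text needs (the file name is an assumption).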