Support non-standard fonts in `MarkdownToPDF`

holzschu commented 2 years ago

Is your feature request related to a problem? Please describe.

I am using `MarkdownToPDF`. My Markdown files occasionally contain Japanese characters. When that happens, borb stops with `AssertionError: Font Helvetica cannot represent '(glyph)'`.

The issue seems to be in `character_identifier_to_unicode`: https://github.com/jorisschellekens/borb/blob/938c7b256e6f8cf2ca0a658306dda3e37b3fada8/borb/pdf/canvas/font/simple_font/font_type_1.py#L473

which, in turn, points to `_character_identifier_to_unicode_lookup`. Basically, when Standard Type1 fonts are loaded, the Unicode-to-character dictionary is only built for the first 256 characters: https://github.com/jorisschellekens/borb/blob/938c7b256e6f8cf2ca0a658306dda3e37b3fada8/borb/pdf/canvas/font/simple_font/font_type_1.py#L463
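
To illustrate the failure mode, here is a minimal standalone sketch of a lookup limited to 256 single-byte codes. It is not borb's actual code: Latin-1 stands in for the font's real encoding table, and the names `build_lookup` and `unicode_to_character_identifier` are hypothetical.

```python
from typing import Dict


def build_lookup() -> Dict[str, int]:
    """Map each character of a 256-entry single-byte encoding to its code.

    Assumption: Latin-1 is used here only as a stand-in encoding table.
    """
    return {bytes([code]).decode("latin-1"): code for code in range(256)}


def unicode_to_character_identifier(
    lookup: Dict[str, int], char: str, font_name: str = "Helvetica"
) -> int:
    """Return the single-byte code for `char`, or fail with an AssertionError."""
    assert char in lookup, f"Font {font_name} cannot represent '{char}'"
    return lookup[char]


lookup = build_lookup()

# A character inside the 256-entry table resolves normally.
code_for_a = unicode_to_character_identifier(lookup, "A")

# A Japanese character is outside the table, so the assertion fires.
try:
    unicode_to_character_identifier(lookup, "日")
    japanese_supported = True
except AssertionError:
    japanese_supported = False
```

With a table of this size, any character outside the 256 entries fails in the same way, regardless of what glyphs the font file on disk may contain.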

Describe the solution you'd like

In all of the environments I have access to (YMMV), the Standard Type1 fonts appear to have glyphs for all Unicode characters. If a Type1 font has those glyphs, I would like to be able to access them from inside borb.

Describe alternatives you've considered

Additional context

This will impact `MarkdownToPDF`, obviously, but also all the functions in borb that use Standard Type1 fonts.

jorisschellekens commented 2 years ago

Hi there,

I simply adhere to the PDF specification (which can also be found in the repository). This document clearly states the encoding table(s) possible for those particular fonts.

Helvetica does not allow for Japanese characters. Simple as that.

There are several caveats with your statement:

First, programs such as Microsoft Office may decide to (sneakily) substitute a font whenever a user attempts to render unknown characters with it.
Second, there is no real way of enforcing that a particular name corresponds to a particular font (at least not in PDF). So a PDF might claim it is using only Helvetica, and may in reality be using Google Noto. Unless you inspect the file at a binary level, you would never know the difference.

What I can do for you, however, is enable other fonts in `MarkdownToPDF`. That way you could specify which font you'd like to use.

Kind regards, Joris Schellekens

holzschu commented 2 years ago

Thank you very much for the detailed explanation. Yes, the ability to enable other fonts in MarkdownToPDF would be an excellent solution to the problem.

jorisschellekens commented 2 years ago

I'm almost done rewriting the LayoutElement framework in borb. I've also rewritten MarkdownToPDF and HTMLToPDF.

The latest release (2.1.0) already includes the bulk of those changes.

Now that everything is a bit more stable, I can look into adding extra options to the parsing process, such as allowing the user to define a list of fallback fonts in case the current font is unable to render a particular character.
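
As a rough sketch of what such a fallback list could look like (this is not borb's API; the font names, coverage sets, and the functions `pick_font` and `split_into_runs` are all hypothetical), each character is assigned to the first font in the list that can represent it, and consecutive characters with the same font are grouped into runs:

```python
from typing import List, Optional, Set, Tuple

# A font is modelled as (name, set of characters it can represent).
Font = Tuple[str, Set[str]]


def pick_font(char: str, fonts: List[Font]) -> Optional[str]:
    """Return the name of the first font whose coverage includes `char`."""
    for name, coverage in fonts:
        if char in coverage:
            return name
    return None  # no font in the list can represent this character


def split_into_runs(text: str, fonts: List[Font]) -> List[Tuple[Optional[str], str]]:
    """Group consecutive characters that resolve to the same font."""
    runs: List[Tuple[Optional[str], str]] = []
    for char in text:
        font = pick_font(char, fonts)
        if runs and runs[-1][0] == font:
            # Same font as the previous character: extend the current run.
            runs[-1] = (font, runs[-1][1] + char)
        else:
            runs.append((font, char))
    return runs


# Hypothetical coverage: printable ASCII for the primary font,
# plus a handful of CJK characters for the fallback font.
latin = {chr(c) for c in range(0x20, 0x7F)}
cjk = {"日", "本", "語"}
fonts: List[Font] = [("Helvetica", latin), ("FallbackCJK", latin | cjk)]

runs = split_into_runs("borb 日本語", fonts)
```

Here the primary font keeps everything it can render, and only the characters it cannot represent fall through to the next font in the list.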

Thank you for being so patient. We're getting there!

Kind regards, Joris Schellekens

holzschu commented 2 years ago

Thank you very much for your hard work. I know these kinds of changes take time to do right. I'm looking forward to working with the new version.

jorisschellekens commented 2 years ago

Hi there,

I added the feature, and added a test that converts a Markdown file containing Chinese text to PDF.

I am going to close this ticket, as no further effort is needed. You should find this functionality in the next release. I usually build a release over the weekend.

Kind regards, Joris Schellekens

holzschu commented 2 years ago

Thank you very much for this.

Support non-standard fonts in `MarkdownToPDF` #127