Closed holzschu closed 2 years ago
Hi there,
I simply adhere to the PDF specification (which can also be found in the repository). This document clearly states the encoding table(s) possible for those particular fonts.
Helvetica does not allow for Japanese characters. Simple as that.
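To illustrate the limitation, here is a minimal sketch of the kind of check involved. It assumes Python's built-in `cp1252` codec as a rough stand-in for the single-byte encoding tables the PDF specification allows for the standard fonts. The helper `unsupported_characters` is hypothetical and is not part of borb.

```python
from typing import List


def unsupported_characters(text: str, codec: str = "cp1252") -> List[str]:
    """Return the distinct characters in `text` that `codec` cannot encode.

    Assumption: `cp1252` only approximates the single-byte encodings used by
    the standard PDF fonts; it is not the exact table from the specification.
    """
    missing: List[str] = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            # A single-byte table has no slot for this character.
            if ch not in missing:
                missing.append(ch)
    return missing


print(unsupported_characters("Hello"))        # []
print(unsupported_characters("Hello 日本語"))  # ['日', '本', '語']
```

A single-byte encoding has at most 256 slots, so no choice of table could cover Japanese text.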
There are several caveats with your statement:
First, programs such as Microsoft Office may decide to (sneakily) substitute a font whenever a user attempts to render unknown characters with it.
Second, there is no real way of enforcing that a particular name corresponds to a particular font (at least not in PDF). So a PDF might claim it is using only Helvetica, and may in reality be using Google Noto. Unless you inspect the file at a binary level, you would never know the difference.
What I can do for you, however, is enable other fonts in `MarkdownToPDF`. That way you could specify which font you'd like to use.
Kind regards, Joris Schellekens
Thank you very much for the detailed explanation. Yes, the ability to enable other fonts in `MarkdownToPDF` would be an excellent solution to the problem.
I'm almost done rewriting the `LayoutElement` framework in `borb`. I've also rewritten `MarkdownToPDF` and `HTMLToPDF`.
The latest release (2.1.0) already includes the bulk of those changes.
Now that everything is a bit more stable, I can look into adding extra options to the parsing process, such as allowing the user to define a list of fallback fonts in case the current font is unable to render a particular character.
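A fallback-font list could work roughly as follows. This is only a sketch under stated assumptions: each font is modelled as a hypothetical `(name, supported characters)` pair, and `split_into_font_runs` is an illustrative helper, not borb's API or the implementation that eventually shipped.

```python
from typing import List, Set, Tuple

# Hypothetical model: a font is a name plus the set of characters it can render.
Font = Tuple[str, Set[str]]


def split_into_font_runs(text: str, fonts: List[Font]) -> List[Tuple[str, str]]:
    """Split `text` into (font name, substring) runs.

    For every character, the first font in `fonts` that supports it wins,
    so the list order expresses the fallback priority.
    """
    runs: List[Tuple[str, str]] = []
    for ch in text:
        name = next((n for n, chars in fonts if ch in chars), None)
        if name is None:
            raise ValueError(f"No font in the fallback list can represent {ch!r}")
        if runs and runs[-1][0] == name:
            # Same font as the previous character: extend the current run.
            runs[-1] = (name, runs[-1][1] + ch)
        else:
            runs.append((name, ch))
    return runs


fonts = [
    ("Helvetica", {chr(i) for i in range(32, 127)}),  # printable ASCII only
    ("HypotheticalCJKFont", set("日本語")),             # made-up fallback font
]
print(split_into_font_runs("PDF 日本語 ok", fonts))
# [('Helvetica', 'PDF '), ('HypotheticalCJKFont', '日本語'), ('Helvetica', ' ok')]
```

Grouping consecutive characters into runs keeps the number of font switches low, while the primary font is still used wherever it can render the text.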
Thank you for being so patient. We're getting there!
Kind regards, Joris Schellekens
Thank you very much for your hard work. I know these kinds of changes take time to get right. I'm looking forward to working with the new version.
Hi there,
I added the feature, along with a test that converts a Markdown file containing Chinese text to PDF.
I am going to close this ticket, as no further effort is needed. You should find this functionality in the next release. I usually build a release over the weekend.
Kind regards, Joris Schellekens
Thank you very much for this.
Is your feature request related to a problem? Please describe.
I am using `MarkdownToPdf`. My markdown files occasionally contain Japanese characters. When that happens, `borb` stops with `AssertionError: Font Helvetica cannot represent '(glyph)'`.
The issue seems to be in `character_identifier_to_unicode`: https://github.com/jorisschellekens/borb/blob/938c7b256e6f8cf2ca0a658306dda3e37b3fada8/borb/pdf/canvas/font/simple_font/font_type_1.py#L473 which, in turn, points to `_character_identifier_to_unicode_lookup`. Basically, when Standard Type1 fonts are loaded, the Unicode-to-character dictionary is only built for the first 256 characters: https://github.com/jorisschellekens/borb/blob/938c7b256e6f8cf2ca0a658306dda3e37b3fada8/borb/pdf/canvas/font/simple_font/font_type_1.py#L463
Describe the solution you'd like
In all of the environments I have access to (YMMV), Standard Type1 fonts have all UTF-8 characters. If a Type1 font has all UTF-8 characters, I would like to be able to access them from inside borb.
Describe alternatives you've considered
Additional context
This will impact `MarkdownToPdf`, obviously, but also all the functions in `borb` that use Standard Type1 fonts.