internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Add another font beyond the glyphless font to actually render fonts of the languages that are in use #4

Open MerlijnWajer opened 2 years ago

MerlijnWajer commented 2 years ago

There is an old branch here that implements the concept:

https://github.com/internetarchive/archive-pdf-tools/tree/show-text-on-selection

It looks a bit messy, and the code is old (in particular the font-size handling dates from when I wrote it), but something was working back then:

(two screenshots attached)

I believe this table lists fonts that we could expect to be available (system-wide?):

| Font Name | Installed Base Font | Comments |
|-----------|---------------------|----------|
| china-s | Heiti | simplified Chinese |
| china-ss | Song | simplified Chinese (serif) |
| china-t | Fangti | traditional Chinese |
| china-ts | Ming | traditional Chinese (serif) |
| japan | Gothic | Japanese |
| japan-s | Mincho | Japanese (serif) |
| korea | Dotum | Korean |
| korea-s | Batang | Korean (serif) |

The question then becomes: what do we do for Arabic fonts?

We will want to add the language to the word data returned by archive-hocr-tools, and then insert the right font on a per-page basis.
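A minimal sketch of what that per-page selection could look like, assuming each word is a dict carrying a `lang` key. That key is hypothetical, since archive-hocr-tools does not return it yet. The language codes, the `LANG_TO_FONT` mapping and the `fonts_for_page` helper are also made up for illustration; only the font names come from the table above. Anything without a mapping (Arabic included, for now) falls back to the glyphless font.

```python
# Hypothetical mapping from an OCR language code to a font name from the
# table above. The codes are Tesseract-style and are an assumption.
LANG_TO_FONT = {
    "chi_sim": "china-s",
    "chi_tra": "china-t",
    "jpn": "japan",
    "kor": "korea",
}

# Placeholder name standing in for the existing glyphless font.
DEFAULT_FONT = "glyphless"


def fonts_for_page(words, default=DEFAULT_FONT):
    """Return the font names needed for one page, in first-seen order.

    `words` is a list of dicts; each may carry a 'lang' key (assumed).
    Words with a missing or unmapped language use the default font.
    """
    fonts = []
    for word in words:
        font = LANG_TO_FONT.get(word.get("lang"), default)
        if font not in fonts:
            fonts.append(font)
    return fonts


page_words = [
    {"text": "hello", "lang": "eng"},
    {"text": "日本", "lang": "jpn"},
    {"text": "world"},  # no language information
]
print(fonts_for_page(page_words))
```

Returning a list per page means only the fonts a page actually needs get inserted, rather than embedding every font in every page.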

(Old bug: https://git.archive.org/merlijn/archive-pdf-tools/-/issues/4)