It looks a bit messy, and the code is dated (with respect to font sizes at the time I wrote it), but something was working back then.
I believe this table lists the fonts we can expect to have available (system-wide?):
| Font Name | Installed Base Font | Comments |
|-----------|---------------------|----------|
| china-s | Heiti | simplified Chinese |
| china-ss | Song | simplified Chinese (serif) |
| china-t | Fangti | traditional Chinese |
| china-ts | Ming | traditional Chinese (serif) |
| japan | Gothic | Japanese |
| japan-s | Mincho | Japanese (serif) |
| korea | Dotum | Korean |
| korea-s | Batang | Korean (serif) |
Then the question becomes: what do we do for Arabic fonts?
We will want to add the language to the word data returned by archive-hocr-tools, and then insert the right font on a per-page basis.
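A rough sketch of how that per-page selection could look, not an actual implementation. It assumes each word is a dict carrying a `lang` key with Tesseract-style codes; that shape, the `FONT_FOR_LANG` mapping, the `DEFAULT_FONT` placeholder and the function names are all hypothetical, and archive-hocr-tools does not necessarily expose the data this way. Only the font names come from the table above.

```python
# Font names are taken from the table above. The language codes used as keys
# are an assumption (Tesseract-style codes as they typically appear in hOCR).
FONT_FOR_LANG = {
    "chi_sim": "china-s",
    "chi_tra": "china-t",
    "jpn": "japan",
    "kor": "korea",
}

# Placeholder name for whatever font is used today for all other text.
DEFAULT_FONT = "default"


def font_for_word(word):
    """Return the font name for one word; unmapped languages use the default."""
    return FONT_FOR_LANG.get(word.get("lang"), DEFAULT_FONT)


def fonts_for_page(words):
    """Return the sorted list of distinct fonts needed by the words on a page."""
    return sorted({font_for_word(word) for word in words})


# Hypothetical word data for a single page.
page_words = [
    {"text": "你好", "lang": "chi_sim"},
    {"text": "hello", "lang": "eng"},
    {"text": "こんにちは", "lang": "jpn"},
    {"text": "مرحبا", "lang": "ara"},
]

for font in fonts_for_page(page_words):
    print(font)
```

Arabic is deliberately left unmapped here because the table has no entry for it, so those words fall back to the default until that question is settled. Collecting the distinct fonts per page means a page would only pull in the fonts its own words need.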
(Old bug: https://git.archive.org/merlijn/archive-pdf-tools/-/issues/4)
There is an old branch here that implements the concept:
https://github.com/internetarchive/archive-pdf-tools/tree/show-text-on-selection