rougier / freetype-py

Python binding for the freetype library

[Question] How to decode Face.postscript_name? #153

Closed moi15moi closed 1 year ago

moi15moi commented 2 years ago

Face.postscript_name can return bytes.

Is there any way to convert that to a string?

rougier commented 2 years ago

Maybe Face.postscript_name.decode("utf-8")?

moi15moi commented 2 years ago

A PostScript name could be in any encoding, so always assuming UTF-8 is not a good idea.

HinTak commented 2 years ago

Shouldn't PostScript names containing non-ASCII characters be escaped the PostScript way (hex with a prefix)? Read the PostScript reference manual.

moi15moi commented 2 years ago

"Read the postscript reference manual." Where is it available?

Or is there any way to get the name of the encoding of postscript_name? With that name, I could easily convert it to a string.

moi15moi commented 1 year ago

I would also like to know how to decode an SFNT name:

import freetype
face = freetype.Face("F5AJJI3A.TTF")

for i in range(face.sfnt_name_count):
    name = face.get_sfnt_name(i)
    print(name.string) # can return bytes

Here is a font example: https://mega.nz/file/S9ERDRpQ#bcPhS06kv-D5jt64aTNDbZVd6gZr6ZfJDYT91yYsoWk

rougier commented 1 year ago

For the SFNT name, see https://freetype.org/freetype2/docs/reference/ft2-sfnt_names.html. For postscript_name, see https://freetype.org/freetype2/docs/reference/ft2-base_interface.html#ft_get_postscript_name=

HinTak commented 1 year ago

The PostScript name is in plain ASCII; the SFNT name is in SJIS encoding, as the combination of platform/encoding/language IDs says. You need to call one of Python's decoding functions to decode the bytes as SJIS.

The PostScript name is Fj-Ima310; the SFNT name should decode to "Fjイーマ310" from "Fj\x83C\x81[\x83}310".

HinTak commented 1 year ago

In your code above, you also need to read name.platform_id, name.encoding_id and name.language_id before deciding how to decode name.string in general.
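A minimal sketch of that advice: collect the three IDs next to the raw bytes of each record, so the decoding decision can be made afterwards. The helper name `iter_sfnt_names` is hypothetical; it relies only on `sfnt_name_count`, `get_sfnt_name`, and the record attributes already used in this thread.

```python
from typing import Iterator, Tuple


def iter_sfnt_names(face) -> Iterator[Tuple[int, int, int, bytes]]:
    """Yield (platform_id, encoding_id, language_id, raw_bytes) for each name record.

    `face` is assumed to behave like freetype.Face: only `sfnt_name_count`
    and `get_sfnt_name(i)` are used, as in the loop posted earlier.
    """
    for i in range(face.sfnt_name_count):
        name = face.get_sfnt_name(i)
        # Keep the bytes undecoded; the IDs tell us which codec to try later.
        yield name.platform_id, name.encoding_id, name.language_id, name.string
```

With freetype-py installed, this could be driven by something like `for record in iter_sfnt_names(freetype.Face("F5AJJI3A.TTF")): print(record)`.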

HinTak commented 1 year ago

>>> name = face.get_sfnt_name(1)
>>> print((name.string).decode("sjis"))
Fjイーマ310
>>> print(name.encoding_id)
0
>>> print(name.language_id)
11
>>> print(name.platform_id)
1

(1, 0, 11) is Japanese SJIS. A table linked from https://freetype.org/freetype2/docs/reference/ft2-sfnt_names.html tells you what (platform, encoding, language) = (1, 0, 11) means. You basically need to check that the combination is (1, 0, 11) before passing "sjis" as the decode argument.

HinTak commented 1 year ago

Extracted from the freetype doc -

#define TT_PLATFORM_MACINTOSH   1
#define TT_MAC_ID_ROMAN         0
#define TT_MAC_LANGID_JAPANESE  11

HinTak commented 1 year ago

Some of the other entries in this font look broken.

>>> name = face.get_sfnt_name(8)
>>> print(name.platform_id)
3
>>> print(name.encoding_id)
2
>>> print(hex(name.language_id))
0x411

#define TT_PLATFORM_MICROSOFT        3
#define TT_MS_ID_SJIS                2
#define TT_MS_LANGID_JAPANESE_JAPAN  0x0411

This suggests it is in SJIS too. However, it won't decode as SJIS; it needs to be decoded as UTF-16-BE instead:

>>> print(name.string.decode("sjis"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0xfc in position 7: illegal multibyte sequence
>>> print(name.string.decode("utf-16-be"))
Fjイーマ310
>>> 

I.e., this font is slightly broken in some of its SFNT names.
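The session above can be turned into a small helper that tries the codec the IDs suggest and only then falls back to UTF-16-BE. This is a sketch under assumptions: the function name `decode_with_fallback` is hypothetical, and the choice of UTF-16-BE as the fallback is simply what happened to work for this font.

```python
def decode_with_fallback(raw: bytes, declared: str, fallback: str = "utf-16-be") -> str:
    """Decode `raw` with the codec its IDs suggest; on failure, try `fallback`."""
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # The record's IDs did not match the actual bytes, so try the fallback.
        # If this also fails, the UnicodeDecodeError propagates to the caller.
        return raw.decode(fallback)


# The same name in the two encodings discussed above.
sjis_bytes = "Fjイーマ310".encode("shift_jis")
utf16_bytes = "Fjイーマ310".encode("utf-16-be")

print(decode_with_fallback(sjis_bytes, "shift_jis"))
print(decode_with_fallback(utf16_bytes, "shift_jis"))
```

A caveat on the design: a wrong codec does not always raise. It can also succeed and silently produce garbage, so a fallback like this is a heuristic rather than a guarantee.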

moi15moi commented 1 year ago

I know that the encoding depends on platform_id and encoding_id (if platform_id = 0, it also depends on language_id).

My problem is that I don't understand how to get the right encoding from these parameters. Which method should I call? In your example, you hardcoded sjis.

But if I remove all the SFNT names except the ones from platform ID 3, I can't decode it. Still, I get the correct name with Windows and with libass, which uses FreeType: https://mega.nz/file/Oo9ygbJA#7Ri7rlZ0oCxS6slXtfxIpP_VJa1HE6h24PcRRtGDr0E

HinTak commented 1 year ago

FreeType itself does not care about text encodings. It is a library for mapping text in any encoding (sometimes localized, and sometimes even custom, like a private collection of symbols) to shapes. There is nothing inside FreeType to call.

There are a few combinations of platform/encoding/language that mean Unicode (the newer common standard). The rest mean the corresponding localised encoding from the 1980s, before Unicode. Japan used SJIS, and still does.

As I said, this particular font is buggy in the (3,2,0x411) strings. (3,2,0x411) is Japanese and SJIS, but the bytes are wrongly in UTF-16-BE.

There is no quick way of setting the decoding parameter, given there are 10+ common localised encodings (CJK alone accounts for 4, e.g. simplified Chinese = gb18030 vs traditional = big5). The logic is fairly messy:

If (combo = one of the unicode ones)
    Do unicode
Else if (combo is one of lang1)
    Do lang1
Else if (combo is one of lang2)
    ...
etc.
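In Python, that chain can be flattened into a lookup table. The sketch below is deliberately incomplete and illustrative: the names `CODECS`, `OVERRIDES`, `guess_codec` and `decode_name` are hypothetical, only combinations mentioned in this thread (plus the Unicode platform) are listed, and the (1, 0, 11) override reflects what was observed for this font rather than a general rule.

```python
from typing import Optional

# (platform_id, encoding_id) -> Python codec name. Incomplete on purpose.
CODECS = {
    (3, 1): "utf-16-be",  # Microsoft platform, Unicode
    (3, 2): "shift_jis",  # Microsoft platform, SJIS (TT_MS_ID_SJIS)
    (1, 0): "mac_roman",  # Macintosh platform, Roman (TT_MAC_ID_ROMAN)
}

# (platform_id, encoding_id, language_id) -> codec, checked before CODECS.
OVERRIDES = {
    (1, 0, 11): "shift_jis",  # Macintosh, Roman, Japanese: as seen in this thread
}


def guess_codec(platform_id: int, encoding_id: int, language_id: int) -> Optional[str]:
    """Return a codec name for the ID combination, or None if it is not listed."""
    if platform_id == 0:  # Unicode platform: treated as UTF-16-BE here
        return "utf-16-be"
    override = OVERRIDES.get((platform_id, encoding_id, language_id))
    if override is not None:
        return override
    return CODECS.get((platform_id, encoding_id))


def decode_name(raw: bytes, platform_id: int, encoding_id: int, language_id: int) -> Optional[str]:
    """Decode one name record, or return None if the codec is unknown or fails."""
    codec = guess_codec(platform_id, encoding_id, language_id)
    if codec is None:
        return None
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None
```

Returning None instead of raising leaves it to the caller to decide what to do with off-spec records such as the (3,2,0x411) one in this font.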

As I write for a 3rd time now, this particular font is buggy in its (3,2,0x411) name strings. Anyway, you can run "ftdump -n ..." on most fonts, and ftdump will try to decode all the strings to Unicode or hex for you. (ftdump is one of the freetype2-demo programs written by the FreeType people to demonstrate the FreeType APIs; it is available on most Linux platforms and buildable for Windows too.) The actual decoding routine, "Print_Sfnt_Names", is about 110 lines (the total is ~1400, so about 10% of it!), from about line 340 onwards, and it is not a pretty thing: it is a few large, nested "switch (x_id) case: ..." statements.

Consider that even the FreeType people need about 110 lines of C code to demonstrate how to decode the SFNT names, and that code only converts the UTF-16-BE ones and does nothing for the others. UTF-16-BE is special: it is the native encoding of the first Apple Mac in the 1980s, when TrueType was created. That's your answer: you need to copy those 110 lines of C code, convert them to Python, and add a few lines to decode arbitrary names for arbitrary fonts, if that's your goal.

I'll write it a 4th time: this particular font is buggy (i.e. off-spec) in the names department. Don't use it for testing your code in this area.

HinTak commented 1 year ago

I have just been reminded in #156 that our examples directory contains a Python version of ftdump, ftdump.py: https://github.com/rougier/freetype-py/blob/master/examples/ftdump.py. You can see the piles of "if ... elif ..." in the name decoding part.

moi15moi commented 1 year ago

Ok, thank you.

The font is not really "buggy", but it is a special case. With a modified version of fonttools, I can decode everything correctly.

HinTak commented 1 year ago

The font is buggy. The platform/encoding/language tags for the SFNT names don't reflect their encoding correctly. Maybe it is not seriously buggy, but it is buggy nonetheless. If fonttools shows every string in human-readable form (more than "ftdump -n" is able to show), then it is behaving in a friendly though off-spec (i.e. buggy) manner.

moi15moi commented 1 year ago

Since the FreeType and Adobe documentation say that a PostScript name should only contain ASCII characters, this seems to be a solution:

if font.postscript_name is not None:
    try:
        decoded_postscript_name = font.postscript_name.decode("ASCII")
    except UnicodeDecodeError:
        print("The font you specified contains an invalid PostScript name")

HinTak commented 1 year ago

BTW you can see "copyright 1998" for this particular font. Some of the specs/docs were written later.

It is a work-around: font designers and font editing software do all sorts of things until the community (font creators and font-consuming tech) reaches consensus about what is good and what to avoid, and the spec gets updated to reflect that consensus. Often old buggy fonts, which are sufficiently useful nonetheless, do not get updated.

I think "contains ASCII only" is a recommendation. Many fonts were created with non-ASCII names (for non-English markets such as, in this case, Japan) before that was declared poor practice.

HinTak commented 1 year ago

The PostScript reference manual is freely downloadable from Adobe.

I am not quite sure what you are asking now. If it is an encoding issue, then as I said, the correct way is in the reference. If it is a missing API issue (not all FreeType routines are exposed in freetype-py), then we can add it, though I doubt that's the case, since getting at the PostScript name is quite old functionality and should be in. If it is a lack of documentation, consulting upstream (FreeType's) is in order.

Lastly, for some fonts, it is also possible that the font creator mistakenly put off-spec bytes/encoding there.

Actually, looking at examples/ftdump.py (in the examples directory of the freetype-py source), it should just work. What exactly is your problem? The PostScript # hex encoding one?

JeremieBergeron commented 1 year ago

If it is an encoding issue, it is as I said, the correct way is in the reference

The Adobe reference doesn't say how to decode it. It only says how to create a PostScript name.

what exactly is your problem?

Face.postscript_name can return bytes. It should always return a string.

It seems FreeType always returns ASCII bytes, so I think freetype-py should do this:

if font.postscript_name is not None:
    try:
        decoded_postscript_name = font.postscript_name.decode("ASCII")
    except UnicodeDecodeError:
        print("The font you specified contains an invalid PostScript name")

HinTak commented 1 year ago

@JeremieBergeron I have already pointed out that the correct way to interpret those bytes is as in the ftdump.py example. The example does return a string. It is not a neat two-line answer, but it is the answer. The fact that this particular code does not work on this particular font is because this particular font is buggy, as in off-spec. That the font still (partially) works (in some circumstances, for some usage) is beside the point. Some other part of the font is not buggy; that's what you are really claiming.

HinTak commented 1 year ago

If you are proposing copying that 100 lines of ftdump.py as a wrapper into the core, that's debatable.

JeremieBergeron commented 1 year ago

Why are you talking about ftdump.py?

It does not decode the bytes: https://github.com/rougier/freetype-py/blob/00842126e08a98efc4c550ab667373c4ea4b8154/examples/ftdump.py#L34

HinTak commented 1 year ago

I am not sure what you are asking here. There is an implicit conversion on print. As I explained quite a few times, localised names are as done in ftdump. If the font name is not ascii, it is not ascii. Blindly converting to ascii seems wrong.

There is a better way of encoding localised names (and some font vendors still get it wrong). Historically the PostScript name is anything that the font vendor put there, and it works for their intended purpose. It looks as if font vendors put ASCII names, localized names (for the intended locale), recently UTF-8 names in some cases, and PostScript-encoded hex in others. What it should be was added later.

If you think the conversion to ASCII should be done, it could be added on the client side...
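A client-side conversion along those lines might look like the sketch below. The helper name `postscript_name_to_str` is hypothetical and not part of freetype-py; it assumes the value is either None, already a str, or bytes that should be ASCII, and it keeps off-spec bytes visible as escapes instead of guessing a codec.

```python
from typing import Optional, Union


def postscript_name_to_str(raw: Union[bytes, str, None]) -> Optional[str]:
    """Convert a postscript_name value to str without guessing an encoding."""
    if raw is None:
        return None
    if isinstance(raw, str):
        return raw
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # Off-spec name: show non-ASCII bytes as \xNN escapes rather than
        # silently picking a localized codec.
        return raw.decode("ascii", errors="backslashreplace")


print(postscript_name_to_str(b"Fj-Ima310"))
```

Callers that prefer to reject off-spec names outright could raise in the `except` branch instead, as in the snippet proposed earlier in the thread.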

JeremieBergeron commented 1 year ago

I am talking about postscript_name. The FreeType documentation says: "Retrieve the ASCII PostScript name of a given face, if available. This only works with PostScript, TrueType, and OpenType fonts."

So, it always returns ASCII bytes.

Of course, this won't work if I were trying to directly decode a name in the os2 table, but that's totally different (also, the code in ftdump does not always retrieve the right encoding; see what fonttools has done).

HinTak commented 1 year ago

In the case of it being completely normal and ascii, print(font.postscript_name.decode("ASCII")) and print(font.postscript_name) are not that different, visually. One might argue not to convert - python 3 strings internally are not single byte representations, so that will surprise some other people.
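To make the comparison concrete, here is a small sketch of what Python 3 does with the two forms, using the PostScript name quoted earlier in the thread as sample data.

```python
raw = b"Fj-Ima310"          # what postscript_name may return
text = raw.decode("ascii")  # the explicitly decoded form

print(raw)   # printed via the bytes repr, with a b'...' wrapper
print(text)  # printed as plain text

# The two objects are distinct types and never compare equal in Python 3.
same = raw == text
as_printed = str(raw)
```

The printed forms differ only by the `b'...'` wrapper, but comparisons, dictionary lookups and string formatting treat the two values differently, which is where undecoded bytes tend to surprise client code.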