rougier / freetype-py

Python binding for the freetype library

`load_char` failed for custom fonts with right input #178

Closed NewUserHa closed 9 months ago

NewUserHa commented 9 months ago
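import freetype

face = freetype.Face(r"")  # font path left empty in the original report
n = 236  # 0xec, the code FontForge shows for the glyph
face.load_char(n)  # raises FT_Exception: (invalid argument)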

I got a character code, say n = 236 (0xec), from FontForge, but load_char throws FT_Exception: (invalid argument). However, if n is anywhere from 0 up to the count of used slots, it works (just like load_glyph).

The glyphs have names, but it seems that freetype can't get a glyph by its name.

How can I fix this, or load a glyph by its name?

HinTak commented 9 months ago

Without an actual code snippet, it is difficult to tell what you are trying to do. Freetype-py is a fairly simple ctypes wrapper around upstream, so you need to read the upstream documentation. I believe you need to read about FT_Get_Name_Index and see whether it does what you want: https://freetype.org/freetype2/docs/reference/ft2-information_retrieval.html#ft_get_name_index

Btw, in Python 3 you need to pass bytes objects (b'name') as arguments to freetype routines that take C strings.
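
A minimal sketch of the bytes requirement, using only the standard ctypes module rather than freetype itself. `as_c_string` is a hypothetical helper (not part of freetype-py), and the ASCII encoding is an assumption that holds for typical PostScript-style glyph names such as "G25".

```python
import ctypes


def as_c_string(name):
    """Return `name` as bytes, suitable for a ctypes `c_char_p` argument.

    Hypothetical helper, not part of freetype-py. ASCII is an assumption
    that fits typical glyph names; adjust it if your names differ.
    """
    if isinstance(name, str):
        return name.encode("ascii")
    return bytes(name)


# ctypes accepts bytes where the C side expects a `char *` ...
ptr = ctypes.c_char_p(as_c_string("G25"))
print(ptr.value)

# ... but refuses a Python 3 str.
try:
    ctypes.c_char_p("G25")
except TypeError as exc:
    print("str rejected:", exc)
```

The same rule applies to any wrapped C function that takes a C string, which is why the glyph name has to be written as b'...' or encoded first.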

I'll close this for now, as this should answer your question.

HinTak commented 9 months ago

load_char takes the character value (these days, for most people, the unicode value; see addendum below) as a number, mostly, I think.

Addendum: I shall not confuse you with localised encoding like big5, gbk and jis etc. Officially load_char really takes the character value in the current active/default encoding of the font. For most recent usage that's the unicode value, but really it is the current encoded value in the default/current encoding.

HinTak commented 9 months ago

I checked that you can use FT_Get_Name_Index in freetype-py, https://github.com/rougier/freetype-py/blob/83bf5d32cd296795bb790f4fa89fc85c78f50630/freetype/raw.py#L110 .

I think you just use it like glyph_id = FT_Get_Name_Index(face._FT_Face, b'myglyname'). (The face._FT_Face construct is just to get at the lower-level handle.) There might be a more Pythonic way of doing this elsewhere in freetype-py that I am not aware of, such as a face.name_index routine, maybe.
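
The lookup described above could be wrapped roughly as follows. This is a sketch, not freetype-py's own code: `load_glyph_by_name` is a hypothetical helper, and it assumes the face object offers a `get_name_index()` method wrapping FT_Get_Name_Index. If your version lacks it, the raw call `FT_Get_Name_Index(face._FT_Face, name)` quoted above is the fallback.

```python
def load_glyph_by_name(face, glyph_name):
    """Load the glyph called `glyph_name` and return its glyph index.

    Assumptions: `face` behaves like a freetype-py Face, with
    get_name_index(bytes) -> int and load_glyph(int).
    """
    if isinstance(glyph_name, str):
        glyph_name = glyph_name.encode("ascii")  # C strings must be bytes
    glyph_index = face.get_name_index(glyph_name)
    # FreeType uses index 0 both for "not found" and for the .notdef glyph.
    if glyph_index == 0 and glyph_name != b".notdef":
        raise KeyError(glyph_name)
    face.load_glyph(glyph_index)
    return glyph_index


# Usage sketch (the font path is hypothetical):
#   import freetype
#   face = freetype.Face("custom-font.pfb")
#   load_glyph_by_name(face, "G25")
```

Depending on the font format, a size may need to be set on the face before `load_glyph` succeeds with the default load flags.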

NewUserHa commented 9 months ago

Thanks for your reply, I'll try FT_Get_Name_Index. (It didn't show up in the auto-complete list.)

I found that load_char actually takes an integer as input.

I just wondered why FontForge can display the correct integer/position while freetype can't, when neither of them had a cmap file, and thought it may be a bug.

NewUserHa commented 9 months ago

Should I report to upstream, as a bug, that freetype can't read the correct glyph positions the way FontForge does?

HinTak commented 9 months ago

FontForge, as a font editor, probably will try very hard to let you access/manipulate incomplete font structures. Freetype has a slightly higher expectation of the font being valid / complete. Anyway, load_glyph should always work, and you should be able to go through the whole range from 0 to max, if desperate.
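
Walking the whole glyph range might look like the sketch below. It only collects glyph names, so that each index can afterwards be passed to load_glyph. `glyph_names_by_index` is a hypothetical helper; it assumes the face exposes `num_glyphs` and a `get_glyph_name(index)` method returning bytes, and that the font actually carries glyph names.

```python
def glyph_names_by_index(face):
    """Map every glyph index in `face` to its glyph name.

    Assumptions: `face` behaves like a freetype-py Face with `num_glyphs`
    and get_glyph_name(index) -> bytes, and the font has glyph names.
    """
    names = {}
    for index in range(face.num_glyphs):
        raw_name = face.get_glyph_name(index)
        names[index] = raw_name.decode("ascii", "replace")
    return names


# Usage sketch (hypothetical path):
#   import freetype
#   face = freetype.Face("custom-font.pfb")
#   for index, name in glyph_names_by_index(face).items():
#       print(index, name)
```

Inverting the resulting dict gives a name-to-index table, which is handy when a PDF refers to glyphs by name.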

NewUserHa commented 9 months ago

But if load_glyph is the last resort, then it's hard to use the font to extract text.

However, the glyph positions FontForge reports are all correct.

Luckily this font has glyph names and the PDF has a cidmap, but what about a font that has neither?

I'm not familiar with fonts, so I don't know whether freetype can get the correct positions as well.

HinTak commented 9 months ago

If you are thinking of extracting text from pdf, I think it is a generally unsolvable problem - there needs to be some way of mapping glyph id to char code, be it cmap or cidmap.

Very old PDFs sometimes don't have this, so it is not possible. Somewhat more recent ones have an implied but undocumented Identity cidmap, which just says that the char code is the same as the glyph id. (I.e. the glyph id is basically the unicode value, or the localised encoding value, if you are dealing with a localised PDF.)
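
As a small illustration of the two cases, the sketch below resolves a char code either through an explicit table or through the Identity assumption. Both `glyph_id_for` and the `cid_to_gid` dict are hypothetical stand-ins for whatever a real PDF parser would provide.

```python
def glyph_id_for(char_code, cid_to_gid=None):
    """Resolve a char code to a glyph id, or None if it is unmapped.

    `cid_to_gid` is a hypothetical dict standing in for a parsed
    cmap/cidmap. When it is None, the Identity assumption applies:
    the char code is used directly as the glyph id.
    """
    if cid_to_gid is None:
        return char_code
    return cid_to_gid.get(char_code)


print(glyph_id_for(37))           # Identity: the code is the glyph id
print(glyph_id_for(37, {37: 5}))  # explicit table lookup
```

When neither a table nor the Identity convention applies, there is nothing to look up, which is the unsolvable case described above.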

NewUserHa commented 9 months ago

yes, it is an unsolvable problem.

I'm trying to use OCR to map those custom char code to real characters.

the pdf I have has a cidmap that maps char codes to glyph names (like /37/G25). somehow, fontforge can display that too (like 37 (0x25) "G25") without any cidmap input.

HinTak commented 9 months ago

You probably want to look at mupdf / mutool and pymupdf for that sort of thing. It has text extraction and OCR api's for pdf's.

NewUserHa commented 9 months ago

Thanks for the reply.

I checked those. mupdf: it doesn't seem to have any info about OCR in its documentation. mutool: it seems to OCR the entire PDF page. pymupdf: I found https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/OCR/tesseract1.py, which does OCR line by line.

But the PDF I have contains code, and OCR over entire lines or pages has issues with brackets (square brackets, angle brackets).

NewUserHa commented 9 months ago

@HinTak Finally, after discussing this at the freetype repo, I found that the original issue is that the auto-loaded face.charmap is garbage data, and the charmap has to be set manually to charmaps[0] (the font has an Adobe custom charmap and num_charmaps is 1). So I think this probably is a bug, and I'm replying here again.
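
The workaround reported here could be sketched as below. `ensure_charmap` is a hypothetical helper; it assumes the face offers a `charmaps` list and a `set_charmap()` method, and it deliberately never reads `face.charmap`, since that is the value reported as garbage.

```python
def ensure_charmap(face):
    """Activate the first charmap and return it, or None if there is none.

    Assumptions: `face` behaves like a freetype-py Face with a `charmaps`
    list and set_charmap(charmap). Picking index 0 mirrors the fix
    reported in this thread; it is a choice of encoding, not a rule.
    """
    charmaps = face.charmaps
    if not charmaps:
        return None
    face.set_charmap(charmaps[0])
    return charmaps[0]


# Usage sketch (hypothetical path; 236 is the code from the report):
#   import freetype
#   face = freetype.Face("custom-font.pfb")
#   if ensure_charmap(face) is not None:
#       face.load_char(236)
```

With a charmap active, load_char interprets its argument in that charmap's encoding, so 236 here means the Adobe custom code rather than a Unicode value.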

HinTak commented 9 months ago

I would argue it isn't a bug - it is as I commented earlier, FontForge can cope with incomplete or work-in-progress fonts, freetype expects largely valid fonts. There is only so much inconsistency or brokenness it will try to cope with; so rejecting/refusing to load a broken font, or a broken part of a font, is not a bug.

NewUserHa commented 9 months ago

But "FT_Face->charmap is zero-initialized before any action on Unicode is taken.", and I found face.charmap != face.charmaps[0]

apodtele commented 9 months ago

These are not broken fonts. They simply lack a Unicode charmap. FreeType presents them with FT_Face->charmap = NULL. The issue is that freetype-py presents them with garbage in face.charmap. That is not exactly how bindings are supposed to behave. Please zero-initialize face.charmap. That is all you need to do to fairly mimic FreeType.

apodtele commented 9 months ago

FontForge falls back on FT_Face->charmaps[0], i.e. whatever encoding that happens to be. A sane program should force the user to make this choice explicitly, if they really mean and care about the encoding.

HinTak commented 9 months ago

I doubt that. Anyway, if you want to look at that, the code to modify etc. is probably around: https://github.com/rougier/freetype-py/blob/83bf5d32cd296795bb790f4fa89fc85c78f50630/freetype/__init__.py#L2042

HinTak commented 9 months ago

Or this:

https://github.com/rougier/freetype-py/blob/83bf5d32cd296795bb790f4fa89fc85c78f50630/freetype/__init__.py#L1962

Freetype-py is just reading face->charmap or face->charmaps

apodtele commented 9 months ago

Can you check if both of these definitions are actually executed when the Python class is created?

FT_ALLOC here means that everything is zeroed initially. It cannot be non-zero invalid pointer.

NewUserHa commented 9 months ago

Here's the font: QGNGZCFzBookMaker1.patch (the extension is just for bypassing GitHub).

HinTak commented 9 months ago

Pull requests welcome, if somebody (else) wants to work on it.

apodtele commented 9 months ago

I am not a Python person but I see that FT_Charmap is not the same as FT_CharMap in FreeType. Therefore, instead of

family_name = property(lambda self: self._FT_Face.contents.family_name,

there is more complex processing

charmap = property( _get_charmap,

In other words, it is not a straight copy. Then

return Charmap( self._FT_Face.contents.charmap)

probably chokes on NULL.

HinTak commented 9 months ago

self._FT_Face.contents.charmap is a straightforward copy. It is ctypes' way of saying ..._FT_Face->charmap in C.

apodtele commented 9 months ago

Yes. But what is Charmap.__init__ doing?

HinTak commented 9 months ago

It is a straightforward copy:

https://github.com/rougier/freetype-py/blob/83bf5d32cd296795bb790f4fa89fc85c78f50630/freetype/__init__.py#L531

apodtele commented 9 months ago

I see that but I do not see any NULL handling. Are you saying that NULL is copied and everything else in Charmap is ignored automatically? Python magic?

apodtele commented 9 months ago

Note that _get_charmaps does not have to handle NULL because num_charmaps protects it. On the other hand, _get_charmap has to handle NULL, which I do not see.

apodtele commented 9 months ago

Shouldn't there be a None assignment when the input is NULL? I just read about it, but I am not an expert.
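
To make the question concrete, here is a sketch of NULL handling in plain ctypes. It does not use freetype-py: `CharMapRec` and `FaceRec` are cut-down stand-ins for the real FreeType structures, and `get_charmap` is a hypothetical getter showing one way a binding could turn a NULL pointer into None.

```python
import ctypes


class CharMapRec(ctypes.Structure):
    """Stand-in for FT_CharMapRec; only a few fields, for illustration."""
    _fields_ = [("encoding", ctypes.c_uint),
                ("platform_id", ctypes.c_ushort),
                ("encoding_id", ctypes.c_ushort)]


class FaceRec(ctypes.Structure):
    """Stand-in for FT_FaceRec holding just the `charmap` pointer."""
    _fields_ = [("charmap", ctypes.POINTER(CharMapRec))]


def get_charmap(face_rec):
    """Return the active charmap record, or None when the pointer is NULL."""
    pointer = face_rec.charmap
    if not pointer:  # a NULL ctypes pointer evaluates as False
        return None
    return pointer.contents


# A freshly created ctypes structure is zero-filled, so charmap is NULL.
unset = FaceRec()
print(get_charmap(unset))

# Illustrative values only: tag 'ADBC', platform 7, encoding id 2.
record = CharMapRec(encoding=int.from_bytes(b"ADBC", "big"),
                    platform_id=7, encoding_id=2)
active = FaceRec(charmap=ctypes.pointer(record))
print(get_charmap(active).platform_id)
```

ctypes does not convert NULL to None by itself for typed pointers, so the truthiness check has to be written explicitly in the getter.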