Soft hyphens are searchable as spaces

ghost commented 1 year ago

Input:

birth\u{ad}day
birthday // also contains a soft hyphen

In the resulting PDF, both soft hyphens are searchable as spaces (birth day). The expected result would be that the word is searchable either as birthday with no soft hyphen included, or as birthday with the soft hyphen (U+00AD) included. Including a space instead changes the semantics of the text.

With LuaTeX, soft hyphens are not included in the searchable text. Maybe related: https://github.com/typst/typst/issues/479

Enivex commented 4 months ago

The same is true for zero width space.

TS60 commented 1 month ago

I debugged it as follows:

write_normal_text has the correct soft hyphen char for the soft hyphen glyph in the glyph_set.
By the time create_cmap runs, the glyph_set already has a normal space for the soft hyphen glyph.

Somewhere between these two, the text stored in the glyph_set for the soft hyphen glyph is replaced: the correct char becomes a normal space.

With the normal improve_glyph_sets, the result is a space for the SHY.
Returning on the first line of improve_glyph_sets (so it does nothing) gives the correct text at create_cmap, but now normal spaces are missing or even replaced by SHY.

Also the PDF cmap seems to correctly contain the SHY after the change to improve_glyph_sets (00AD is the SHY):

9 beginbfchar
<0003> <00AD>
<0044> <0061>
<0045> <0062>
<0047> <0064>
<004B> <0068>
<004C> <0069>
<0055> <0072>
<0057> <0074>
<005C> <0079>
endbfchar
endcmap
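
To make the dump above easier to read, here is a minimal sketch that decodes the bfchar entries into glyph ID to character pairs. The parser is hypothetical (it is not part of Typst) and only handles single code point destinations, which is all this dump contains.

```python
import re

# One bfchar entry: <glyph ID> <UTF-16 code unit>, both as 4 hex digits.
BFCHAR_LINE = re.compile(r"<([0-9A-Fa-f]{4})>\s+<([0-9A-Fa-f]{4})>")


def parse_bfchar(block: str) -> dict[int, str]:
    """Map glyph IDs to characters for simple one-code-point bfchar entries."""
    mapping = {}
    for line in block.splitlines():
        match = BFCHAR_LINE.fullmatch(line.strip())
        if match:
            glyph_id, code_point = (int(group, 16) for group in match.groups())
            mapping[glyph_id] = chr(code_point)
    return mapping


CMAP_DUMP = """\
9 beginbfchar
<0003> <00AD>
<0044> <0061>
<0045> <0062>
<0047> <0064>
<004B> <0068>
<004C> <0069>
<0055> <0072>
<0057> <0074>
<005C> <0079>
endbfchar
"""

decoded = parse_bfchar(CMAP_DUMP)
for glyph_id, char in sorted(decoded.items()):
    print(f"glyph 0x{glyph_id:04X} -> U+{ord(char):04X}")
```

Glyph 0x0003 is the only entry that is not a letter of "birthday"; after the change it carries U+00AD.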

In PDF readers I can now copy the SHY as expected. But as said above, normal spaces no longer copy as expected. I think we need a special case in improve_glyph_sets: behave as before, except for SHY and ZWS.

Also, the word now seems to be created as multiple text blocks, and it is only searchable up to the first SHY. With LuaTeX it is all in one block and searchable:

\documentclass[a4paper]{article}

\begin{document}

birthday birthday

\end{document}

Maybe you have an idea about it, @laurmaedje?

laurmaedje commented 1 month ago

The problem is that harfbuzz outputs the same glyph for both U+20 (SPACE) and U+AD (SHY), just with different advance width. And since the PDF content is just glyphs + a glyph -> char mapping, it will copy the same char. I don't think we can fix this by changing improve_glyph_sets.
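
A toy model of that collision, as a sketch only: the glyph IDs and the helper names are made up for illustration, and the real data comes from the shaper and the font. With one entry per glyph, whichever character is recorded for the shared glyph wins for every occurrence.

```python
SPACE, SHY = "\u0020", "\u00ad"
SPACE_GLYPH = 3  # hypothetical glyph ID that the shaper emits for both characters

# (glyph ID, source character) pairs for the input "a b" followed by a SHY.
shaped = [(68, "a"), (SPACE_GLYPH, SPACE), (69, "b"), (SPACE_GLYPH, SHY)]


def build_glyph_to_char(pairs, keep_first):
    """Build a one-entry-per-glyph table, like a simple glyph -> char mapping."""
    table = {}
    for glyph_id, char in pairs:
        if keep_first and glyph_id in table:
            continue  # the first character seen for this glyph wins
        table[glyph_id] = char  # otherwise the last one seen wins
    return table


def extract(pairs, table):
    """Simulate copy/search: a reader only sees glyph IDs and the table."""
    return "".join(table[glyph_id] for glyph_id, _ in pairs)


space_wins = build_glyph_to_char(shaped, keep_first=True)
shy_wins = build_glyph_to_char(shaped, keep_first=False)
print(repr(extract(shaped, space_wins)))  # the SHY comes out as a space
print(repr(extract(shaped, shy_wins)))    # the space comes out as a SHY
```

Either choice breaks one of the two characters, which matches both observations above: spaces for SHY by default, SHY for spaces once improve_glyph_sets is bypassed.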

I guess we could try to filter out space glyphs with zero advance width or something like that. But I'm not sure whether that would be correct behaviour. Would need to look closer at what other software is doing.

TS60 commented 1 month ago

Thanks! Indeed, with LuaTeX it seems to vary based on the font (I have not yet found the pattern). Maybe it works only if the font actually has a glyph for a soft hyphen. Otherwise, it just filters it out. It seems the harfbuzz flag that activates this is called REMOVE_DEFAULT_IGNORABLES.

Update: I now tested also other fonts (Arial, Source Code Pro) and they have the same problem.

LuaTeX does not include the soft hyphen in the searchable text, whereas Typst includes a space in its place. The rendered text looks the same for both with the fonts I tested. So I assume LuaTeX just activates the equivalent of REMOVE_DEFAULT_IGNORABLES.

khaledhosny commented 1 month ago

The glyph to code point mapping in the PDF should be based on the input string, not the output glyph alone (using HarfBuzz’s clusters to map glyphs to input code points). However, OpenType and HarfBuzz allow complex glyph to code point mappings (one to one, one to many, many to one and many to many), while PDF cmap allows only one to one and one to many. The same glyph can also be output from different code points, which PDF cmap does not support either (and which is the case here). To get the best text extraction out of PDF, a combination of cmap ToUnicode and ActualText tagging needs to be used. Use cmap whenever possible (if the glyph to code point mapping is unique, and a single glyph is mapped to single or multiple code points) and, if not, fall back to ActualText.
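
A rough sketch of that decision rule, under stated assumptions: clusters are given as (glyph IDs, source text) pairs, the function names are hypothetical, and the glyph IDs are invented. The span helper only formats the marked-content operator that would open an ActualText span, with the text as a UTF-16BE hex string with a byte order mark.

```python
from collections import defaultdict


def plan_text_mapping(clusters):
    """Split clusters into ToUnicode entries and clusters that need ActualText.

    clusters: list of (tuple of glyph IDs, source text) pairs.
    """
    seen = defaultdict(set)
    for glyph_ids, text in clusters:
        if len(glyph_ids) == 1:
            seen[glyph_ids[0]].add(text)

    # A glyph goes into ToUnicode only if it always stands for the same text.
    to_unicode = {
        glyph_id: next(iter(texts))
        for glyph_id, texts in seen.items()
        if len(texts) == 1
    }

    # Everything else (ambiguous glyphs, multi-glyph clusters) needs ActualText.
    needs_actual_text = [
        (glyph_ids, text)
        for glyph_ids, text in clusters
        if len(glyph_ids) != 1 or glyph_ids[0] not in to_unicode
    ]
    return to_unicode, needs_actual_text


def actual_text_span(text):
    """Format the operator that opens a marked-content span with ActualText."""
    hex_utf16 = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_utf16}>>> BDC"


clusters = [((68,), "a"), ((3,), " "), ((69,), "b"), ((3,), "\u00ad"), ((200,), "fi")]
to_unicode, needs_actual_text = plan_text_mapping(clusters)
print(to_unicode)
for glyph_ids, text in needs_actual_text:
    print(glyph_ids, actual_text_span(text), "... EMC")
```

In this toy input the shared glyph 3 is kept out of ToUnicode entirely, while the one-to-many ligature glyph 200 stays in the cmap.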

Note that LuaTeX will do this differently when using luaotfload’s default shaper and when using HarfBuzz. The latter should support more complex cases, so check that too when comparing LuaTeX PDFs with typst’s.

laurmaedje commented 1 month ago

Thanks for the explanation! We already collect the reverse mapping from the cluster information, but don't yet write an ActualText if there are two different mappings. Good to know that that's the way to go!

TS60 commented 1 month ago

Thanks for the information! In the LuaTeX PDF (compression turned off) I cannot find /ActualText, and it still works. If I remember correctly, /ActualText is very poorly supported by PDF readers and thus might not help for most of them. If Typst behaved like LuaTeX here, /ActualText could be used in addition, so that poor PDF readers would still give the correct result.

As I said above, REMOVE_DEFAULT_IGNORABLES sounds very much like the difference between the searchable text in LuaTeX and Typst: "Flag indication that character with Default_Ignorable Unicode property should be removed from glyph string instead of hiding them (done by replacing them with the space glyph and zeroing the advance width.)" (source).

The soft hyphen seems to be such a character and the replacement is a space char. This basically exactly matches the description. I tried to activate this, but have not found the place in the typst source code.
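
To illustrate the two behaviours in that description, here is a toy model. It is not harfbuzz or rustybuzz code: the function, the glyph names and the advance widths are invented, and the character set is a hand-picked subset of the Default_Ignorable code points.

```python
# Hand-picked subset (SHY, ZWSP, ZWNJ, ZWJ); the real Unicode property covers more.
DEFAULT_IGNORABLE = {"\u00ad", "\u200b", "\u200c", "\u200d"}
SPACE_GLYPH = "space"


def shape(text, remove_default_ignorables):
    """Return (glyph name, advance width) pairs for a toy one-glyph-per-char font."""
    glyphs = []
    for char in text:
        if char in DEFAULT_IGNORABLE:
            if remove_default_ignorables:
                continue  # "remove": the character leaves no glyph behind
            glyphs.append((SPACE_GLYPH, 0))  # "hide": space glyph, zero advance
        elif char == " ":
            glyphs.append((SPACE_GLYPH, 250))  # made-up advance for a real space
        else:
            glyphs.append((char, 500))  # made-up advance for every other glyph
    return glyphs


hidden = shape("birth\u00adday", remove_default_ignorables=False)
removed = shape("birth\u00adday", remove_default_ignorables=True)
print(hidden)
print(removed)
```

In the "hide" case the SHY is still present as a space glyph, which is what ends up searchable as a space. In the "remove" case no glyph is left to map back to anything.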

khaledhosny commented 1 month ago

Removing Default_Ignorable might work here, but it will make things worse for other situations. For example, it will also remove ZWNJ, but this changes the meaning of the text in many languages.

TS60 commented 1 month ago

LuaTeX also handles the ZWNJ as expected: the glyph is kept and the searchable text is a space. I think that if harfbuzz/rustybuzz does not have a setting to support this directly, Typst should define one set of chars that should be removed and another set of chars that should be kept and be searchable as a normal space.
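
A minimal sketch of the two-set idea. The membership of each set is an assumption taken from this thread (SHY and ZWS removed, ZWNJ searchable as a space), not something Typst defines, and the function name is hypothetical.

```python
# Assumed membership, based only on the characters discussed in this thread.
REMOVE_FROM_SEARCH_TEXT = {"\u00ad", "\u200b"}  # SHY, zero width space
SEARCHABLE_AS_SPACE = {"\u200c"}  # ZWNJ, following the LuaTeX behaviour described


def searchable_text(source):
    """Derive the text a PDF reader should find for a given source string."""
    result = []
    for char in source:
        if char in REMOVE_FROM_SEARCH_TEXT:
            continue  # dropped entirely from the searchable text
        if char in SEARCHABLE_AS_SPACE:
            result.append(" ")  # kept, but found as a normal space
        else:
            result.append(char)
    return "".join(result)


print(searchable_text("birth\u00adday"))
print(searchable_text("a\u200cb"))
```

The right contents for each set would still need checking against what other software does, since the concern about ZWNJ raised above applies to whichever set it lands in.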

Of course, the /ActualText method can additionally be added later to preserve the original text. I think LuaTeX doesn't do it because, afaik, this requires PDF 1.7, while LuaTeX targets a minimum version of 1.5. Typst uses 1.7 by default, so it can be done here.

typst / typst

Soft hyphens are searchable as spaces #526