Native Unicode U+00AD (soft hyphen) support

As a follow-up to #1918, I'd like soft hyphens (U+00AD) to be properly supported.

The most direct solution is described in https://github.com/sile-typesetter/sile/discussions/1716#discussioncomment-7740054 (other solutions are discussed just before that comment).

I'd like to suggest it for inclusion in SILE, with perhaps two additional settings:

- typesetter.softHyphen (boolean, true by default):
  - If true, do what the above discussion suggests, i.e. replace the soft hyphen with a discretionary node.
  - If false, ignore the soft hyphen.
- typesetter.softHyphenWarning (boolean, false by default): if true, warn whenever a soft hyphen is encountered, regardless of the previous setting.
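To make the intended behavior concrete, here is a minimal Python sketch of the two settings. It is not SILE code: `split_soft_hyphens`, `Discretionary` and the keyword arguments are hypothetical names used only for illustration, and the "discretionary node" is reduced to a simple placeholder object.

```python
import warnings
from dataclasses import dataclass

SOFT_HYPHEN = "\u00ad"


@dataclass
class Discretionary:
    """Placeholder for a discretionary node (hypothetical, not a SILE type).

    The prebreak text is only shown if a line break is taken at this point.
    """
    prebreak: str = "-"


def split_soft_hyphens(text, soft_hyphen=True, soft_hyphen_warning=False):
    """Model the proposed settings on a single chunk of text.

    soft_hyphen=True   -> each U+00AD becomes a Discretionary placeholder
    soft_hyphen=False  -> each U+00AD is dropped and the text stays whole
    soft_hyphen_warning=True -> warn if any U+00AD is present, in both modes
    """
    if soft_hyphen_warning and SOFT_HYPHEN in text:
        warnings.warn(f"soft hyphen (U+00AD) found in {text!r}")

    parts = text.split(SOFT_HYPHEN)
    if not soft_hyphen:
        # Ignore mode: rejoin the fragments so regular hyphenation
        # sees the word as a whole.
        return ["".join(parts)]

    nodes = []
    for index, part in enumerate(parts):
        if index > 0:
            # One discretionary per soft hyphen, between the fragments.
            nodes.append(Discretionary())
        if part:
            nodes.append(part)
    return nodes
```

With `soft_hyphen=True`, the fragments on either side of each placeholder are separate strings, which is also why a hyphenation algorithm running afterwards would see them separately rather than as one word.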

Rationale: When copying text from other sources (Office documents, HTML pages), the text may contain soft hyphens. I ran into this when working on my previous book, which had inputs from several origins.

The main problem with the current behavior (i.e. no special handling, the character is just passed to the shaper) is twofold:

- Some fonts will show a hyphen (e.g. Gentium Plus, at least in some versions) even when the word is not hyphenated. That case is easy to notice when proofreading, but it's wrong of course...
- Even when a font properly has a zero-width glyph here, our ICU logic will consider the character as "breaking" and insert a penalty, which may result in a line break without a hyphen mark. That case is much harder to notice when typesetting a large amount of material (I'm glad a proofreader saw it, but it could have been missed...).

Moreover, the shaper removes them in ligatures (so they are lost in those cases)... There's some asymmetry here!

These points alone would be enough to argue for catching soft hyphens and replacing them with an appropriate discretionary node.

Still I have other concerns:

- For languages where we do have hyphenation patterns, this might not always be the best solution... The parts on either side of a soft hyphen will still go through the hyphenation algorithm, but as separate fragments rather than as a whole word.
- Sometimes in these text sources, people placed a soft hyphen manually (sometimes even in the wrong place) to solve an issue they had... Perhaps they had not enabled hyphenation in their Office software, or hadn't installed the necessary language files; for HTML sources it might even be a mere manual tweak.

So the ability to wholly skip soft hyphens from the input makes sense, because we have a better solution for hyphenation and exceptions.

And the ability to warn about them regardless also makes sense, because they are hard to notice (e.g. in VSCode I can obviously configure the editor to show such characters, but it still requires a lot of scrutiny).
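Until something like this exists in SILE itself, a pre-flight check on the input can be done outside the typesetter. The following is a small Python sketch, with the hypothetical helper name `find_soft_hyphens`, that reports where soft hyphens occur in a piece of text:

```python
SOFT_HYPHEN = "\u00ad"


def find_soft_hyphens(text):
    """Return a list of (line, column) positions of U+00AD, both 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if char == SOFT_HYPHEN:
                hits.append((line_no, column))
    return hits
```

Columns are counted in Unicode code points, so they may not match what an editor displays when the line contains combining characters.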

Lastly, if we agree on this proposal, we'd need to document these settings somewhere. Should we have an extra chapter in the manual, such as "Unicode support & special cases" (e.g. just before the chapter on language support)?

- It could also be a proper place for documenting #1918.
- Maybe one day we'll have some support for other special cases (e.g. U+200E and U+200F for writing direction), and those could eventually be mentioned there too (I've not checked how they currently behave).

What do you think?

Native Unicode U+00AD (soft hyphen) support #1930