sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine
https://sile-typesetter.org
MIT License

Implement Unicode script detection #83

Open khaledhosny opened 9 years ago

khaledhosny commented 9 years ago

The current handling of the 3 properties is sub-optimal and can be improved:

I’m looking for opinions about these proposed changes and code pointers to implement them.

simoncozens commented 9 years ago

First, thanks very much for all your help with the bidi issues; I would not have got there without it! And thanks for your input here - yes, I agree it's all a mess at the moment as it grew rather ad hoc. Would be good to rationalise it.

Starting with the easy one (direction), here's my opinion on how to approach it:

So:

khaledhosny commented 9 years ago

Sorry for not responding earlier; I got really busy with real life in an unexpected way. The above plan sounds good, and you seem to have implemented (some of?) it. Is there anything remaining in this area for me to work on?

simoncozens commented 9 years ago

No worries. I'm trying to move towards a 0.9.2 release (it's been way too long and there've been many major fixes), and the manual didn't build any more, so I needed to do a bit of work on it!

Direction

Some more thoughts from @deepakjois, taken from PR #78:

Some other points for discussion:

  • Given SILE is intended to be a next-gen typesetting package, bidi support should be built-in to one of the default document classes, probably plain. We should not need to import a package to get bidi support.
  • There should be support for typesetting short RTL text inside LTR paragraphs as well (and vice versa), something equivalent to the TeX bidi package's \LRE and \RLE

I was all set to say that turning on bidi support for absolutely everything imposes a huge overhead on the common case, but then I benchmarked it. It's something like a 10% penalty, which is probably acceptable, and computers keep getting faster. So maybe we should just always have bidi support on.

If we do that, then do we really need the \LRE and \RLE commands (or as we used to have in SILE, \font[direction=...]{})? Now that direction is inferred automatically, there shouldn't be a need (as far as I can see) to set direction manually within a paragraph. I think the only reason we would now need support for such a thing is if we really wanted to allow people to deliberately typeset text "back to front". I'm sure they could write their own package for that if necessary. :-)

OK, so we add bidi support to everything and then I think direction is done.
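To make "direction is inferred automatically" concrete, here is a minimal Python sketch of the usual starting point: choosing a paragraph's base direction from its first strong character, in the spirit of rules P2 and P3 of the Unicode Bidirectional Algorithm. It is not SILE's code; the function name `base_direction` is made up for illustration, and the sketch ignores isolates and explicit embeddings.

```python
import unicodedata


def base_direction(text, default="LTR"):
    """Guess a paragraph's base direction from its first strong character.

    Hypothetical helper, not part of SILE. Characters with bidi class
    L are strong left-to-right; R and AL are strong right-to-left.
    Everything else (digits, spaces, punctuation) is skipped.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    # No strong character found: fall back to the caller's default.
    return default
```

With this, a paragraph that opens with an Arabic or Hebrew letter is treated as RTL even if it starts with digits or punctuation, which is why manual direction switches are rarely needed.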

Language

Thinking about language: The user needs to select the language manually to activate hyphenation rules, shaping (Urdu vs Arabic etc.) and other language-specific typographic practices. (Japanese line-breaking rules, kerning, etc.) Also in the future we need language-specific document elements ("Chapter..." etc.) - I haven't given very much thought as to how that will work.

So it's clear that (a) the user should be able to select the language, and (b) this really isn't a font property. This is already reflected in the awkwardness that font.lua contains a bunch of font.whatever settings and then a document.language setting---it's obviously the odd one out. But the reason it's in there is that language needs to be passed to the shaper, and also that when you change the language you may also want to change the script and the font as well so it feels user-friendly to do that as a single command. (But that can obviously be finessed in higher-level packages and commands later.)

Because it doesn't do any harm to have language in the font setting (it's just a little inconsistent), I don't think we'll put this in the 0.9.2 release. (Incidentally, the master branch is now preparing for release, and work towards 0.9.3 is temporarily going on in the devel branch; this will be merged into master after release.)

My suggestion would be:

Should be a nice easy job for someone. :-)

Script

I think script is OK as is. In the vast majority of cases the user doesn't need to specify it, and the only thing that happens with it is that it is passed straight to HarfBuzz to give the user more control over how shaping happens. I can't think that we would want to do anything else with it. Since HarfBuzz does the right thing most of the time, I don't think SILE needs to implement UAX 24.

khaledhosny commented 9 years ago

Direction

If we do that, then do we really need the \LRE and \RLE commands (or as we used to have in SILE, \font[direction=...]{})?

RTL inside LTR text (and the reverse) should just work now, and in the odd case where you want to use a different base direction for the subtext, one can use BiDi control characters like U+202A LEFT-TO-RIGHT EMBEDDING, U+202D LEFT-TO-RIGHT OVERRIDE, etc. We can have shorthand \LRE, \LRO commands that simplify entering them.
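As a rough illustration of what such shorthand commands would boil down to, the Python sketch below wraps a string in the relevant control characters and closes it with U+202C POP DIRECTIONAL FORMATTING. The function name `embed` and its parameters are hypothetical; only the code points themselves come from Unicode.

```python
# Unicode explicit directional formatting characters.
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE


def embed(text, direction="ltr", override=False):
    """Wrap text in an embedding (or override) and terminate it with PDF.

    Hypothetical helper: direction is "ltr" or "rtl"; override=True
    selects LRO/RLO instead of LRE/RLE.
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if override:
        opener = LRO if direction == "ltr" else RLO
    else:
        opener = LRE if direction == "ltr" else RLE
    return opener + text + PDF
```

A \LRE{...} command would then correspond to `embed(content, "ltr")`, and \LRO{...} to `embed(content, "ltr", override=True)`.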

(BTW, we need to update the UBA implementation to support the Unicode 6.3 additions.)

Language

I agree with the proposal above and will try to work on it, but I don’t think we need to deprecate language support in font. It can be useful when you want to use a different font language than the text language (for badly designed or incomplete fonts).

Script

I don’t agree here. HarfBuzz’s script detection is very simple and does not help with characters that have the Common script property. Take for example this string:

ع ab (aa) cd ع

Without proper script detection, the parentheses will be assigned Arabic script. This might be OK for most fonts, but if a font has different substitutions for the parentheses depending on the script, you will get the wrong substitution here. See for example this old version of Amiri Slanted: the first image shows the wrong script detection and the second the right one (I had to drop this feature because many applications were not handling this properly, and I now just use upright parentheses). [Images: script, script2]
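The difference can be sketched in a few lines of Python. This is a toy approximation of the kind of resolution UAX 24 describes, not HarfBuzz's or SILE's code: the lookup `script_of` only knows the basic Arabic block and ASCII letters (an assumption made for this example), and Common characters simply inherit the script of the preceding character. A real implementation would also handle Inherited characters, paired brackets, and Script_Extensions.

```python
def script_of(ch):
    """Toy script lookup covering only what this example needs."""
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        return "Arab"
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "Latn"
    return "Zyyy"  # Common: spaces, parentheses, and so on


def resolve_scripts(text):
    """Assign a script to every character, resolving Common from context."""
    scripts = [script_of(ch) for ch in text]
    # Forward pass: a Common character inherits the preceding script.
    last = None
    for i, s in enumerate(scripts):
        if s == "Zyyy":
            if last is not None:
                scripts[i] = last
        else:
            last = s
    # Leading Common characters take the first real script that follows.
    first = next((s for s in scripts if s != "Zyyy"), "Zyyy")
    return [first if s == "Zyyy" else s for s in scripts]


def script_runs(text):
    """Group the text into (script, substring) runs."""
    runs = []
    for ch, s in zip(text, resolve_scripts(text)):
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + ch)
        else:
            runs.append((s, ch))
    return runs
```

For the string above, `script_runs` yields three runs: an Arabic run, a Latin run containing "ab (aa) cd " with both parentheses, and a final Arabic run. Taking the script from the first strong character of the whole string would instead label everything, parentheses included, as Arabic.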

simoncozens commented 9 years ago

On reflection I think you are right. From a user's perspective I think we would like to support the following:

The last item requires script detection, and the first two require separate commands. Please feel free to implement any or all of this. :-) I am focusing on trying to get Japanese working according to JIS X 4051 / W3C requirements...

Omikhleia commented 1 year ago

Scripts

As seen in #1726, script detection is needed in TTB cases too.