Does default_para_level correspond to TR9#HL1?

servo / unicode-bidi

Implementation of the Unicode Bidirection Algorithm in Rust

Other

78 stars 33 forks source link

Does default_para_level correspond to TR9#HL1? #129

Open dhardy opened 8 months ago

dhardy commented 8 months ago

Presumably it does, but it is not clear how (in particular, the name may be wrong).

HL1 states:

Override P3, and set the paragraph embedding level explicitly. This does not apply when deciding how to treat FSI in rule X5c.

... then gives some suggestions for how this may do so, but no specific requirement. One of those suggestions is to follow P2 and P3, but default to level 1 (RTL).

What I find from testing is that providing Some value via default_para_level to BidiInfo::new overrides the paragraph level. Given the name, I would expect that instead it only acts as a default (i.e. when P2 does not find a strongly directional character).

Suggestions:

Rename default_para_level to override_para_level
Make default_para_level act only when no strongly-directional (L/AL/R) character is found. (In this case, it does not need to be optional.)
Revise to support both cases: pub fn new(text: &str, default_para_level: Level, override_para_level: Option<Level>) -> BidiInfo<'_>

From a usage point-of-view, I can see value in being able to specify a default line direction (i.e. suggestion 2) for e.g. an input field when it may be empty or contain only formatting characters. The current behaviour (override) achieves the same thing, but also messes up formatting when the directionality is detectable (e.g. a sentence with a full stop as in #52), for little gain.

dhardy commented 8 months ago

Test case (extra line-breaks inserted to allow correct rendering in the browser):

Example text in multiple languages.

مثال على نص بلغات متعددة.

Пример текста на нескольких языках.

טקסט לדוגמא במספר שפות.

Level runs:

None:
        Level(0), text[0..35]: 'Example text in multiple languages.', HardBreak, breaks=[8, 13, 16, 25]
        Level(1), text[36..81]: 'مثال على نص بلغات متعددة.', NoBreak, breaks=[68, 57, 52, 45]
        Level(1), text[81..81]: '', HardBreak, breaks=[]
        Level(0), text[82..147]: 'Пример текста на нескольких языках.', HardBreak, breaks=[95, 108, 113, 134]
        Level(1), text[148..190]: 'טקסט לדוגמא במספר שפות.', breaks=[181, 170, 157]

Some(LTR_LEVEL):
        Level(0), text[0..35]: 'Example text in multiple languages.', HardBreak, breaks=[8, 13, 16, 25]
        Level(1), text[36..80]: 'مثال على نص بلغات متعددة', NoBreak, breaks=[68, 57, 52, 45]
        Level(0), text[80..81]: '.', NoBreak, breaks=[]
        Level(0), text[81..81]: '', HardBreak, breaks=[]
        Level(0), text[82..147]: 'Пример текста на нескольких языках.', HardBreak, breaks=[95, 108, 113, 134]
        Level(1), text[148..189]: 'טקסט לדוגמא במספר שפות', NoBreak, breaks=[181, 170, 157]
        Level(0), text[189..190]: '.', breaks=[]

Some(RTL_LEVEL):
        Level(2), text[0..34]: 'Example text in multiple languages', NoBreak, breaks=[8, 13, 16, 25]
        Level(1), text[34..35]: '.', HardBreak, breaks=[]
        Level(1), text[36..81]: 'مثال على نص بلغات متعددة.', NoBreak, breaks=[68, 57, 52, 45]
        Level(1), text[81..81]: '', HardBreak, breaks=[]
        Level(2), text[82..146]: 'Пример текста на нескольких языках', NoBreak, breaks=[95, 108, 113, 134]
        Level(1), text[146..147]: '.', HardBreak, breaks=[]
        Level(1), text[148..190]: 'טקסט לדוגמא במספר שפות.', breaks=[181, 170, 157]

Manishearth commented 8 months ago

Yes, it does. The naming comes from the standard naming for HL1 from other bidi implementations, including ICU4C.

While typically HL1 is viewed as a higher level algorithm, in practice it really should just be an input to the main algorithm. This is a bit of a discrepancy between the spec and how real world users tend to conceptualize the algorithm.

It's "default" because further embedding will change the level: it is not the level of the paragraph, it is the level the paragraph starts with. "override" carries a stronger connotation here.

I would prefer to just document this better.

dhardy commented 8 months ago

Personally I'd still like to see something like this:

pub fn BidiInfo::new(text: &str, base_para_level: Option<Level>, default_base_para_level: Level) -> Self

Doc:

When base_para_level is None, the base paragraph direction is inferred from the first strongly-directional character of text. If no such character is found, default_base_para_level is used. If, instead, base_para_level = Some(level), then the base paragraph direction is defined by level; alternative-direction texts form inclusions within this paragraph (excluding trailing undirectional formatting such as a full-stop).

Or maybe just:

pub fn BidiInfo::new(text: &str, infer: bool, base_para_level: Level) -> Self

Doc:

When infer = true, the base paragraph direction is inferred from the first strongly-directional character of text. If no such character is found, base_para_level is used. If, instead, infer = false, then the base paragraph direction is defined by base_para_level; alternative-direction texts form inclusions within this paragraph (excluding trailing undirectional formatting such as a full-stop).

Then BidiInfo::new(text, false, default_para_level.unwrap_or(LTR_LEVEL)) would be equivalent to the current functionality.

Manishearth commented 8 months ago

This doesn't really match typical usage patterns of the bidi algorithm.

Basically, either you have a signal from the embedding environment (e.g. your html dir attribute, or surrounding bidi text in the case of embedded inline content) as to what level to use, or you don't.

I don't want to overrotate on how the spec conceptualizes these things. The spec probably should have HL1 clarified.

dhardy commented 8 months ago

html dir attribute

In this case, the current behaviour makes sense. For CSS's direction property it doesn't in my opinion, though of course I'm not going to advocate for breaking web standards.

I see you already have this in your code-base:

    /// TODO: Support auto-RTL base direction

Personally, I've decided to use the following:

/// Directionality of text
///
/// Texts may be bi-directional as specified by Unicode Technical Report #9.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Ord, PartialOrd, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[repr(u8)]
pub enum Direction {
    /// Auto-detect direction
    ///
    /// The text direction is inferred from the first strongly-directional
    /// character. In case no such character is found, the text will be
    /// left-to-right.
    #[default]
    Auto = 2,
    /// Auto-detect, default right-to-left
    ///
    /// The text direction is inferred from the first strongly-directional
    /// character. In case no such character is found, the text will be
    /// right-to-left.
    AutoRtl = 3,
    /// The base text direction is left-to-right
    ///
    /// If the text contains right-to-left content, this will be considered an
    /// embedded right-to-left run. Non-directional leading and trailing
    /// characters (e.g. a full stop) will normally not be included within this
    /// right-to-left section.
    ///
    /// This uses Unicode TR9 HL1 to set an explicit paragraph embedding level of 0.
    Ltr = 0,
    /// The base text direction is right-to-left
    ///
    /// If the text contains left-to-right content, this will be considered an
    /// embedded left-to-right run. Non-directional leading and trailing
    /// characters (e.g. a full stop) will normally not be included within this
    /// left-to-right section.
    ///
    /// This uses Unicode TR9 HL1 to set an explicit paragraph embedding level of 1.
    Rtl = 1,
}

IIUC AutoRtl can be emulated by calling get_base_direction first.

Manishearth commented 8 months ago

In this case, the current behaviour makes sense. For CSS's direction property it doesn't in my opinion, though of course I'm not going to advocate for breaking web standards.

because direction doesn't support auto? Yeah, the direction style needs some love, but also it's not recommended in HTML so i don't think there's much impetus to improve it.

I believe calling it the base direction in the API would be fine though if we're doing that I would want documentation that uses both names.

It may be worth documenting how to get AutoRtl though really I think this is a thing that would benefit from a new API. Unfortunately it's not easy to change the current one without it being a breaking change.

Note that the basic enum is insufficient for this crate: there are valid use cases for setting a base level that is greater than 1, typically for doing incremental updates to embedded content.