Open dhardy opened 8 months ago
Test case (extra line-breaks inserted to allow correct rendering in the browser):
Example text in multiple languages.
مثال على نص بلغات متعددة.
Пример текста на нескольких языках.
טקסט לדוגמא במספר שפות.
Level runs:
None:
Level(0), text[0..35]: 'Example text in multiple languages.', HardBreak, breaks=[8, 13, 16, 25]
Level(1), text[36..81]: 'مثال على نص بلغات متعددة.', NoBreak, breaks=[68, 57, 52, 45]
Level(1), text[81..81]: '', HardBreak, breaks=[]
Level(0), text[82..147]: 'Пример текста на нескольких языках.', HardBreak, breaks=[95, 108, 113, 134]
Level(1), text[148..190]: 'טקסט לדוגמא במספר שפות.', breaks=[181, 170, 157]
Some(LTR_LEVEL):
Level(0), text[0..35]: 'Example text in multiple languages.', HardBreak, breaks=[8, 13, 16, 25]
Level(1), text[36..80]: 'مثال على نص بلغات متعددة', NoBreak, breaks=[68, 57, 52, 45]
Level(0), text[80..81]: '.', NoBreak, breaks=[]
Level(0), text[81..81]: '', HardBreak, breaks=[]
Level(0), text[82..147]: 'Пример текста на нескольких языках.', HardBreak, breaks=[95, 108, 113, 134]
Level(1), text[148..189]: 'טקסט לדוגמא במספר שפות', NoBreak, breaks=[181, 170, 157]
Level(0), text[189..190]: '.', breaks=[]
Some(RTL_LEVEL):
Level(2), text[0..34]: 'Example text in multiple languages', NoBreak, breaks=[8, 13, 16, 25]
Level(1), text[34..35]: '.', HardBreak, breaks=[]
Level(1), text[36..81]: 'مثال على نص بلغات متعددة.', NoBreak, breaks=[68, 57, 52, 45]
Level(1), text[81..81]: '', HardBreak, breaks=[]
Level(2), text[82..146]: 'Пример текста на нескольких языках', NoBreak, breaks=[95, 108, 113, 134]
Level(1), text[146..147]: '.', HardBreak, breaks=[]
Level(1), text[148..190]: 'טקסט לדוגמא במספר שפות.', breaks=[181, 170, 157]
Yes, it does. The naming comes from the standard naming for HL1 from other bidi implementations, including ICU4C.
While typically HL1 is viewed as a higher level algorithm, in practice it really should just be an input to the main algorithm. This is a bit of a discrepancy between the spec and how real world users tend to conceptualize the algorithm.
It's "default" because further embedding will change the level: it is not the level of the paragraph, it is the level the paragraph starts with. "override" carries a stronger connotation here.
I would prefer to just document this better.
Personally I'd still like to see something like this:
pub fn BidiInfo::new(text: &str, base_para_level: Option<Level>, default_base_para_level: Level) -> Self
Doc:
When
base_para_level
isNone
, the base paragraph direction is inferred from the first strongly-directional character oftext
. If no such character is found,default_base_para_level
is used. If, instead,base_para_level = Some(level)
, then the base paragraph direction is defined bylevel
; alternative-direction texts form inclusions within this paragraph (excluding trailing undirectional formatting such as a full-stop).
Or maybe just:
pub fn BidiInfo::new(text: &str, infer: bool, base_para_level: Level) -> Self
Doc:
When
infer = true
, the base paragraph direction is inferred from the first strongly-directional character oftext
. If no such character is found,base_para_level
is used. If, instead,infer = false
, then the base paragraph direction is defined bybase_para_level
; alternative-direction texts form inclusions within this paragraph (excluding trailing undirectional formatting such as a full-stop).
Then BidiInfo::new(text, false, default_para_level.unwrap_or(LTR_LEVEL))
would be equivalent to the current functionality.
This doesn't really match typical usage patterns of the bidi algorithm.
Basically, either you have a signal from the embedding environment (e.g. your html dir attribute, or surrounding bidi text in the case of embedded inline content) as to what level to use, or you don't.
I don't want to overrotate on how the spec conceptualizes these things. The spec probably should have HL1 clarified.
html dir attribute
In this case, the current behaviour makes sense. For CSS's direction
property it doesn't in my opinion, though of course I'm not going to advocate for breaking web standards.
I see you already have this in your code-base:
/// TODO: Support auto-RTL base direction
Personally, I've decided to use the following:
/// Directionality of text
///
/// Texts may be bi-directional as specified by Unicode Technical Report #9.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Ord, PartialOrd, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[repr(u8)]
pub enum Direction {
/// Auto-detect direction
///
/// The text direction is inferred from the first strongly-directional
/// character. In case no such character is found, the text will be
/// left-to-right.
#[default]
Auto = 2,
/// Auto-detect, default right-to-left
///
/// The text direction is inferred from the first strongly-directional
/// character. In case no such character is found, the text will be
/// right-to-left.
AutoRtl = 3,
/// The base text direction is left-to-right
///
/// If the text contains right-to-left content, this will be considered an
/// embedded right-to-left run. Non-directional leading and trailing
/// characters (e.g. a full stop) will normally not be included within this
/// right-to-left section.
///
/// This uses Unicode TR9 HL1 to set an explicit paragraph embedding level of 0.
Ltr = 0,
/// The base text direction is right-to-left
///
/// If the text contains left-to-right content, this will be considered an
/// embedded left-to-right run. Non-directional leading and trailing
/// characters (e.g. a full stop) will normally not be included within this
/// left-to-right section.
///
/// This uses Unicode TR9 HL1 to set an explicit paragraph embedding level of 1.
Rtl = 1,
}
IIUC AutoRtl
can be emulated by calling get_base_direction
first.
In this case, the current behaviour makes sense. For CSS's
direction
property it doesn't in my opinion, though of course I'm not going to advocate for breaking web standards.
because direction
doesn't support auto
? Yeah, the direction
style needs some love, but also it's not recommended in HTML so i don't think there's much impetus to improve it.
I believe calling it the base direction in the API would be fine though if we're doing that I would want documentation that uses both names.
It may be worth documenting how to get AutoRtl though really I think this is a thing that would benefit from a new API. Unfortunately it's not easy to change the current one without it being a breaking change.
Note that the basic enum is insufficient for this crate: there are valid use cases for setting a base level that is greater than 1, typically for doing incremental updates to embedded content.
Presumably it does, but it is not clear how (in particular, the name may be wrong).
HL1 states:
... then gives some suggestions for how this may do so, but no specific requirement. One of those suggestions is to follow P2 and P3, but default to level 1 (RTL).
What I find from testing is that providing
Some
value viadefault_para_level
toBidiInfo::new
overrides the paragraph level. Given the name, I would expect that instead it only acts as a default (i.e. when P2 does not find a strongly directional character).Suggestions:
default_para_level
tooverride_para_level
default_para_level
act only when no strongly-directional (L/AL/R) character is found. (In this case, it does not need to be optional.)pub fn new(text: &str, default_para_level: Level, override_para_level: Option<Level>) -> BidiInfo<'_>
From a usage point-of-view, I can see value in being able to specify a default line direction (i.e. suggestion 2) for e.g. an input field when it may be empty or contain only formatting characters. The current behaviour (override) achieves the same thing, but also messes up formatting when the directionality is detectable (e.g. a sentence with a full stop as in #52), for little gain.