servo / unicode-bidi

Implementation of the Unicode Bidirection Algorithm in Rust
Other
78 stars 33 forks source link

[RFE] Provide an API to explicitly process a single paragraph #114

Closed jfkthame closed 9 months ago

jfkthame commented 10 months ago

Clients that have already divided text into paragraphs could use a more efficient API that does not split on paragraph separators, and only has to return a single paragraph record rather than a vector.

@Manishearth Is this something you would be open to adding as a new API? (Exact name etc TBD...)

Manishearth commented 10 months ago

Open to it if it can be done with minimal maintenance burden and documented well; it's already somewhat hard to figure out how these APIs are to be used.

Perhaps post an idea of what you think the api would look like first?

jfkthame commented 10 months ago

OK, here's a strawman for consideration. To minimize maintenance and learning burdens, let's keep the API directly parallel to the existing BidiInfo, except that instead of creating a Vec<ParagraphInfo>, it just records a single paragraph level. This is the only piece of ParagraphInfo we need for the (single) paragraph being processed (its range is implicitly the entire text that was passed).

So it'd be something like:

pub struct ParagraphBidiInfo<'text> {
    pub text: &'text str,
    pub original_classes: Vec<BidiClass>,
    pub levels: Vec<Level>,
    pub paragraph_level: Level,
}

and would provide the same methods as BidiInfo, except that they'd never need a paragraph parameter. So we'd implement:

pub fn new(text: &str, default_para_level: Option<Level>) -> BidiInfo<'_>

pub fn new_with_data_source<'a, D: BidiDataSource>( data_source: &D, text: &'a str, default_para_level: Option<Level> ) -> BidiInfo<'a>

pub fn reordered_levels( &self, line: Range<usize> ) -> Vec<Level>

pub fn reordered_levels_per_char( &self, line: Range<usize> ) -> Vec<Level>

pub fn reorder_line( &self, line: Range<usize> ) -> Cow<'text, str>

pub fn visual_runs( &self, line: Range<usize> ) -> (Vec<Level>, Vec<LevelRun>)

pub fn has_rtl(&self) -> bool

(and the same in utf16).

Basically all the implementation would be shared with the existing APIs, and documentation would be the same modulo dropping paragraph parameters.

One thing we'd need to verify is how this would/should behave if a paragraph-separator character does occur in the text. Arguably, that'd be a usage error, but obviously we'd need to make sure it's handled in a well-defined and safe way.

WDYT?

Manishearth commented 10 months ago

I guess that's fine. I was hoping this could be done with a single API addition, but I guess it can't. If our fields weren't public I would have suggested we make the paragraphs field of BidiInfo a SmallVec and just use the same API.

Yeah, this is fine, when writing the duplicate APIs please add comments saying that they should be kept mostly in sync with the other ones.

jfkthame commented 9 months ago

This was done in #115.