tapeinosyne / hyphenation

Text hyphenation for Rust
Apache License 2.0

Incorrect handling of hyphens #16

Closed by baskerville 5 years ago

baskerville commented 6 years ago

The following example:

let en_us = Standard::from_embedded(Language::EnglishUS).unwrap();
let hyphenated = en_us.hyphenate("self-aware");
let segments: Vec<&str> = hyphenated.iter().segments().collect();
println!("{:?}", segments);

yields the following output:

["self", "-aware"]

Am I correct in thinking that the proper output should be ["self-", "aware"]?

tapeinosyne commented 6 years ago

Note that iter().segments() only returns string slices without inserting a hyphen before breaks, meaning that your expected output would become ["self--", "aware"] once marked:

let hyphenated = en_us.hyphenate("self-aware");
let collected: Vec<String> = hyphenated.iter().collect();
// collected would be `vec!["self--", "aware"]`

While this is clearly inconsistent, I take it you meant to say that hyphenate() should recognize the existing hyphen and break after it, with the final output being ["self-", "aware"] when marked, whereas the American English dictionary yields ["self-", "-aware"]. Neither is wrong; rather, the output you suggest implies a preference for this style:

To his dismay, Bob realized the smart toaster had become self-
aware

whereas the TeX patterns lean toward doubling the hyphen:

To his dismay, Bob realized the smart toaster had become self-
-aware

Whether to open new lines with a second hyphen when breaking hyphen-joined compounds is an editorial decision, with some languages having stronger conventions than others. With reference to your example, it should be noted that the American English patterns do not cover hyphens generally; this specific result is produced by a rule to break after the "self" substring regardless of what follows.

It should also be noted that the hyphenation API does not handle existing hyphens on its own, as mentioned in the notes about word segmentation; individual dictionaries may do it, but only as a consequence of what's included in the TeX patterns.

(I had originally written v0.7 to handle hyphens independently, but ultimately decided against it, since diverging conventions apply, and adopting one is ultimately an editorial decision. I'm not strictly against offering a default, mind, but not before the library sees some extended usage.)
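The difference between raw slices and marked strings can be sketched without the hyphenation crate at all. The `mark` helper below is hypothetical and not part of the library's API; it only imitates the idea of appending a hyphen to every segment but the last, with an assumed `respect_existing` flag that skips segments already ending in a hyphen.

```rust
/// Hypothetical helper, not part of the hyphenation crate: appends a hyphen
/// to every segment except the last. When `respect_existing` is true, a
/// segment that already ends in '-' is left unchanged.
fn mark(segments: &[&str], respect_existing: bool) -> Vec<String> {
    let last = segments.len().saturating_sub(1);
    segments
        .iter()
        .enumerate()
        .map(|(i, &seg)| {
            let skip = i == last || (respect_existing && seg.ends_with('-'));
            if skip {
                seg.to_string()
            } else {
                format!("{}-", seg)
            }
        })
        .collect()
}

fn main() {
    // Slices as the en-US dictionary produced them in this thread.
    println!("{:?}", mark(&["self", "-aware"], false)); // ["self-", "-aware"]
    // Slices as requested in the report, marked naively...
    println!("{:?}", mark(&["self-", "aware"], false)); // ["self--", "aware"]
    // ...and marked with the existing hyphen taken into account.
    println!("{:?}", mark(&["self-", "aware"], true)); // ["self-", "aware"]
}
```

Which of the last two behaviours is preferable is exactly the editorial choice discussed above; the flag merely makes that choice explicit.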

baskerville commented 6 years ago

Note that iter().segments() only returns string slices without inserting a hyphen before breaks, meaning that your expected output would become ["self--", "aware"] once marked:

let hyphenated = en_us.hyphenate("self-aware");
let collected: Vec<String> = hyphenated.iter().collect();
// collected would be `vec!["self--", "aware"]`

Thanks, I'm aware of that. In fact, I want to deal with the slices rather than the strings with the added hyphens, and I had to dig into the code to realize that .iter().segments() was what I wanted.

Also, it might be worth noting that the current README has broken asserts:

let segments = hyphenated.into_iter();
let collected : Vec<String> = segments.collect();
assert_eq!(collected, vec!["hy", "phen", "ation"]);

While this is clearly inconsistent, I take it you meant to say that hyphenate() should recognize the existing hyphen and break after it, with the final output being ["self-", "aware"] when marked, whereas the American English dictionary yields ["self-", "-aware"].

I knew about the existence of language specific hyphenation rules, but not about this one.

It should also be noted that the hyphenation API does not handle existing hyphens on its own, as mentioned in the notes about word segmentation; individual dictionaries may do it, but only as a consequence of what's included in the TeX patterns. (I had originally written v0.7 to handle hyphens independently, but ultimately decided against it, since diverging conventions apply, and adopting one is ultimately an editorial decision. I'm not strictly against offering a default, mind, but not before the library sees some extended usage.)

Don't worry: I'm using a custom slice iterator when the word I'm hyphenating contains special characters, otherwise I just use hyphenation's slice iterator. The two iterators are unified via Either.
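For readers curious what such an arrangement might look like, here is a rough standalone sketch. It is not the actual code: the `Either` enum is a minimal stand-in for the one in the `either` crate, the hypothetical `segments` function treats the hyphen as the only special character, and `std::iter::once` stands in for hyphenation's own slice iterator so that the example needs nothing beyond the standard library.

```rust
use std::iter::{once, Once};
use std::str::SplitInclusive;

/// Minimal stand-in for the `Either` type from the `either` crate.
enum Either<L, R> {
    Left(L),
    Right(R),
}

// Both variants yield the same item type, so the enum can itself be an iterator.
impl<L, R> Iterator for Either<L, R>
where
    L: Iterator,
    R: Iterator<Item = L::Item>,
{
    type Item = L::Item;

    fn next(&mut self) -> Option<Self::Item> {
        match self {
            Either::Left(l) => l.next(),
            Either::Right(r) => r.next(),
        }
    }
}

/// Hypothetical dispatcher: words containing a hyphen are split right after
/// each hyphen; all other words go to a fallback iterator. In real code the
/// fallback would be the dictionary-based slice iterator; here it simply
/// yields the whole word.
fn segments(word: &str) -> Either<SplitInclusive<'_, char>, Once<&str>> {
    if word.contains('-') {
        Either::Left(word.split_inclusive('-'))
    } else {
        Either::Right(once(word))
    }
}

fn main() {
    let compound: Vec<&str> = segments("self-aware").collect();
    println!("{:?}", compound); // ["self-", "aware"]

    let plain: Vec<&str> = segments("toaster").collect();
    println!("{:?}", plain); // ["toaster"]
}
```

Wrapping the two iterators in one enum keeps the call site free of boxing or dynamic dispatch, at the cost of a small forwarding `impl`.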

tapeinosyne commented 6 years ago

Thanks, I'm aware of that. In fact, I want to deal with the slices rather than the strings with the added hyphens, and I had to dig into the code to realize that .iter().segments() was what I wanted.

Mh, I'll add some module documentation. The function is nominally documented but not all that visible.

Also, it might be worth noting that the current README has broken asserts:

Thank you, I'll fix those. (I test code examples manually because Cargo / Rustdoc require them to be laid out in a rather unwieldy fashion, and things do slip through.)