hyphenation / tex-hyphen

Hyphenation patterns for TeX

US English patterns are buggy #15

Closed roozbehp closed 5 years ago

roozbehp commented 7 years ago

Debugging an Android user report, I found that Android was hyphenating the words "democrat" and "democrats" incorrectly, as:

de-mo-c-rat de-moc-rats

While Merriam Webster was recommending:

dem-o-crat

And Plain TeX was hyphenating as:

demo-crat democrats

Digging deeper, the source of the problem seems to be the following pattern in hyph-en-us.pat.txt:

5moc1ra1t

That pattern seems to not exist in Plain TeX's pattern file for US English. The other patterns applying to those words, all existing in Plain TeX, are:

1mo 4mocr 5crat.

I think the source of the problem is that the authors of the extended pattern file derived the modified patterns from TUGboat's exception list: they created the "5moc1ra1t" pattern based on the word "de-moc-ra-tism" and didn't notice that adding it would cause "democrat" and "democrats" to be hyphenated incorrectly.
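To make the interaction between these patterns concrete, here is a minimal Python sketch of Liang-style pattern matching restricted to the four patterns quoted above. It is not the code Android or TeX actually runs, and the real pattern files contain thousands of other patterns; the function names, the two pattern lists, and the minimum fragment lengths (2 letters on the left, 3 on the right, the values plain TeX uses for English) are assumptions made for illustration.

```python
import re

def parse_pattern(pat):
    """Split a pattern such as '5moc1ra1t' into its letters and the
    digit attached to each gap (0 where no digit is written)."""
    letters = re.sub(r"\d", "", pat)
    digits = [0] * (len(letters) + 1)
    i = 0
    for ch in pat:
        if ch.isdigit():
            digits[i] = int(ch)
        else:
            i += 1
    return letters, digits

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Apply the given patterns to `word`; an odd maximum in a gap
    allows a break there, an even maximum forbids it."""
    padded = "." + word + "."          # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[k]: gap before padded[k]
    for pat in patterns:
        letters, digits = parse_pattern(pat)
        start = padded.find(letters)
        while start != -1:
            for offset, d in enumerate(digits):
                scores[start + offset] = max(scores[start + offset], d)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    # Only gaps that leave at least left_min / right_min letters count.
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:     # gap before word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)

# Hypothetical subsets: only the patterns quoted in this report.
PLAIN = ["1mo", "4mocr", "5crat."]
EXTENDED = PLAIN + ["5moc1ra1t"]

for w in ("democrat", "democrats"):
    print(w, hyphenate(w, PLAIN), hyphenate(w, EXTENDED))
# democrat demo-crat de-mo-c-rat
# democrats democrats de-moc-rats
```

With only these patterns the sketch reproduces both behaviours reported above: the 5 in "5moc1ra1t" outranks the 4 in "4mocr" and opens a break before "m", and its two 1s add breaks before "r" and "t" (the last one is suppressed by the right-hand minimum).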

I guess these two words are not the only ones affected, and there are probably tens of other words that suffer from the same problem of over-weighting the exception list.

huftis commented 7 years ago

Note that the New Oxford Spelling Dictionary recommends

demo-crat

But it also recommends

dem|oc¦ra|tise

where | indicates a primary hyphenation point and ¦ a secondary hyphenation point. But I guess this makes sense, since it is consistent with the way the words are pronounced.

reutenauer commented 6 years ago

Thanks for the report; I doubt we’ll be updating the patterns in TeX distributions, but it’s always good to know.

mojca commented 6 years ago

Shouldn't we report this to the author and Barbara Beeton?

reutenauer commented 6 years ago

Sure, we can, but Barbara is already aware, see http://tug.org/pipermail/tex-hyphen/2017-June/001613.html (although she doesn’t seem to have published the new list of hyphenation exceptions); and you know as well as I do that even if we can improve the current en-US patterns there will be immense resistance to changing them in distributions due to stability concerns. But I’ll contact Gerard Kuiken if I can find the time.

reutenauer commented 5 years ago

Closing; we unfortunately feel we may not simply update the patterns because of the TeX community’s vaguely formulated compatibility policies, but I have noted this as a “known bug” (only two at the moment, this one and another one which is rather funny).

We would be happy to hear about other reported errors in the patterns, and if there is a tracker that we can follow, please let us know!

kberry commented 3 years ago

de-moc-rats (etc.) is a bug in Kuiken's usenglishmax. Knuth's patterns find no hyphenation points, a different/expected/unimportant bug.

Can't usenglishmax have a different list of exceptions than the regular english? I see no prospect of updating Kuiken's patterns, unless you want to try contacting him. I haven't communicated with him in more than a decade.