kappa / Text-Hyphen

Hyphenation using Knuth-Liang algorithm

Allow loading patterns and exceptions from files or an array reference #1

Open bpj opened 7 years ago

bpj commented 7 years ago

The place to find suitable files is http://mirror.ctan.org/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/ right from the people who maintain TeX hyphenation files.
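To make the request concrete, here is a minimal sketch of what loading patterns "from a file or an array reference" could look like. It is not Text::Hyphen's actual API: `parse_pattern` and `load_patterns` are hypothetical names, and the code assumes the CTAN txt files contain one Liang-style pattern per line (for example `.ach4`), encoded in UTF-8.

```perl
use strict;
use warnings;

# Split one Liang pattern such as ".ach4" into its letters (".ach")
# and a list of inter-letter weights (0,0,0,0,4). The weight list is
# always one element longer than the letter string.
sub parse_pattern {
    my ($pattern) = @_;
    my @weights = (0);
    my $letters = '';
    for my $ch (split //, $pattern) {
        if ($ch =~ /\A[0-9]\z/) {
            $weights[-1] = $ch;      # digit applies to the current gap
        }
        else {
            $letters .= $ch;
            push @weights, 0;        # open a new gap after this letter
        }
    }
    return ($letters, \@weights);
}

# Hypothetical loader: accepts either an array reference of pattern
# strings or the name of a file with one pattern per line (assumed format).
sub load_patterns {
    my ($source) = @_;
    my @lines;
    if (ref $source eq 'ARRAY') {
        @lines = @$source;
    }
    else {
        open my $fh, '<:encoding(UTF-8)', $source
            or die "Cannot open $source: $!";
        @lines = <$fh>;
        close $fh;
    }

    my %patterns;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;     # trim whitespace and newline
        next if $line eq '';
        my ($letters, $weights) = parse_pattern($line);
        $patterns{$letters} = $weights;
    }
    return \%patterns;
}
```

With this shape, `load_patterns(['.ach4', '1na'])` and `load_patterns('some-pattern-file.txt')` would return the same kind of hash reference, so the rest of the hyphenator would not need to care where the patterns came from.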

PhilterPaper commented 3 years ago

Indeed, there is a large supply of hyphenation pattern files (and exception lists) at ctan.org. Now, what would be the most useful way to make them available to users of Text::Hyphen? I suspect that most users will not want the overhead (and the failures due to connectivity problems) of reading directly from the CTAN site every time they use Text::Hyphen. I'm not sure that packaging all these files into Text::Hyphen (rather than into individual Text::Hyphen::XX modules) would be good: it would be bulky, and any given user will need only a few of the languages. Is there a way to subscribe to pattern/exception files for just the languages one is interested in? Perhaps one could then build a local Text::Hyphen::XX on the fly, or add the files to (or update them in) Text::Hyphen?

I think it would be useful to extend Text::Hyphen::new() with a $lang language option (default en_US) that brings in the appropriate pattern/exception files for that language, in the manner of Text::Hyphen::XX, for the hyphenation object being created. That way, you could hyphenate multiple languages in one session by using multiple hyphenation objects, if desired. $lang might then trigger pulling in the pattern and exception files from a local library under Text::Hyphen. It would be up to the owner of the system to keep these files in sync with CTAN (I doubt that they will change very often). There would be no need to maintain separate Text::Hyphen::XX modules. Perhaps Text::Hyphen could look in a local cache for the desired language content and, if it is not found, fetch it from CTAN? It might also be good to have a local utility that checks with CTAN and refreshes any changed language files, as well as loading new ones on request.

Text/
   Hyphen/
      existing code
      languages/
         en_US/    shipped with Text::Hyphen
         de_DE/    optional German pattern/exceptions, etc.
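The "look in a local cache, and if not found, fetch from CTAN" idea could be sketched roughly as follows. Everything here is an assumption for illustration: `cached_pattern_file` and `fetch_from_ctan` are hypothetical names, the cache layout follows the tree above, and the caller is expected to supply the CTAN file name, because the mapping from a tag like en_US to a CTAN file name is not always mechanical. The downloader is passed in as a code reference so that the cache logic can be exercised without network access.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use HTTP::Tiny;

# Base URL taken from the first comment in this thread.
my $CTAN_BASE =
    'http://mirror.ctan.org/language/hyph-utf8/tex/generic/hyph-utf8/patterns/txt';

# Default downloader (hypothetical helper): saves $url to $target.
sub fetch_from_ctan {
    my ($url, $target) = @_;
    my $res = HTTP::Tiny->new->mirror($url, $target);
    die "Download of $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return;
}

# Return the local path of a pattern file, downloading it only when
# it is not already in the cache directory for that language.
sub cached_pattern_file {
    my ($cache_root, $lang, $ctan_name, $fetch) = @_;
    $fetch ||= \&fetch_from_ctan;

    my $dir  = File::Spec->catdir($cache_root, $lang);
    my $file = File::Spec->catfile($dir, $ctan_name);
    return $file if -e $file;        # cache hit: no network access

    make_path($dir);
    $fetch->("$CTAN_BASE/$ctan_name", $file);
    return $file;
}
```

A call might look like `cached_pattern_file('Text/Hyphen/languages', 'en_US', 'hyph-en-us.pat.txt')`, where the file name is an assumed example of CTAN's naming. A separate refresh utility, as suggested above, could reuse the same downloader for files that already exist.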

OK, enough coffee-fueled ramblings for today!

PhilterPaper commented 3 years ago

  1. I vaguely recall that languages such as German (DE) have some strange behavior at word splits, doubling one of the letters or something. Does anyone know if the Text::Hyphen package handles that correctly? Is it specified in the CTAN and Text::Hyphen::DE patterns? Even if this behavior has been dropped in current German orthography, I'm sure that some people will want to be able to do it for older texts. It's not so much a matter of figuring where the hyphenation point is, as it is which letter to double (and where), and to account for this in fragment lengths.
  2. This winter I might have some time to extend this package to read pattern and exception files from CTAN, and come up with a PR. But first, I need to know if KAPPA would be happy to merge such a PR. I'll open an issue to draw attention to this discussion (being in a PR, it may be hidden from many).
  3. In package TeX::Hyphen there is a "file" parameter for specifying the pattern file (apparently combined patterns and exceptions), and a "style" parameter that has something to do with the language. The documentation mumbles something about language-specific shortcuts, but I haven't explored it. I should probably do a thorough comparison with Text::Hyphen before investing any more time in either.
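
For item 1, the behavior in question is probably the pre-1996 German rule where "ck" breaks as "k-k" (Zucker becomes Zuk-ker) and a consonant dropped in a compound is restored at the break (Schiffahrt becomes Schiff-fahrt). The sketch below only illustrates why fragment lengths change. It says nothing about how Text::Hyphen or the CTAN patterns treat these cases; `split_old_german` is a hypothetical helper, and the caller has to say which letter to restore, since a plain break position does not carry that information.

```perl
use strict;
use warnings;

# Split $word after $pos characters using old-orthography rules.
# $restore (optional) is a letter to re-insert at the start of the
# right-hand fragment, as in Schiffahrt -> Schiff-fahrt.
sub split_old_german {
    my ($word, $pos, $restore) = @_;
    my $left  = substr($word, 0, $pos);
    my $right = substr($word, $pos);

    # A break between "c" and "k" turns the "c" into "k".
    if ($left =~ /c\z/ && $right =~ /\Ak/) {
        $left =~ s/c\z/k/;
    }

    # Restore a dropped consonant if the caller asked for it.
    $right = $restore . $right if defined $restore;

    return ($left, $right);
}
```

For example, `split_old_german('Zucker', 3)` keeps the total length but changes a letter, while `split_old_german('Schiffahrt', 6, 'f')` yields fragments that together are one character longer than the unbroken word, which is what a line-filling routine would need to account for.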