Closed stenskjaer closed 6 years ago
As noted in the closing commit message:
The current `re` module has problems with the identification of word characters, cf. https://bugs.python.org/issue1693050 and https://bugs.python.org/issue12731. Moving to the `regex` module means compliance with the Unicode 10 definition of words, as per https://www.unicode.org/reports/tr29/#Word_Boundaries.

This solves the need for indicating a language, as this should result in a complete coverage of any language with the already existing `\w` match group.
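
A minimal sketch of the word-character problem referred to above, using only the standard library. The helper name `find_words` and the sample strings are illustrative assumptions, not code from this project. It shows how `\w+` in `re` stops at combining marks, which splits words in scripts that rely on them.

```python
import re


def find_words(text, engine=re):
    """Return all runs of word characters found by `engine`.

    `engine` is any module exposing a `findall(pattern, string)`
    function compatible with the standard-library `re` module.
    """
    return engine.findall(r"\w+", text)


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT, then a second word.
decomposed = "cafe\u0301 noir"

# A Devanagari word containing vowel signs and a virama, all of which
# are combining marks rather than letters.
devanagari = "\u0939\u093f\u0928\u094d\u0926\u0940"

# With `re`, the combining marks are not word characters, so the accent
# is dropped from the first word and the Devanagari word falls apart.
print(find_words(decomposed))
print(find_words(devanagari))
```

Assuming the third-party `regex` package is installed, calling `find_words(devanagari, engine=regex)` should keep such words in one piece, since that module documents its `\w` as following the Unicode word definition. That call is left out of the sketch so that it runs without extra dependencies.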
Currently the assumption is that the text consists of material matching `\w`. If a user has an edition outside that class, it will work significantly slower. But we can't just turn the fast matching (currently basically `\w+`) up to match all possible code blocks, because that would make matching the few exceptional cases (`\\{}` and punctuation) much more demanding. So it might be a good idea to make it possible to configure it to use one or more of the other languages. The full list of not included material is
This idea came from #25.
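
The trade-off between the fast `\w+` match and the exceptional cases can be sketched roughly as follows. This is a hypothetical tokenizer written for illustration only; the pattern, the group names and the `tokenize` function are assumptions and do not reproduce the project's actual matching code.

```python
import re

# Hypothetical token pattern. The fast branch comes first, so ordinary
# text is consumed one whole word at a time. The single-character
# branches are only tried where `\w+` does not match.
TOKEN = re.compile(
    r"""
    (?P<word>\w+)          # fast path: a run of word characters
  | (?P<markup>[\\{}])     # exceptional case: backslash and braces
  | (?P<punct>[^\w\s])     # exceptional case: any other non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return (kind, value) pairs for every token in `text`."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


print(tokenize(r"\emph{foo}, bar"))
```

In this sketch, any character that `\w` does not cover falls through to the last branch and is handled one character at a time, which is one plausible reading of why material outside that class is slower to process.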