patrickfrey / strusUtilities

A set of command line programs to access the strus information retrieval engine
http://www.project-strus.net
Mozilla Public License 2.0

Problems with detecting word boundaries (\b) and UTF-8 #70

Open patrickfrey opened 5 years ago

patrickfrey commented 5 years ago

Some word boundaries are not detected correctly in UTF-8 input.
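As a rough illustration of this class of problem (a sketch using Python's `re` module, not the strus pattern lexer, whose internals are not shown here), a byte-oriented matcher treats only ASCII bytes as word characters. It therefore sees a boundary inside a word wherever a multi-byte UTF-8 sequence occurs. The sample text is made up for the demonstration.

```python
import re

text = "Müller kam"

# Byte-oriented matching: with a bytes pattern, \w and \b only know ASCII,
# so the two bytes of the UTF-8 encoded "ü" (0xC3 0xBC) count as non-word bytes.
byte_words = re.findall(rb"\b\w+\b", text.encode("utf-8"))

# Code-point-oriented matching: with a str pattern, \w is Unicode aware,
# so "ü" is a word character and no boundary is reported inside "Müller".
char_words = re.findall(r"\b\w+\b", text)

print(byte_words)  # [b'M', b'ller', b'kam']
print(char_words)  # ['Müller', 'kam']
```

The byte-level result splits "Müller" in two because a word boundary is reported on both sides of the encoded "ü".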

patrickfrey commented 5 years ago

One proposed workaround is to force the use of a virtual character table, either with the option %LEXER BYTECHAR in the pattern matcher program or with the defineOption method of the pattern lexer interface. This is not a general fix, but it may shift some character definitions and help in some cases (e.g. German umlauts).

patrickfrey commented 5 years ago

For input encoded in ISO Latin-1, the %LEXER BYTECHAR option works.
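One plausible reason why a byte-level character table is sufficient for ISO Latin-1 is that every character there is exactly one byte, so a 256-entry table can classify each of them. The sketch below is a hypothetical illustration in Python. The names `WORD_BYTE` and `split_words` are invented for this example, and it does not claim to reproduce what %LEXER BYTECHAR does internally.

```python
# Hypothetical 256-entry table: byte value -> "is a word character".
# Each byte is interpreted as its ISO-8859-1 (Latin-1) character.
WORD_BYTE = [
    bytes([b]).decode("latin-1").isalnum() or b == ord("_")
    for b in range(256)
]

def split_words(data: bytes) -> list:
    """Split a byte string into maximal runs of word bytes."""
    words, start = [], None
    for i, b in enumerate(data):
        if WORD_BYTE[b]:
            if start is None:
                start = i          # a word begins here
        elif start is not None:
            words.append(data[start:i])  # a word ended just before i
            start = None
    if start is not None:
        words.append(data[start:])       # word running to the end of input
    return words

# In Latin-1, "ü" is the single byte 0xFC, which the table marks as a word byte.
latin1_words = split_words("Müller kam".encode("latin-1"))
print(latin1_words)  # [b'M\xfcller', b'kam']
```

The same table cannot be correct for UTF-8 in general, because there a single character may span several bytes and a per-byte classification has no notion of that.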