laurikari / tre

The approximate regex matching library and agrep command line tool.
Other
805 stars 134 forks source link

Trouble with wide characters on Fedora 20 #17

Open treveradams opened 10 years ago

treveradams commented 10 years ago

This is in relation to code I am trying to fix on C-ICAP Classify (LGPL, located at https://github.com/treveradams/C-ICAP-Classify/).

I do not speak Bulgarian and many other languages I use with this project, so this may not be unique to Bulgarian.

I find html tags using: treregwcomp(&htmlFinder, L"(</?(P|(BR))[^>]>[[:space:]])|(</?!?\w+((\s+\w+(\s=\s(?:\".?\"|'.?'|[^'\">\s]+))?)+\s|\s)/?>)((</?(P|(BR))[^>]>)[[:space:]]))_", REG_EXTENDED | REG_ICASE);

I need to save alt and title tags, I do this (alt version only changes the tag part): treregwcomp(&title1, L" title=\s((\"._?\"|'.*?')|[^'\">\s]+)", REG_EXTENDED | REG_ICASE);

The document which is in wchar_t is then searched using: tre_regwnexec(&title1, ...)

The problem is on Bulgarian language documents, the appropriate matchset is truncated. This doesn't happen on English and it seems to work on many other languages (I only speak two others, so it is a guess).

Package on Fedora 20: tre-0.8.0-8.fc20.x86_64

This appears to be caused by the engine seeing some character sequences as double quote or single quote instead of the correct character.

dmurdoch commented 10 years ago

This may be related to a bug just fixed in R, which is using a slightly modified TRE 0.8.0. When working on wchars, character classes like [:space:] are sometimes not matched properly (in fact they match any character). In 8 bit characters those are translated into the specific list of matches, but the wchar code is supposed to use an OS service to determine the character class.

The problem is in the tre_copy_ast function in tre_compile.c. These lines (starting around line 717)

    *result = tre_ast_new_literal(mem, min, max, pos);
    if (*result == NULL)
      status = REG_ESPACE;

don't copy the character class pointer. They need an extra line to do that. My patch adds it after, i.e.

    *result = tre_ast_new_literal(mem, min, max, pos);
    if (*result == NULL)
      status = REG_ESPACE;
    ((tre_literal_t*)(*result)->obj)->u.class = lit->u.class;  

It would be better to do this in the tre_ast_new_literal function (by adding a new argument for the class), but then there are a few other places that would need modifying too.

I'm not a regular git or GitHub user, so I don't know how to put this into a patch here. If someone else wants to do that, feel free.