laurikari / tre

The approximate regex matching library and agrep command line tool.

collation order vs encoding order in range matching #88

Open bbolker opened 1 year ago

bbolker commented 1 year ago

The TRE documentation defines a range as

Two characters separated by -. This is shorthand for the full range of characters between those two (inclusive) in the collating sequence.

(here in the repository)

However, testing with the Estonian locale (in R's imported version of TRE) shows that T is incorrectly matched by [A-Z] (in the Estonian collating sequence Z sorts before T, so by the documented definition T should fall outside the range) ... this comment says

/* XXX - Should use collation order instead of encoding values in character ranges. */

Would it be correct to change the documentation to say

The characters to include are determined by Unicode code point ordering.

as in the ICU documentation ... ?
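For anyone who wants to reproduce the check outside of R, here is a minimal sketch (not part of the original report) using the POSIX regcomp/regexec interface. It assumes the et_EE.UTF-8 locale is installed, and the outcome depends on which regex implementation the program is linked against (the system libc, or TRE's POSIX-compatible API):

```c
/* Minimal reproduction sketch (assumes the et_EE.UTF-8 locale is installed).
 * Link against the system regex or TRE's POSIX-compatible API as desired. */
#include <locale.h>
#include <regex.h>
#include <stdio.h>

int main(void)
{
    if (setlocale(LC_ALL, "et_EE.UTF-8") == NULL) {
        fprintf(stderr, "et_EE.UTF-8 locale not available\n");
        return 1;
    }

    regex_t re;
    if (regcomp(&re, "^[A-Z]$", REG_EXTENDED) != 0)
        return 1;

    /* With encoding-order ranges this is a match; with the documented
     * collation-order ranges (Estonian sorts Z before T) it should not be. */
    int rc = regexec(&re, "T", 0, NULL, 0);
    printf("\"T\" against [A-Z]: %s\n", rc == 0 ? "match" : "no match");

    regfree(&re);
    return 0;
}
```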

trushworth commented 1 year ago

Hi,

Reply is in-line below.

On Sat, Jun 03, 2023 at 08:33:26AM -0700, Ben Bolker wrote:

> The TRE documentation defines a range as
>
> Two characters separated by -. This is shorthand for the full range of characters between those two (inclusive) in the collating sequence.
>
> (here in the repository)
>
> However, testing with the Estonian locale (in R's imported version of TRE) shows that T is incorrectly matched by [A-Z] ... this comment says

This sort of problem shows up in the ordinary unix commands too. There was a recent mention on one of the FreeBSD lists that "find" with a range of "[A-Z]" used the collating order, so it matched the lowercase letters too, which would have surprised an old unix "find" user like me, who's used to that range following the ASCII encoding. It gets even more complicated when you start dealing with data containing multiple languages, because there doesn't seem to be a single collating order for all the languages at once. If you happen to know one, I need it in another project of mine :).

> /* XXX - Should use collation order instead of encoding values in character ranges. */
>
> Would it be correct to change the documentation to say
>
> The characters to include are determined by Unicode code point ordering.

I think that's the best answer for the short term. I'm unfortunately only literate in one language, so I hesitate to offer too strong an opinion on any collating or encoding issue.

> as in the ICU documentation ... ?

I'll add that to the list of easy-to-make changes I'm working on. I'm currently (June 2023) working through the PRs, but if I haven't done enough damage by the time I've sorted those out :) I'll see what I can make of the outstanding issues.


Cheers,

Tom Rushworth

bbolker commented 1 year ago

For what it's worth, there's an extensive Stack Overflow answer demonstrating that many of the regex implementations tested use locale-/collating-sequence-independent ranges (the only exceptions were some versions of grep and awk).

trushworth commented 1 year ago

Hi,

On Sun, Jun 04, 2023 at 02:03:02PM -0700, Ben Bolker wrote:

> For what it's worth there's an extensive Stack Overflow answer demonstrating that many tested regex implementations use locale-/collating-sequence-independent ranges (the only exceptions were some versions of grep and awk)

Ah, interesting. I think my unease with all this is that it makes regex matching locale-dependent, and not in a nice way. Consider the following:

1) You have some crowd-sourced data, such as tags for photos or music.

2) Because it is crowd-sourced, it can easily be in different languages, even if it is all stored in UTF-8 (or whatever), so some sort of locale information would have to be stored with each value.

3) Matching character classes like [:alpha:] correctly in that data means defining the classes differently for different pieces of data. You would have to provide a locale for the data at match time, not for the regex at compile time.

I suppose from an implementation viewpoint, the classes could be handled by adding a locale parameter to the regex match function and some sort of plugin accessed by the match function code. The problem is that then you need the plugins, and your nice clean regex library suddenly has a very large and somewhat loosely defined dependency.
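As a purely hypothetical sketch of the locale-parameter idea (the wrapper name regexec_l and its signature are invented here, not part of TRE's or POSIX's API), POSIX 2008 locale objects already allow switching the calling thread's locale around a single match:

```c
/* Hypothetical sketch: approximate a per-match locale by switching the
 * calling thread's locale around the regexec() call.  regexec_l is an
 * invented name, not part of TRE's or POSIX's API. */
#include <locale.h>
#include <regex.h>
#include <stddef.h>

int regexec_l(const regex_t *preg, const char *string,
              size_t nmatch, regmatch_t pmatch[], int eflags,
              const char *locale_name)
{
    locale_t loc = newlocale(LC_ALL_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return REG_BADPAT;          /* stand-in error code for the sketch */

    locale_t old = uselocale(loc);  /* switch this thread's locale */
    int rc = regexec(preg, string, nmatch, pmatch, eflags);
    uselocale(old);                 /* restore the previous locale */
    freelocale(loc);
    return rc;
}
```

The obvious limitation is that the pattern has already been compiled under some other locale, so only the match-time behaviour changes, which is exactly the compile-time versus match-time distinction above.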

I didn't look at the details of the language testing in that Stack Overflow discussion, but I didn't see any of them passing a locale at match time; it all seems to be passing or picking a default locale at compile time, so none of them are going to work properly on multilingual data :(.

As an alternative, I suppose we could hope that all Unicode codepoints have a language-independent classification. I don't know enough about Unicode to know whether that is realistic, but if it is, then a character class definition could be implemented that checks by codepoint value, independent of language. Then all we have to do is persuade the Unicode people to stop adding codepoints :).
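For what it's worth, Unicode does assign language-independent properties to every codepoint (the Alphabetic property, for example), and ICU exposes them from its C API; a minimal sketch, assuming ICU is installed:

```c
/* Sketch only: classify a few codepoints using Unicode's language-independent
 * Alphabetic property via ICU (compile with the icu-uc flags from pkg-config). */
#include <stdio.h>
#include <unicode/uchar.h>

int main(void)
{
    UChar32 samples[] = { 'T', 0x017D /* Ž */, 0x00D5 /* Õ */, '7' };

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        printf("U+%04X alphabetic: %s\n", (unsigned)samples[i],
               u_isalpha(samples[i]) ? "yes" : "no");
    }
    return 0;
}
```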

Just to add one final complication, how should these not-yet-perfectly-defined character classes work with approximate matching?

I do not think there really is an answer right now (2023), but I can hope for more multilingual programmers than I to come up with one :). Until then, I think documenting the issue (as you suggested) is the best we can do.


Cheers,

Tom Rushworth

bbolker commented 1 year ago

Honestly, I'm not too worried about this. The practical advice that's always given (and which I agree with) is "don't rely on ranges to identify alphabetic characters, use [:alpha:] instead". I think if you wanted to handle the crazy edge case of "find alphabetic characters that are between A and Z in my locale", you'd have to hack it yourself ... (the general advice is that if you want to match only uppercase ASCII letters reliably, you have to enumerate the class as [ABCDEFGHIJKLMNOPQRSTUVWXYZ]).
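As a hedged illustration of that advice (not from the thread), the sketch below compiles the three spellings with the POSIX regex interface and tries them against an ASCII letter and against Š under a UTF-8 locale; what [A-Z] itself does varies by implementation and locale, which is the whole point of this issue:

```c
/* Sketch: compare [A-Z], [[:alpha:]] and the enumerated ASCII class under
 * the caller's (assumed UTF-8) locale.  The enumerated class matches only
 * the 26 uppercase ASCII letters; [[:alpha:]] is locale/Unicode aware;
 * [A-Z] is whatever the linked regex implementation decides. */
#include <locale.h>
#include <regex.h>
#include <stdio.h>

static void try_pattern(const char *pattern, const char *subject)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        printf("%-36s failed to compile\n", pattern);
        return;
    }
    printf("%-36s vs \"%s\": %s\n", pattern, subject,
           regexec(&re, subject, 0, NULL, 0) == 0 ? "match" : "no match");
    regfree(&re);
}

int main(void)
{
    setlocale(LC_ALL, "");                      /* honour the caller's locale */

    const char *subjects[] = { "T", "\xC5\xA0" /* Š in UTF-8 */ };
    for (size_t i = 0; i < 2; i++) {
        try_pattern("^[A-Z]$", subjects[i]);
        try_pattern("^[[:alpha:]]$", subjects[i]);
        try_pattern("^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]$", subjects[i]);
    }
    return 0;
}
```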