tchwork / utf8

Portable and performant UTF-8, Unicode and Grapheme Clusters for PHP
Apache License 2.0

strcmp should consider collation #6

Closed Hywan closed 11 years ago

Hywan commented 11 years ago

Hello,

In strcmp, you normalize strings, but that does not solve the problem of collation. For example, côte < coté in French, yet strcmp returns 88. The result differs when using Collator:

$c = new \Collator('fr_FR');
$c->setAttribute(\Collator::FRENCH_COLLATION, \Collator::ON);
var_dump(
    $c->compare('côte', 'coté')
);

But I can't figure out whether this is an issue or not: strcmp should normally perform a binary-safe comparison, yet you normalize the strings before comparing them, which looks like an attempt to solve this problem.

Thoughts?

nicolas-grekas commented 11 years ago

Collation support is out of scope for Patchwork UTF-8; as you figured out, you have to use Collator instead.

u::strcmp() is suitable for equality comparisons of NFC, NFD and non-normalized strings, but falls short in terms of inequalities. I can't think of any sorting method that is generic enough, so for now that's how it works. So, not a bug...
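
To illustrate the distinction, here is a minimal sketch that uses PHP's intl classes (Normalizer and Collator) directly. It is not Patchwork's actual implementation of u::strcmp(); it only assumes that normalizing both operands to one form before a byte-wise comparison is the general idea, and the variable names are made up for the example.

```php
<?php
// Assumption: the intl extension is loaded (provides Normalizer and Collator).

// The same visible word in two encodings.
$nfc = "cot\u{00E9}";   // "coté", é as a single precomposed code point (NFC)
$nfd = "cote\u{0301}";  // "coté", e followed by a combining acute accent (NFD)

// A plain byte-wise comparison treats them as different strings.
$binaryEqual = strcmp($nfc, $nfd) === 0;

// Normalizing both sides to the same form makes the equality test reliable.
$normalizedEqual = strcmp(
    Normalizer::normalize($nfc, Normalizer::FORM_C),
    Normalizer::normalize($nfd, Normalizer::FORM_C)
) === 0;

// Ordering depends on the locale, so it is delegated to Collator.
$collator = new Collator('fr_FR');
$collator->setAttribute(Collator::FRENCH_COLLATION, Collator::ON);
$words = ['coté', 'côte', 'cote', 'côté'];
$collator->sort($words); // sorts the array in place

var_dump($binaryEqual, $normalizedEqual);
print_r($words);
```

Normalization answers "are these the same string?", while the Collator answers "which one sorts first in this locale?", which is why the two are kept separate.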

But if you have any better idea, please tell me!

Thanks for reporting

nicolas-grekas commented 11 years ago

You could also use u::strnatcmp(), u::strcasecmp() or even u::strnatcasecmp() if you do not care about diacritics and/or case.

Hywan commented 11 years ago

There is no better solution than Collator + locale :-). So it's not an issue, thanks!