tompazourek / NaturalSort.Extension

🔀 Extension method for StringComparison that adds support for natural sorting (e.g. "abc1", "abc2", "abc10" instead of "abc1", "abc10", "abc2").
MIT License

Add support for non-ASCII digits #74

Closed siegfriedpammer closed 5 months ago

siegfriedpammer commented 5 months ago

https://github.com/tompazourek/NaturalSort.Extension/blob/6ec645df09b2ed8eda02be91e2232eee6fb44243/src/NaturalSort.Extension/NaturalSortComparer.cs#L179-L181

There are many more Unicode code points that can be used as digits, as listed here: https://www.compart.com/en/unicode/category/Nd. Each of these has a numeric value assigned; for example, https://www.compart.com/en/unicode/U+0A68 (੨) has the value 2.

I suggest using char.IsDigit instead to handle this correctly.

See https://github.com/christophwille/poc-oh/blob/main/src/NaturalSortTests/Program.cs for a comparison with StrCmpLogicalW:

Input: A, A10, A11, Z, A੨, A੨੨
NaturalSort.Extension: A, A੨, A੨੨, A10, A11, Z
StrCmpLogicalW: A, A੨, A10, A11, A੨੨, Z

The sort order of StrCmpLogicalW makes perfect sense if you replace ੨ with 2.
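
The difference between an ASCII-only check and char.IsDigit can be sketched as follows. This is only an illustration: `DigitCheckSketch`, `IsAsciiDigit`, `IsUnicodeDigit` and `DecimalValue` are hypothetical names, and the ASCII check is an assumption about what a 0-9 range test typically looks like, not a copy of the linked lines.

```csharp
using System;
using System.Globalization;

public static class DigitCheckSketch
{
    // Hypothetical ASCII-only check: accepts '0'..'9' and nothing else.
    public static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    // char.IsDigit is true for every character in Unicode category Nd
    // (decimal digit number) that fits in a single UTF-16 code unit.
    public static bool IsUnicodeDigit(char c) => char.IsDigit(c);

    // Returns the digit's value (0-9), or -1 if the character is not a decimal digit.
    public static int DecimalValue(char c) => CharUnicodeInfo.GetDecimalDigitValue(c);
}
```

For '\u0A68' (੨), `IsAsciiDigit` returns false, while `IsUnicodeDigit` returns true and `DecimalValue` returns 2. Note that char.IsDigit(char) only sees one UTF-16 code unit at a time, so digits outside the Basic Multilingual Plane would need the string/index overloads.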

tompazourek commented 5 months ago

Good point, thanks for contributing.

If I support these Unicode digits, I'll need to find some way to compare the string segments that are composed of them, essentially "parsing" the Unicode digits into numbers and comparing those. Currently, since I only consider 0-9, the comparison is trivial and fast, and no number parsing occurs at all. I'm not sure whether the current simple comparison of digit values would work well enough, but I suppose it might work better than just treating Unicode digits as "other characters".
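
To make the "parsing" idea concrete, here is a minimal sketch of turning a run of Unicode digits into a number and comparing two runs by value. It is not the library's code: `DigitRunParsing`, `ParseDigitRun` and `CompareDigitRuns` are hypothetical names, and BigInteger is used only so that arbitrarily long runs cannot overflow.

```csharp
using System;
using System.Globalization;
using System.Numerics;

public static class DigitRunParsing
{
    // Parses a run consisting only of decimal digits (ASCII or other Nd characters).
    public static BigInteger ParseDigitRun(string run)
    {
        BigInteger value = BigInteger.Zero;
        foreach (char c in run)
        {
            // -1 means the character has no decimal digit value.
            int digit = CharUnicodeInfo.GetDecimalDigitValue(c);
            if (digit < 0)
                throw new FormatException($"'{c}' is not a decimal digit.");
            value = value * 10 + digit;
        }
        return value;
    }

    // Compares two digit runs by numeric value, regardless of script.
    public static int CompareDigitRuns(string x, string y) =>
        ParseDigitRun(x).CompareTo(ParseDigitRun(y));
}
```

With this sketch, "੨੨" parses to 22 and compares as greater than "10", and "੨" compares as equal to "2". The obvious downside is the extra work per comparison, which is exactly the cost the fast 0-9 path avoids.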

I like your example comparing results to StrCmpLogicalW. I think this sort of comparison would be useful to add to the tests.

tompazourek commented 5 months ago

I see that the Windows comparison treats ੨ as something between 2 and 3. It would be interesting to find some simple mechanism that would let me do the same thing fast:

A, A2, A੨, A3, A10, A11, A22, A੨੨, A33, Z
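
One possible mechanism that reproduces the order above without parsing: compare digit runs by significant length and then digit by digit, and fall back to an ordinal comparison of the raw characters only when the numeric values are equal. This is a sketch under assumptions, not a description of how StrCmpLogicalW works internally; `UnicodeDigitTieBreakComparer` and its helpers are hypothetical names, and non-digit characters are simply compared ordinally.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class UnicodeDigitTieBreakComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0;
        if (x is null) return -1;
        if (y is null) return 1;

        int i = 0, j = 0, tieBreak = 0;
        while (i < x.Length && j < y.Length)
        {
            if (char.IsDigit(x[i]) && char.IsDigit(y[j]))
            {
                int iEnd = RunEnd(x, i), jEnd = RunEnd(y, j);
                int numeric = CompareRuns(x, i, iEnd, y, j, jEnd);
                if (numeric != 0) return numeric;

                // Same numeric value: remember the first raw difference so that
                // "2" sorts before "੨", but only use it if nothing else differs.
                if (tieBreak == 0)
                    tieBreak = Math.Sign(string.CompareOrdinal(
                        x.Substring(i, iEnd - i), y.Substring(j, jEnd - j)));
                i = iEnd;
                j = jEnd;
            }
            else
            {
                int diff = x[i].CompareTo(y[j]); // ordinal, for simplicity
                if (diff != 0) return Math.Sign(diff);
                i++;
                j++;
            }
        }

        // The string with characters left over sorts last; otherwise use the tie-break.
        int remaining = (x.Length - i) - (y.Length - j);
        return remaining != 0 ? Math.Sign(remaining) : tieBreak;
    }

    private static int RunEnd(string s, int start)
    {
        int end = start;
        while (end < s.Length && char.IsDigit(s[end])) end++;
        return end;
    }

    private static int CompareRuns(string x, int i, int iEnd, string y, int j, int jEnd)
    {
        // Skip leading zeros, keeping at least one digit per run.
        while (i < iEnd - 1 && Digit(x[i]) == 0) i++;
        while (j < jEnd - 1 && Digit(y[j]) == 0) j++;

        // More significant digits means a larger number.
        int lengthDiff = (iEnd - i) - (jEnd - j);
        if (lengthDiff != 0) return Math.Sign(lengthDiff);

        for (; i < iEnd; i++, j++)
        {
            int diff = Digit(x[i]) - Digit(y[j]);
            if (diff != 0) return Math.Sign(diff);
        }
        return 0;
    }

    private static int Digit(char c) => CharUnicodeInfo.GetDecimalDigitValue(c);
}
```

Sorting the ten strings above with this comparer, in any starting order, yields A, A2, A੨, A3, A10, A11, A22, A੨੨, A33, Z. The Substring calls allocate and are only reached when two runs have the same numeric value; a real implementation would likely avoid them.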

christophwille commented 5 months ago

> I like your example comparing results to StrCmpLogicalW. I think these sort of comparisons would be useful to add into tests.

Feel free to do so. The code is from https://github.com/icsharpcode/ILSpy/blob/master/ILSpy/TreeNodes/NaturalStringComparer.cs; we were looking for ways to stop using a native import. That is when we realized: "Wait, Unicode is more than 0-9".

tompazourek commented 5 months ago

This is now implemented in https://github.com/tompazourek/NaturalSort.Extension/commit/c3038964f29e217b0f9469f2cf029f1eb69a3572

It is released as version 4.3.0 (https://github.com/tompazourek/NaturalSort.Extension/releases/tag/4.3.0)

In case you find discrepancies, please file new issues.

Thank you again for contributing this idea; it wouldn't have happened without you.