Closed siegfriedpammer closed 5 months ago
Good point, thanks for contributing.
If I use these Unicode digits, I'll need to find a way to compare the string segments that are composed of them, essentially "parsing" the Unicode digits into numbers and comparing those. Currently, since I only consider 0-9, the comparison is trivial and fast, and no number parsing occurs at all. I'm not sure whether the current simple comparison of digit values would work well enough for Unicode digits, but I suppose it might work better than just treating them as "other characters".
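One way to compare digit runs by value without ever building a number can be sketched as follows. This is a Python illustration only, not the code used by NaturalSort.Extension, and the helper names are hypothetical. It assumes both inputs consist solely of Unicode decimal digits: map each character to its digit value, drop leading zeros, let the longer run win, and otherwise compare digit by digit.

```python
import unicodedata

def _digit_values(run):
    """Map a run of Unicode decimal digits to values 0-9, minus leading zeros."""
    # unicodedata.decimal raises ValueError for non-digits; callers are
    # assumed to pass digit runs only.
    values = [unicodedata.decimal(ch) for ch in run]
    start = 0
    # Keep at least one digit so that "0" and "000" both become [0].
    while start < len(values) - 1 and values[start] == 0:
        start += 1
    return values[start:]

def compare_digit_runs(a, b):
    """Return <0, 0, or >0 depending on the numeric order of two digit runs."""
    da, db = _digit_values(a), _digit_values(b)
    if len(da) != len(db):
        # Without leading zeros, the longer run is the larger number.
        return len(da) - len(db)
    # Same length: lexicographic order of digit values equals numeric order.
    return (da > db) - (da < db)
```

Because no integer is built, the comparison cannot overflow on very long digit runs.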
I like your example comparing results to `StrCmpLogicalW`. I think these sorts of comparisons would be useful to add to the tests.
I see that the Windows compare treats ੨ as something between 2 and 3. It would be interesting to find some simple mechanism that will let me do the same thing fast:
A A2 A੨ A3 A10 A11 A22 A੨੨ A33 Z
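The ordering above can be reproduced by sorting digit runs by numeric value first and breaking ties on the original characters, so that ੨ lands after 2 but before 3. The sketch below is in Python and rests on that assumption: the tie-break rule is a guess that matches this one list, not a description of how `StrCmpLogicalW` works internally, and `natural_key` is a hypothetical name.

```python
import re

# In Python 3 str patterns, \d matches any Unicode decimal digit (category Nd).
_DIGIT_RUN = re.compile(r"(\d+)")

def natural_key(text):
    """Build a sort key: text chunks compare as strings, digit runs as numbers."""
    key = []
    # With a capturing group, re.split puts the digit runs at odd indices.
    for index, part in enumerate(_DIGIT_RUN.split(text)):
        if index % 2 == 1:
            # int() accepts any Unicode decimal digits; the raw text breaks ties.
            # The leading 0 sorts digit runs before text at the same position.
            key.append((0, int(part), part))
        elif part:
            key.append((1, part))
    return key

names = ["Z", "A33", "A੨੨", "A22", "A11", "A10", "A3", "A੨", "A2", "A"]
print(" ".join(sorted(names, key=natural_key)))
# A A2 A੨ A3 A10 A11 A22 A੨੨ A33 Z
```

Building a key per string is convenient for a demonstration but allocates; a comparer that walks both strings in place would be closer to the "fast" requirement.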
> I like your example comparing results to `StrCmpLogicalW`. I think these sort of comparisons would be useful to add into tests.
Feel free to do so. The code is from https://github.com/icsharpcode/ILSpy/blob/master/ILSpy/TreeNodes/NaturalStringComparer.cs - we were looking for options to no longer use a native import. That is when we realized, "Wait, Unicode is more than 0-9".
This is now implemented in https://github.com/tompazourek/NaturalSort.Extension/commit/c3038964f29e217b0f9469f2cf029f1eb69a3572
It is released as version 4.3.0 (https://github.com/tompazourek/NaturalSort.Extension/releases/tag/4.3.0)
In case you find discrepancies, please file new issues.
Thank you again for contributing this idea; it wouldn't have happened without you.
https://github.com/tompazourek/NaturalSort.Extension/blob/6ec645df09b2ed8eda02be91e2232eee6fb44243/src/NaturalSort.Extension/NaturalSortComparer.cs#L179-L181
Many more Unicode code points can be used as digits; see https://www.compart.com/en/unicode/category/Nd for the full list. Each of these has a numeric value assigned; for example, https://www.compart.com/en/unicode/U+0A68 (੨) has the value 2.
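The digit values can be looked up programmatically. Here is a small Python sketch using the standard `unicodedata` module; the function name is hypothetical. In .NET, `char.IsDigit` together with `CharUnicodeInfo.GetDecimalDigitValue` should play a similar role.

```python
import unicodedata

def decimal_digit_value(ch):
    """Return 0-9 for a Unicode decimal digit (category Nd), or None otherwise."""
    if unicodedata.category(ch) == "Nd":
        return unicodedata.decimal(ch)
    return None

print(decimal_digit_value("੨"))  # GURMUKHI DIGIT TWO (U+0A68): 2
print(decimal_digit_value("2"))  # ASCII digit: 2
print(decimal_digit_value("A"))  # not a digit: None
```

Characters such as superscript two (U+00B2) belong to category No rather than Nd, so this helper returns `None` for them.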
I suggest using `char.IsDigit` instead to handle this correctly.
See https://github.com/christophwille/poc-oh/blob/main/src/NaturalSortTests/Program.cs for a comparison with `StrCmpLogicalW`:
Input: A, A10, A11, Z, A੨, A੨੨
NaturalSort.Extensions: A, A੨, A੨੨, A10, A11, Z
StrCmpLogicalW: A, A੨, A10, A11, A੨੨, Z
The sort order of `StrCmpLogicalW` makes perfect sense if you replace ੨ with 2.
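That claim can be checked by literally replacing each Unicode digit with its ASCII counterpart and then applying an ordinary ASCII-only natural sort. This is a Python sketch with hypothetical helper names; it shows only that the normalized order matches the `StrCmpLogicalW` output listed above, not how Windows computes it.

```python
import re
import unicodedata

def to_ascii_digits(text):
    """Replace every Unicode decimal digit (category Nd) with its ASCII counterpart."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

def ascii_natural_key(text):
    """Plain natural-sort key; assumes the input contains ASCII digits only."""
    parts = re.split(r"([0-9]+)", text)
    # Odd indices hold the digit runs captured by the group.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ["A", "A10", "A11", "Z", "A੨", "A੨੨"]
ordered = sorted(names, key=lambda s: ascii_natural_key(to_ascii_digits(s)))
print(", ".join(ordered))  # A, A੨, A10, A11, A੨੨, Z
```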