javve / natural-sort

Natural sort algorithm with unicode support written by Jim Palmer
40 stars 18 forks source link

Sorting incorrect when there is a space #7

Open 0xadri opened 9 years ago

0xadri commented 9 years ago

var stringAlphanumericalOne = [' ','1','a']; var stringAlphanumericalTwo = ['img ','img1','imga', 'imgz']; var stringAlphanumericalThree = ['img 99','img199','imga99', 'imgz99'];

console.log( stringAlphanumericalOne.sort(naturalSort) ); console.log( stringAlphanumericalTwo.sort(naturalSort) ); /* in the call below it fails! order should be ["img 99", "img199", "imga99", "imgz99"] */ console.log( stringAlphanumericalThree.sort(naturalSort) );

see live on http://jsbin.com/cipimosedoqe/1/edit?js,console

0xadri commented 9 years ago

Note that solving this might also solve this "list.js" issue :) https://github.com/javve/list.js/issues/185

0xadri commented 9 years ago

Bugs found

I'm playing further with this sorting algorithm and notice more strange "natural sorting order" choices.

see how the basic latin characters are sorted strangely on http://jsbin.com/cipimosedoqe/3/edit?js,console

A bunch of special characters appearing:

I would have expected special characters to all be "grouped" together in one place, except for the space special character maybe (which would always be the first character). That is, either all before numbers, or all between numbers and letters (lowercase & uppercase being "together" one after another), or all after letters.

The other javascript "natural string sort order" implementations

I ended up doing further research, investigating other implementation/algorithm of the natural string sort order.

My conclusion is that they all fail to provide a consistent order when I start adding barely unusual characters (ie. characters with diacritics or charcters such as dash, exclamation mark and so on).

Research on the custom implementations:

The browser's native "natural string sort order" implementations

The built-in localeCompare() method does a much better job at sorting, even international & special characters. The only problem using the localeCompare() method is that "the locale and sort order used are entirely implementation dependent". In other words, when using localeCompare such as stringOne.localeCompare(stringTwo): Firefox, Safari, Chrome & IE have a different sort order for Strings. see https://code.google.com/p/v8/issues/detail?id=459

Research on the browser-native implementations:

Difficulty of "string natural sorting order"

Implementing a solid algorithm (meaning: consistent but also covering a wide range of characters) is a very tough task. UTF8 contains more than 2000 characters & covers more than 120 scripts (languages). Finally, there are some specification for this tasks, it is called the "Unicode Collation Algorithm", which can be found at http://www.unicode.org/reports/tr10/ . You can find more information about this on this question I posted http://programmers.stackexchange.com/questions/257286/is-there-any-language-agnostic-specification-for-string-natural-sorting-order

Final conclusion

So considering the current level of support provided by the javascript custom implementations I came across, we will probably never see anything getting any close to supporting all this characters & scripts (languages). Hence I would rather use the browsers' native localeCompare() method. Yes, it does have the downside of beeing non-consistent across browsers but basic testing shows it covers a much wider range of characters, allowing solid & meaningful sort orders.

Further reading:

http://programmers.stackexchange.com/questions/257286/is-there-any-language-agnostic-specification-for-string-natural-sorting-order http://stackoverflow.com/questions/51165/how-do-you-do-string-comparison-in-javascript http://stackoverflow.com/questions/2802341/natural-sort-of-text-and-numbers-javascript http://stackoverflow.com/questions/4373018/sort-array-of-numeric-alphabetical-elements-natural-sort/ http://stackoverflow.com/questions/4340227/sort-mixed-alpha-numeric-array https://web.archive.org/web/20130929122019/http://my.opera.com/GreyWyvern/blog/show.dml/1671288 https://web.archive.org/web/20131005224909/http://www.davekoelle.com/alphanum.html http://snipplr.com/view/36012/javascript-natural-sort/ http://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

Mottie commented 9 years ago

The sorting of special characters follows the ASCII order of the characters, so characters will always be sorted in this order (space is first):

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

If you are looking to change the order of these characters, I would suggest trying a library like sugarjs which allows defining a AlphanumericSortOrder as follows:

// default setting
Array.AlphanumericSortOrder = 'AÁÀÂÃĄBCĆČÇDĎÐEÉÈĚÊËĘFGĞHıIÍÌİÎÏJKLŁMNŃŇÑOÓÒÔPQRŘSŚŠŞTŤUÚÙŮÛÜVWXYÝZŹŻŽÞÆŒØÕÅÄö'

So using this library to change the sort order from an ASCII string order, this will be the result (demo):

Array.AlphanumericSortIgnoreCase = false;
Array.AlphanumericSortOrder = 'AaÁáÀàÂâÃãĄąBbCcĆćČčÇçDdĎďÐðEeÉéÈèĚěÊêËëĘęFfGgĞğHhıIiÍíÌìİÎîÏïİJjKkLlŁłMmNnŃńŇňÑñOoÓóÒòÔôPpQqRrŘřSsŚśŠšŞşTtŤťUuÚúÙùŮůÛûÜüVvWwXxYyÝýZŹźŻżŽžÞþÆ挜ØøÕõÅåÄäöö0123456789 !"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~';

// "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ0123456789!\"#$%&'()*+,-./:;<=>?@[\]^_`z{|}~"
0xadri commented 9 years ago

"The sorting of special characters follows the ASCII order of the characters": what are you refering to? this very library (javve's natural-sort) ?

If you refer to this javve's natural-sort implementation then you can see in this live demo that it does not follow this order : http://jsbin.com/voyidemuwicu/1/edit?js,console

"If you are looking to change the order of these characters, I would suggest trying a library like sugarjs" : thank you, it looks interesting. I will test it & try it when I have time.

Mottie commented 9 years ago

what are you refering to? this very library (javve's natural-sort) ?

Most sort algorithms fall back to use basic string comparisons which use unicode/ASCII values. For example:

"a".charCodeAt(0) // 97 (decimal value)
"A".charCodeAt(0) // 65 (decimal value)
"a" > "A"  // true

If you look at the wikipedia ASCII table I linked to you'll see the above values.

If you refer to this javve's natural-sort implementation then you can see in this live demo that it does not follow this order : http://jsbin.com/voyidemuwicu/1/edit?js,console

Yes, it does follow the ASCII order. Please look again.

0xadri commented 9 years ago

"Most sort algorithms fall back to use basic string comparisons which use unicode/ASCII values."

Alright, I understand what you mean now.

"Yes, it does follow the ASCII order. Please look again."

When I ran the code on jsbin (I'm on Chrome 37), it outputs the following: [" ", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "[", "\", "]", "^", "_", "`", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "{", "|", "}"]

Whereas the ASCII sort order seems to be (taking the sort order you pasted) !"#$%&'()*+,-./0123456789:;=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~</code

So maybe I am misunderstanding somethiing again?

Mottie commented 9 years ago

Sorry, you are right. That order is different because of the natural sort code. The following line forces numbers to have a lower value than non-numeric strings.

// handle numeric vs string comparison - number < string - (Kyle Adams)
if (isNaN(oFxNcL) !== isNaN(oFyNcL)) { return (isNaN(oFxNcL)) ? 1 : -1; }

If you shift the "number" results to be after the / then the result will be in ACSII order.

Look at the result when you include a basic sort in that demo.

0xadri commented 9 years ago

Interesting :) Well at least that'll please us (developers) to be able to get a nice ASCII sort order.

However, as explained in this article by Jeff Atwood (the founder of the StackExchange network), this might not be good enough for your user http://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

Plenty of further information on this subject on this Programmers.Stackexchange's question I asked: http://programmers.stackexchange.com/questions/257286/is-there-any-language-agnostic-specification-for-string-natural-sorting-order