Add a natural sort inside std/library

Alogani commented 4 months ago

Summary

Hello,

The standard library is laking a natural sort algorithm. Here is a possible implementation :

proc naturalSort*(l: openArray[string]): seq[string] =
    l.sorted(
        proc(a, b: string): int =
            var ai = 0
            var bi = 0
            while true:
                if ai > high(a) or bi > high(b):
                    return a.len() - ai - b.len() + bi
                if not (a[ai].isDigit() and b[bi].isDigit()):
                    let diff = cmp(a[ai], b[bi])
                    if diff != 0:
                        return diff
                    ai += 1; bi += 1
                else:
                    var
                        aNum: int
                        bNum: int
                    ai += parseInt(a[ai .. ^1], aNum)
                    bi += parseInt(b[bi .. ^1], bNum)
                    let diff = cmp(aNum, bNum)
                    if diff != 0:
                        return diff
    )

Thanks !

Description

Definition : In computing, natural sort order (or natural sorting) is the ordering of strings in alphabetical order, except that multi-digit numbers are treated atomically, i.e., as if they were a single character. Natural sort order has been promoted as being more human-friendly ("natural") than machine-oriented, pure alphabetical sort order.

It could be a nice addition, as it is can be quite common to use it for file versionning, reports, etc.

Alternatives

No response

Examples

No response

Backwards Compatibility

No response

Links

No response

johnnovak commented 4 months ago

+1, I was missing this too the other day. Please make sure to add both ASCII and UTF8 variants.

Alogani commented 4 months ago

Here is the utf8 version, of course much slower, because it implies some conversions (there is certainly a way to do parseInt without reconverting to string) :

import std/[algorithm, strutils, parseutils, unicode]

proc naturalSort*(l: openArray[seq[Rune]]): seq[seq[Rune]] =
    l.sorted(
        proc(a, b: seq[Rune]): int =
            var ai = 0
            var bi = 0
            while true:
                if ai > high(a) or bi > high(b):
                    return a.len() - ai - b.len() + bi
                if a[ai].isAlpha() or b[bi].isAlpha():
                    let diff = if a[ai] == b[bi]:
                            0
                        elif a[ai] <% b[bi]:
                            -1
                        else:
                            1
                    if diff != 0:
                        return diff
                    ai += 1; bi += 1
                else:
                    var
                        aNum: int
                        bNum: int
                    ai += parseInt($(a[ai .. ^1]), aNum)
                    bi += parseInt($(b[bi .. ^1]), bNum)
                    let diff = cmp(aNum, bNum)
                    if diff != 0:
                        return diff
    )

var l = @["d", "a", "cdrom1","cdrom10","cdrom102","cdrom11","cdrom2","cdrom20","cdrom3","cdrom30","cdrom4","cdrom40","cdrom100","cdrom101","cdrom103","cdrom110"]
var l2 = newSeq[seq[Rune]](l.len())
for i in 0..<l.len():
    l2[i] = l[i].toRunes()
echo l2.naturalSort()

johnnovak commented 4 months ago

Thanks @Alogani! I'm gonna copy/paste this into my project until it hits the stdlib 👍🏻

johnnovak commented 4 months ago

I've adjusted this to expose cmpNatural and cmpNaturalIgnoreCase as well. That's very handy when creating your own sort functions, so it would be nice to include them in the stdlib implementation.

So probably we'd need naturalSortIgnoreCase as well.

Update: This needs more work @Alogani. This input for example gets the Unicode variant into an infinite loop:

var b = @["!a", "[b"]

If you want to provide the final implementation, I suggest covering all scenarios, including edge cases, with unit tests.

Update 2:

This fix seems to do the trick. Just use if not (a[ai].isDigit and b[bi].isDigit) from the ASCII implementation instead of isAlpha, and here's the missing isDigit function for Rune:

proc isDigit*(r: Rune): bool =
  ord(r) >= ord('0') and ord(r) <= ord('9')

Alogani commented 4 months ago

Awesome, thanks for your addition ! I should have mentionned I was not as confident about my solution, because I don't work as much with utf8 (thinks for Nim developpers to stick with ascii !). I didn't take it seriously, because there is no reason ascii sort and utf8 sort produce different outputs.

But I accept the challenge :-)

Here is the complete code with some tests. And it doesn't it use anymore ascii parseInt ! (PS: thanks for your isDigit implementation, it was the inspiration needed for me to write the whole code !)

Old solution (see below for ignoreCase implementation)

[codeRemoved] I didn't test it with invalid utf8 strings, but I don't think it will raise. (Todo: add a test for that case.)

Update -> add IgnoreCaseImpl

import std/[algorithm, strutils, parseutils, unicode]

func cmpIgnoreCase(a, b: char): int =
    ord(toLowerAscii(a)) - ord(toLowerAscii(b))

func cmp(a, b: Rune): int =
    a.int - b.int

func cmpIgnoreCase(a, b: Rune): int =
    a.toLower().int - b.toLower().int

func naturalSortImpl(l: openArray[string], ignoreCase: bool): seq[string] =
    let comparator: proc(a, b: char): int = if ignoreCase: cmpIgnoreCase else: cmp
    l.sorted(
        proc(a, b: string): int =
            var ai = 0
            var bi = 0
            while true:
                if ai > high(a) or bi > high(b):
                    return a.len() - ai - b.len() + bi
                if not (a[ai].isDigit() and b[bi].isDigit()):
                    let diff = comparator(a[ai], b[bi])
                    if diff != 0:
                        return diff
                    ai += 1; bi += 1
                else:
                    var
                        aNum: int
                        bNum: int
                    ai += parseInt(a[ai .. ^1], aNum)
                    bi += parseInt(b[bi .. ^1], bNum)
                    let diff = cmp(aNum, bNum)
                    if diff != 0:
                        return diff
    )

func naturalSort*(l: openArray[string]): seq[string] =
    naturalSortImpl(l, false)

func naturalSortIgnoreCase*(l: openArray[string]): seq[string] =
    naturalSortImpl(l, true)

func isDigit*(r: Rune): bool =
  ## Naive algorithm, for more info see: https://github.com/nim-lang/Nim/issues/23462#issuecomment-2027971059 or #23465 
  ord(r) >= ord('0') and ord(r) <= ord('9')

proc integerOutOfRangeError() {.noinline.} =
  raise newException(ValueError, "Parsed integer outside of valid range")

proc rawParseInt(s: openArray[Rune], b: var BiggestInt): int =
    var
        sign: BiggestInt = -1
        i = 0
    if i < s.len:
        if s[i] == '+'.Rune: inc(i)
        elif s[i] == '-'.Rune:
            inc(i)
            sign = 1
    if i < s.len and (let codePoint = ord(s[i]); if codePoint < high(char).int: codePoint.char in {'0'..'9'} else: false):
        b = 0
        while i < s.len and (let codePoint = ord(s[i]); if codePoint < high(char).int: codePoint.char in {'0'..'9'} else: false):
            let c = ord(s[i]) - ord('0')
            if b >= (low(BiggestInt) + c) div 10:
                b = b * 10 - c
            else:
                integerOutOfRangeError()
            inc(i)
            while i < s.len and s[i] == '_'.Rune: inc(i) # underscores are allowed and ignored
        if sign == -1 and b == low(BiggestInt):
            integerOutOfRangeError()
        else:
            b = b * sign
            result = i

func naturalSortImpl*(l: openArray[seq[Rune]], ignoreCase: bool): seq[seq[Rune]] =
    let comparator: proc(a, b: Rune): int = if ignoreCase: cmpIgnoreCase else: cmp
    l.sorted(
        proc(a, b: seq[Rune]): int =
            var ai = 0
            var bi = 0
            while true:
                if ai > high(a) or bi > high(b):
                    return a.len() - ai - b.len() + bi
                if not(a[ai].isDigit() and b[bi].isDigit()):
                    let diff = comparator(a[ai], b[bi])
                    if diff != 0:
                        return diff
                    ai += 1; bi += 1
                else:
                    var
                        aNum: Biggestint
                        bNum: Biggestint
                    ai += rawParseInt(a[ai .. ^1], aNum)
                    bi += rawParseInt(b[bi .. ^1], bNum)
                    let diff = cmp(aNum, bNum)
                    if diff != 0:
                        return diff
    )

func naturalSort*(l: openArray[seq[Rune]]): seq[seq[Rune]] =
    naturalSortImpl(l, false)

func naturalSortIgnoreCase*(l: openArray[seq[Rune]]): seq[seq[Rune]] =
    naturalSortImpl(l, true)

import mylib/testing

testing():
    func toAsciiList(runesList: seq[seq[Rune]]): seq[string] =
        result = newSeqofCap[string](runesList.len())
        for sentence in runesList:
            result.add $sentence

    func toRuneList(strList: seq[string]): seq[seq[Rune]] =
        result = newSeqofCap[seq[Rune]](strList.len())
        for sentence in strList:
            result.add sentence.toRunes()

    func withRuneSort(strList: seq[string], ignoreCase: bool): seq[string] =
        if ignoreCase:
            strList.toRuneList().naturalSortIgnoreCase().toAsciiList()
        else:
            strList.toRuneList().naturalSort().toAsciiList()

    runTest("parseInt"):
        ## The resulting value is not tested, but if parsing size is the same, it should be ok
        var resInt: BiggestInt
        doAssert rawParseInt("12323dsdsd".toRunes(), resInt) == parseBiggestInt("12323dsdsd", resInt)
        doAssert rawParseInt("fdf+34".toRunes(), resInt) == parseBiggestInt("fdf+34", resInt)
        doAssert rawParseInt("-32lfd".toRunes(), resInt) == parseBiggestInt("-32lfd", resInt)
        doAssert rawParseInt("+3sd2lfd".toRunes(), resInt) == parseBiggestInt("+3sd2lfd", resInt)
    runTest("ascii sort"):
        var l_ascii = @["d", "a", "cdrom1","cdrom10","cdrom102","cdrom11","cdrom2","cdrom20","cdrom3","cdrom30","cdrom4","cdrom40","cdrom100","cdrom101","cdrom103","cdrom110"]
        var solution = @["a", "cdrom1", "cdrom2", "cdrom3", "cdrom4", "cdrom10", "cdrom11", "cdrom20", "cdrom30", "cdrom40", "cdrom100", "cdrom101", "cdrom102", "cdrom103", "cdrom110", "d"]
        let sortedList_ascii = l_ascii.naturalSort()
        doAssert sortedList_ascii == solution
        doAssert l_ascii.withRuneSort(false) == sortedList_ascii
    runTest("ascii sortInsenstive"):
        var l_ascii = @["d", "a", "cDrom1","cdrom10","cdrOm102","cDrom11","cdrom2","cdroM20","cdROM3","cdrom30","CDrom4","cdrom40","cdrom100","cdrom101","cdrom103","cdrom110"]
        var solution = @["a", "cDrom1", "cdrom2", "cdROM3", "CDrom4", "cdrom10", "cDrom11", "cdroM20", "cdrom30", "cdrom40", "cdrom100", "cdrom101", "cdrOm102", "cdrom103", "cdrom110", "d"]
        let sortedList_ascii = l_ascii.naturalSortIgnoreCase()
        doAssert sortedList_ascii == solution
        doAssert l_ascii.withRuneSort(true) == sortedList_ascii
    runTest("CornerCase1"):
        doAssert @["!a", "[b"].withRuneSort(false) == @["!a", "[b"]
    runTest("Special characters"):
        var strings = @["Héllo Wørld!", "Jåpånēsē 日本語", "Ça va bien.", "Привет, мир!", "Γειά σου κόσμε!", "مرحباً بالعالم!", "नमस्ते दुनिया!", "안녕하세요!", "שלום עולם!", "こんにちは世界!"]
        ## According to python:
        var solution = @["Héllo Wørld!", "Jåpånēsē 日本語", "Ça va bien.", "Γειά σου κόσμε!", "Привет, мир!", "שלום עולם!", "مرحباً بالعالم!", "नमस्ते दुनिया!", "こんにちは世界!", "안녕하세요!"]
        doAssert strings.withRuneSort(false) == solution
        doAssert strings.withRuneSort(true) == solution
        doAssert strings.naturalSort() == solution
    runTest("Special characters and nums"):
        var strings = @["日本10語", "日本1043語", "日本104343語", "日本2語", "日本20語"]
        var solution = @["日本2語", "日本10語", "日本20語", "日本1043語", "日本104343語"]
        doAssert strings.withRuneSort(false) == solution
        doAssert strings.withRuneSort(true) == solution

PS: I didn't use cmpIgnoreCase for seq[Rune] to avoid a useless conversion, but I reuse the same code. It doesn't seem system.cmp[char] makes a conversion to string, but I can't affirm it for sure.

johnnovak commented 4 months ago

Good stuff @Alogani, I will test this tomorrow. I raised a new ticket for adding isDigit for Rune, but looks like it's a bit more complicated to support it 100% correctly.

See here: https://github.com/nim-lang/Nim/issues/23465

But then, for 99% use cases I think my implementation does the job...

Alogani commented 4 months ago

I see that ticket thanks !

I still used isDigit in my code, because i don't think that special digits characters will really bother a natural sort algorithm (after all, is it really correct to include non standard digits in a natural sort ? I don't know if other language does)

To include it in the code, it would need both a isDigit(r: rune): bool and toDigit(r: rune): range[0..9] conversion funcs that works with non standards

Alogani commented 4 months ago

I added my code for non standards zeros here : https://github.com/nim-lang/Nim/issues/23465#issuecomment-2027970284

I have also modified the implementation to handle them :

proc rawParseInt(s: openArray[Rune], b: var BiggestInt): int =
    var
        sign: BiggestInt = -1
        i = 0
    if i < s.len:
        if s[i] == '+'.Rune: inc(i)
        elif s[i] == '-'.Rune:
            inc(i)
            sign = 1

    if i < s.len:
        b = 0
        while i < s.len and (let c = toDigitImpl(s[i]); c in {0..9}):
            if b >= (low(BiggestInt) + c) div 10:
                b = b * 10 - c
            else:
                integerOutOfRangeError()
            inc(i)
            while i < s.len and s[i] == '_'.Rune: inc(i) # underscores are allowed and ignored
        if sign == -1 and b == low(BiggestInt):
            integerOutOfRangeError()
        else:
            b = b * sign
            result = i

It needs some unit tests. There is also a different behaviour than parseInt, because in this version b is set to 0 even if no int is parsed (for efficicency reasons), contrary to original implementation.

johnnovak commented 4 months ago

Great progress @Alogani. I've reshuffled things slightly so the natural comparators are also exposed. I need that to build natural sorting into my more complex custom comparators that work on objects.

E.g. for the ASCII variant:

template cmpNaturalImpl(a, b: string, comparator: untyped): auto =
  var ai = 0
  var bi = 0
  while true:
    if ai > high(a) or bi > high(b):
      return a.len() - ai - b.len() + bi
    if not (a[ai].isDigit() and b[bi].isDigit()):
      let diff = comparator(a[ai], b[bi])
      if diff != 0:
        return diff
      ai += 1; bi += 1
    else:
      var
        aNum: int
        bNum: int
      ai += parseInt(a[ai .. ^1], aNum)
      bi += parseInt(b[bi .. ^1], bNum)
      let diff = cmp(aNum, bNum)
      if diff != 0:
        return diff

func cmpNatural*(a, b: string): int =
  cmpNaturalImpl(a, b, cmp)

func cmpNaturalIgnoreCase*(a, b: string): int =
  cmpNaturalImpl(a, b, cmpIgnoreCase)

func naturalSort*(l: openArray[string]): seq[string] =
  l.sorted(cmpNatural)

func naturalSortIgnoreCase*(l: openArray[string]): seq[string] =
  l.sorted(cmpNaturalIgnoreCase)

nim-lang / Nim