Result of pegs.split - Githubissues

dinau commented 1 year ago

Description

I've found unexpected behavior in the pegs.split() proc compared with other regex libraries, as follows:

import std/[pegs,re]
import regex

const str = "    abcd"

# PEG
assert str.split(peg"\s+") == @["abcd"]
# re,regex
const res1 = @["", "abcd"]
assert str.split(re.re"\s+") == res1
assert str.split(regex.re"\s+") == res1

# PEG
assert str.split(peg"\s") == @[" ", " abcd"]
# re,regex
const res2 = @["", "", "", "", "abcd"]
assert str.split(re.re"\s") == res2
assert str.split(regex.re"\s") == res2

Are these differences a bug or the specified behavior?
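
One way the two PEG results above could arise is a scan loop that steps one character past the start of each field before it tests for the next separator, so a separator sitting at the very start of a field is never seen. The sketch below is a hypothetical model of that behavior: the name `modelSplit` and the loop structure are assumptions, not code taken from std/pegs. It only relies on `matchLen` from std/pegs and reproduces both outputs reported here.

```nim
import std/pegs

proc modelSplit(s: string, sep: Peg): seq[string] =
  ## Hypothetical model of the reported behavior, NOT the std/pegs source.
  var first = 0
  var last = 0
  while last < s.len:
    # Skip a single separator match at the current position, if any.
    let sepLen = matchLen(s, sep, last)
    if sepLen > 0: inc(last, sepLen)
    first = last
    while last < s.len:
      # Assumed flaw: `last` advances before the test, so a separator
      # starting exactly at `first` ends up inside the field.
      inc last
      if matchLen(s, sep, last) > 0: break
    if first < last:
      result.add s[first ..< last]

echo modelSplit("    abcd", peg"\s+")
echo modelSplit("    abcd", peg"\s")
```

If this model is close to the real cause, testing for a separator at `first` before advancing would be the place to look.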

Nim Version

Current Output

No assertion error.

Expected Output

Assertion error at the pegs.split proc?

Possible Solution

No response

Additional Information

It seems that the same behavior has occurred since nim-0.19.6.

lilkeet commented 1 year ago

"    abcd".split(peg"\s+") == @["abcd"]

is an improvement upon regex imo. im not sure why the regex wouldn't consume the entirety of the spaces.

encountering multiple occurrences of ur separator leaves the exact course of action up to the user, but

"    abcd".split(peg"\s") == @[" ", " abcd"]

is definitely incorrect lol. we can do better. I rewrote a more correct pegs.split function below:

import pegs

iterator mySplit(s: string, sep: Peg): string =
  ## Splits the string `s` into substrings.
  ##
  ## Substrings are separated by the PEG `sep`.
  ## Examples:
  ##
  ## .. code-block:: nim
  ##   for word in split("00232this02939is39an22example111", peg"\d+"):
  ##     writeLine(stdout, word)
  ##
  ## Results in:
  ##
  ## .. code-block:: nim
  ##   "this"
  ##   "is"
  ##   "an"
  ##   "example"
  ##

  func usefulMatch(s: string): Natural =
    # rawMatch returns -1 on a failed match; map that to 0 so the result
    # can be used directly as a slice offset.
    # Returns the length of the `sep` match at the start of `s`.
    var c: Captures
    let matchLen = rawMatch(s, sep, 0, c)
    return if matchLen == -1: 0
           else:              matchLen

  func nextMatch(s: string): Natural =
    # Returns the distance from the start of `s` to the next `sep` match,
    # or `s.len` if there is none.
    result = 0
    var c: Captures

    while rawMatch(s, sep, result, c) == -1 and result < s.len:
      inc result

  var holder = s

  while holder.len != 0:
    let matchLen = usefulMatch holder
    holder = holder[matchLen..^1]

    let notMatchLen = nextMatch holder

    yield holder[0..notMatchLen-1]

    holder = holder[notMatchLen..^1]

func mySplit(s: string, sep: Peg): seq[string] =
  for nonmatching in s.mySplit(sep):
    result.add nonmatching

assert "    abcd".mySplit(peg"\s+") == @["abcd"]
assert "    abcd".mySplit(peg"\s") == @["", "", "", "abcd"]

The second case produces three empty strings instead of regex's four. Should it get four? I'm not sure. Oh and I'd bet that my function is 10x slower than the original so optimize before throwing it into production :)
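
On the question of three versus four empty strings: if every separator match is treated as a boundary between two fields, as the `re` and `regex` results in the original report suggest, then four leading `\s` matches produce four empty fields. A minimal sketch under that assumption follows. `splitKeepEmpty` is a hypothetical name, and zero-length separator matches are simply ignored here so the loop cannot stall.

```nim
import std/pegs

proc splitKeepEmpty(s: string, sep: Peg): seq[string] =
  ## Hypothetical helper: every separator match closes the current field,
  ## even when that field is empty.
  var fieldStart = 0
  var pos = 0
  while pos < s.len:
    let sepLen = matchLen(s, sep, pos)
    if sepLen > 0:
      # Close the field that ends where this separator begins.
      result.add s[fieldStart ..< pos]
      pos += sepLen
      fieldStart = pos
    else:
      inc pos
  # The text after the last separator is the final field
  # (empty when the input ends with a separator).
  result.add s[fieldStart ..< s.len]

echo splitKeepEmpty("    abcd", peg"\s+")
echo splitKeepEmpty("    abcd", peg"\s")
```

Under this rule the count of empty fields always equals the number of adjacent separator matches, which is where the fourth empty string comes from.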

lilkeet commented 1 year ago

add the line

if holder.len == 0: break

above the yield statement if u want to be consistent on the front and back ends of the string.
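
For reference, the same idea with that check applied can be written over indices instead of repeatedly slicing `holder`, which avoids copying the rest of the string on every step. This is a sketch only: `splitNoEdges` is a hypothetical name, it uses `matchLen` from std/pegs, and unlike the version above it treats zero-length separator matches as no match.

```nim
import std/pegs

proc splitNoEdges(s: string, sep: Peg): seq[string] =
  ## Hypothetical index-based variant of mySplit with the extra `break`.
  var pos = 0
  while pos < s.len:
    # Skip one separator match at the current position, if any.
    let sepLen = matchLen(s, sep, pos)
    if sepLen > 0:
      pos += sepLen
    # Stop here rather than emit an empty string for a trailing separator.
    if pos >= s.len: break
    # Scan forward to the start of the next separator match.
    var stop = pos
    while stop < s.len and matchLen(s, sep, stop) <= 0:
      inc stop
    result.add s[pos ..< stop]
    pos = stop

echo splitNoEdges("    abcd", peg"\s+")
echo splitNoEdges("abcd    ", peg"\s+")
```

With `\s+` neither leading nor trailing whitespace produces an empty entry, so both ends of the string are handled the same way.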

nim-lang / Nim