nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
https://nim-lang.org

Result of pegs.split #21213

Open dinau opened 1 year ago

dinau commented 1 year ago

Description

I've found unexpected behavior in the pegs.split() proc compared with other regex libraries, as follows:

import std/[pegs,re]
import regex

const str = "    abcd"

# PEG
assert str.split(peg"\s+") == @["abcd"]
# re,regex
const res1 = @["", "abcd"]
assert str.split(re.re"\s+") == res1
assert str.split(regex.re"\s+") == res1

# PEG
assert str.split(peg"\s") == @[" ", " abcd"]
# re,regex
const res2 = @["", "", "", "", "abcd"]
assert str.split(re.re"\s") == res2
assert str.split(regex.re"\s") == res2

Are these differences a bug or the specified behavior?

Nim Version

Nim Compiler Version 1.6.10 [Windows: i386] Compiled at 2022-11-21 Copyright (c) 2006-2021 by Andreas Rumpf

Current Output

No assertion error.

Expected Output

An assertion error at the pegs.split proc?

Possible Solution

No response

Additional Information

The same behavior seems to have occurred since nim-0.19.6.

lilkeet commented 1 year ago
"    abcd".split(peg"\s+") == @["abcd"]

is an improvement upon regex imo. i'm not sure why the regex wouldn't consume the entirety of the spaces.
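For what it's worth, the regex result in the report does consume the whole run of spaces; the leading "" is the (empty) text in front of the first separator. The two conventions can be seen side by side in std/strutils, with no pegs or regex involved. This is only a small illustration using the `str` value from the report:

```nim
import std/strutils

const str = "    abcd"

# Convention 1: every separator occurrence closes a field, so four leading
# spaces produce four empty fields in front of "abcd".
doAssert str.split(' ') == @["", "", "", "", "abcd"]

# Convention 2: leading whitespace is skipped instead of producing
# empty fields.
doAssert str.splitWhitespace() == @["abcd"]
```

The re/regex results follow the first convention, while pegs.split with `\s+` behaves like the second.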

encountering multiple occurrences of ur separator leaves the exact course of action up to the user, but

"    abcd".split(peg"\s") == @[" ", " abcd"]

is definitely incorrect lol. we can do better. I rewrote a more correct pegs.split function below:

import pegs

iterator mySplit(s: string, sep: Peg): string =
  ## Splits the string `s` into substrings.
  ##
  ## Substrings are separated by the PEG `sep`.
  ## Examples:
  ##
  ## .. code-block:: nim
  ##   for word in split("00232this02939is39an22example111", peg"\d+"):
  ##     writeLine(stdout, word)
  ##
  ## Results in:
  ##
  ## .. code-block:: nim
  ##   "this"
  ##   "is"
  ##   "an"
  ##   "example"
  ##

  func usefulMatch(s: string): Natural =
    # rawMatch normally returns -1 on a failed match, but we would rather
    # it return a 0 since we can't have an index of -1.
    # Returns the distance to the end of the `sep` peg.
    var c: Captures
    let matchLen = rawMatch(s, sep, 0, c)
    return if matchLen == -1: 0
           else:              matchLen

  func nextMatch(s: string): Natural =
    # Returns the distance from the start of `s` to the next match of `sep`,
    # or `s.len` if `sep` does not match anywhere.
    result = 0
    var c: Captures

    while rawMatch(s, sep, result, c) == -1 and result < s.len:
      inc result

  var holder = s

  while holder.len != 0:
    let matchLen = usefulMatch holder
    holder = holder[matchLen..^1]

    let notMatchLen = nextMatch holder

    yield holder[0..notMatchLen-1]

    holder = holder[notMatchLen..^1]

func mySplit(s: string, sep: Peg): seq[string] =
  for nonmatching in s.mySplit(sep):
    result.add nonmatching

assert "    abcd".mySplit(peg"\s+") == @["abcd"]
assert "    abcd".mySplit(peg"\s") == @["", "", "", "abcd"]

The second case produces three empty strings instead of regex's four. Should it get four? I'm not sure. Oh and I'd bet that my function is 10x slower than the original so optimize before throwing it into production :)
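Whether four is "right" depends on which convention you want. If the goal is to agree with re/regex, a minimal sketch could look like the following. It assumes `matchLen` from std/pegs (which returns -1 when there is no match at the given position); `splitLikeRegex` is a hypothetical name, not something in the stdlib, and zero-length matches are simply skipped rather than treated as separators:

```nim
import std/pegs

proc splitLikeRegex(s: string, sep: Peg): seq[string] =
  ## Hypothetical helper: every match of `sep` closes the current field,
  ## even when that field is empty (regex-style splitting).
  var fieldStart = 0   # index where the current field begins
  var i = 0            # current scan position
  while i < s.len:
    let n = s.matchLen(sep, i)   # -1 means no match at position i
    if n > 0:
      # substr returns "" when the range is empty (i == fieldStart)
      result.add s.substr(fieldStart, i - 1)
      i += n
      fieldStart = i
    else:
      inc i
  # whatever follows the last separator, possibly ""
  result.add s.substr(fieldStart)

doAssert "    abcd".splitLikeRegex(peg"\s+") == @["", "abcd"]
doAssert "    abcd".splitLikeRegex(peg"\s") == @["", "", "", "", "abcd"]
```

Both asserts reproduce the `res1` and `res2` values from the original report. It scans character by character, so it is not tuned for speed either.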

lilkeet commented 1 year ago

add the line

if holder.len == 0: break

above the yield statement if u want to be consistent on the front and back ends of the string.