nitely / nim-regex

Pure Nim regex engine. Guarantees linear time matching
https://nitely.github.io/nim-regex/
MIT License

New API #111

Closed nitely closed 11 months ago

nitely commented 2 years ago

API spec:

func re2(s: string): Regex2
func re2(s: static string): static[Regex2]
func group(m: RegexMatch2; i: int): Slice[int]
func group(m: RegexMatch2; s: string): Slice[int]
func groupCount(m: RegexMatch2): int
func groupNames(m: RegexMatch2): seq[string]
func match(s: string; pattern: Regex2): bool
func match(s: string; pattern: Regex2; m: var RegexMatch2; start = 0): bool
[func,iterator] findAll(s: string; pattern: Regex2; start = 0): seq[RegexMatch2]
func find(s: string; pattern: Regex2; m: var RegexMatch2; start = 0): bool
[func,iterator] capture(s: string; pattern: Regex2): seq[string]
func contains(s: string; pattern: Regex2): bool
[func,iterator] split(s: string; sep: Regex2): seq[string]
[func,iterator] splitIncl(s: string; sep: Regex2): seq[string]
func startsWith(s: string; pattern: Regex2; start = 0): bool
func endsWith(s: string; pattern: Regex2): bool
func replace(s: string; pattern: Regex2; by: string; limit = 0): string
func replace(s: string; pattern: Regex2; by: proc (m: RegexMatch2; s: string): string; limit = 0): string
func isInitialized(re: Regex2): bool
func escapeRe(s: string): string
macro match(text: string; regex: RegexLit; body: untyped): untyped

The "Captures all group repetitions (not just the last one)" feature is removed; only the last repetition of each group is captured. This is a breaking change, and it will break some of the existing APIs. The rest of the APIs are deprecated or removed.
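To make the new capture behaviour concrete, here is a minimal sketch against the spec above. It assumes the `regex` package is installed and that `match` and `group` behave as listed; the sample text and the group name `ch` are only illustrative.

```nim
import regex

let text = "abc"
var m = RegexMatch2()

# The group repeats three times ("a", "b", "c"),
# but only the last repetition is kept.
doAssert match(text, re2"(?P<ch>\w)+", m)

# group returns a Slice[int] into the text, not a string,
# so the text is indexed with it to get the captured substring.
doAssert text[m.group(0)] == "c"
doAssert text[m.group("ch")] == "c"
```

Since `group` returns a single slice rather than a sequence of slices, code that relied on receiving every repetition needs to be changed.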

nitely commented 11 months ago

Changes to support both the old APIs and new APIs for a while:

nitely commented 11 months ago

#122 is merged

nitely commented 11 months ago

I don't think I've given the rationale for removing the "Captures all group repetitions (not just the last one)" feature anywhere, so I'll do it here.

In order to capture all of the repetitions in re"(\w)+", a full parse tree of submatch (capture group) boundaries needs to be generated. The tree is usually small, except when it's not. The main issue is that the space complexity is O(N*M), where N is the text length and M is the regex length. While this is not unbounded, it may be prohibitive, more so when matching untrusted text. Keeping only the last repetition of each submatch makes the space complexity O(N*M), where N is now the regex length and M is the number of submatches (both usually known at compile time), so it no longer depends on the text length.

Why not provide both options? It's a lot of additional complexity.

What if I need all captures? You can do what other languages do: match, and then findAll.
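A rough sketch of that two-step approach, assuming the `regex` package is installed and the `match`/`findAll` signatures from the spec above. The sample text, the patterns, and the `flags` variable are only illustrative.

```nim
import regex

let text = "nim c --styleCheck:hint --colors:off regex.nim"

# Step 1: validate the whole text. The repeated group keeps
# only its last repetition.
var m = RegexMatch2()
doAssert match(text, re2"nim c (?:--(\w+:\w+) *)+ (\w+)\.nim", m)
doAssert text[m.group(0)] == "colors:off"
doAssert text[m.group(1)] == "regex"

# Step 2: collect every repetition with findAll, using just
# the inner part of the pattern.
var flags: seq[string]
for fm in findAll(text, re2"--(\w+:\w+)"):
  flags.add text[fm.group(0)]
doAssert flags == @["styleCheck:hint", "colors:off"]
```

The second pass is only needed when every repetition matters, so the common case keeps the lower space bound.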