add fn:match-groups() function

liamquin commented 2 months ago

Summary:

Please add a function that works like fn:matches() except that it returns a sequence corresponding to the capture groups in the regular expression.

Rationale:

One use of regular expressions is to pick items out of text. The current way to do that is to use a regular expression like ^.*(stuff i want).*$ and to replace this with "$1", and repeat for each capture group in the expression ($2, $3, and so on). This is hard to maintain and read, and of course for N capture groups it's N times slower than one might want.
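To make the cost concrete, here is a minimal sketch in Python (chosen only for illustration; the sample text and pattern are made up, and Python's regex dialect differs from the XPath one). It contrasts the one-replace-per-group workaround with a single match that hands back every group at once.

```python
import re

text = "It was in the January of 1836 that she set out."
pattern = r"^.*(January|February|March).*(\d\d\d\d).*$"

# Workaround: one full regex pass per capture group.
month = re.sub(pattern, r"\1", text)
year = re.sub(pattern, r"\2", text)

# Single pass: every capture group comes back from one match.
groups = re.search(pattern, text).groups()
```

The two approaches give the same strings, but the first one re-runs the whole expression once per group.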

Other possible designs:

(1) return an array, to allow for the possibility of repeated capture groups, e.g. (abc)+, each producing a sequence

(2) return a map with integer keys, where 0 is the matched text, 1 is the first capture group, and so on. This has the advantage that we could add named capture groups, e.g. (?P<year>\d\d\d\d) (but named capture groups in most libraries/languages seem to use angle brackets), and have the corresponding entries in the returned map. It can also cope with repeated capture groups by having the map value be a sequence.
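As a rough illustration of design (2), the following Python sketch builds such a map from a single match. The helper name match_map is hypothetical, and treating a non-participating group as the empty string is an assumption. Python keeps only the last repetition of a repeated group, so the "value is a sequence" variant is not modelled here.

```python
import re

def match_map(value, pattern):
    """Hypothetical helper: {0: whole match, 1: group 1, ..., name: named group}."""
    m = re.search(pattern, value)
    if m is None:
        return {}  # assumption: no match yields an empty map
    result = {0: m.group(0)}
    # Integer keys for the numbered capture groups.
    for i in range(1, m.re.groups + 1):
        result[i] = m.group(i) or ""
    # Extra string keys for any named capture groups.
    for name, text in m.groupdict().items():
        result[name] = text or ""
    return result
```

With the pattern (?P<year>\d\d\d\d), the same captured text is reachable both as entry 1 and as entry "year".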

In other languages:

In Python you can save the result of a match in a variable, amanda, say, and use amanda.group(3) to get the third captured match. You can also write amanda.group('year') for a named capture buffer.

In Perl you can write my @items = ($input =~ m/$regex/); to get an array. You can refer to named capture groups through the oddly named hash %+, so $+{year} and so on.

In .NET, like Python, you can use the Groups property of a Match.

So the .Net and Python approaches lead me to prefer the idea of a map, but here is a proposal with just a sequence:

Question:

Why is there no collation argument for regular expression functions?

Proposal:

5.6.7 fn:match-groups

Summary

Matches a string against a regular expression in the same way
as fn:matches() and with the same interpretation of flags,
but returns a sequence of strings, one for each
[capturing group] in the given regular expression.

Signature

fn:match-groups(
$value as xs:string?,
$pattern as xs:string,
$flags as xs:string? := ""
) as xs:string*

Properties

This function is ·deterministic·, ·context-independent·, and ·focus-independent·.

Rules

If $value is the empty sequence, it is interpreted as the zero-length string.

The function returns a sequence of strings, one for each capturing
group in $pattern. If a group matched a zero-length sequence of characters
in $value, a zero-length string is returned at that position.

Error Conditions

A dynamic error is raised [err:FORX0002] if $pattern is invalid according to the rules described in 5.6.1 Regular expression syntax.

A dynamic error is raised [err:FORX0001] if $flags is invalid according to the rules described in 5.6.2 Flags.

Examples

fn:match-groups(
"It was in the January of 1836 that she set out.",
"(January|February|March|April...).*(\d\d\d\d)"
)

returns ("January", "1836")

fn:match-groups(
"BBC",
"(a*)(b+)(c*)",
"i"
)

returns ("", "BB", "C")
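The proposed rules can be approximated with a short Python sketch. This is not the specification: it uses Python's regex dialect instead of the XPath one, maps only the "i", "m" and "s" flags, and assumes that a failed match yields the empty sequence and that a non-participating group yields the zero-length string. The proposal text above does not settle either of the last two points.

```python
import re

# Assumed mapping from XPath flag letters to Python's re flags (subset only).
_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}

def match_groups(value, pattern, flags=""):
    """Sketch of the proposed fn:match-groups, using Python regex syntax."""
    value = value or ""              # empty sequence is read as the zero-length string
    re_flags = 0
    for letter in flags or "":
        re_flags |= _FLAGS[letter]   # raises KeyError on an unsupported flag
    m = re.search(pattern, value, re_flags)
    if m is None:
        return []                    # assumption: no match gives the empty sequence
    # One string per capturing group; unmatched groups become "" (assumption).
    return [g if g is not None else "" for g in m.groups()]
```

Under these assumptions, match_groups("BBC", "(a*)(b+)(c*)", "i") gives ["", "BB", "C"], in line with the second example above.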

michaelhkay commented 2 months ago

When you compare a string against a regex there are in general multiple matches, each one of which has multiple captured groups. These days we could represent the full complexity of the result using maps and arrays, but back in the day we chose to represent it using XML (see fn:analyze-string). It's not clear to me that the result using maps and arrays would be more usable than the result represented as XML, and in fact delivering the result in the form of the original string augmented with markup has some real benefits.
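For comparison, this is roughly what "multiple matches, each with multiple captured groups" looks like as nested lists. It is a Python sketch only; the helper name all_match_groups and the choice to put the whole match at position 0 of each inner list are assumptions for illustration.

```python
import re

def all_match_groups(value, pattern):
    """Hypothetical helper: one inner list per match (group 0, then each capture group)."""
    return [
        [m.group(0)] + [g if g is not None else "" for g in m.groups()]
        for m in re.finditer(pattern, value)
    ]
```

Unlike the fn:analyze-string result, this structure drops the non-matching text between matches, which is the trade-off raised above.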

liamquin commented 2 months ago

Maybe fn:analyze-string() is enough. It’s true that the XML markup could be extended in the ways i mentioned, if needed, and also that being able to use *:match makes using analyze-string a little easier. However, in practice i’ll continue to use a wrapper for getting at just the matched groups.

ChristianGruen commented 2 months ago

It would be nice indeed to have a more lightweight alternative to fn:analyze-string (which people struggle with). The function provides much more functionality than what most users need for everyday tasks. We could keep it simple, do what (I believe) most other languages do and return a flat sequence, for example ("ab", "b") for match-groups('xabx', '(a(b))')?

If #37 is not dropped, we could then do:

let $date := "2010-10-10"
let $pattern := "^(\d{4})-(\d{2})-(\d{2})$"
let ($year, $month, $day) := match-groups($date, $pattern)

Currently, it would e.g. be:

let $groups := data(analyze-string($date, $pattern)/fn:match/fn:group)
let $year := $groups[1]
let $month := $groups[2]
let $day := $groups[3]

liamquin commented 2 months ago

@ChristianGruen very much so. Although maybe we ended up with a "with prefix" expression (i lost track, i know i raised it as an issue!) in which case, in environments without fn predeclared,

let ($year, $month, $day) := (with prefix "fn" := "http://www.w3.org/2005/xpath-functions" return analyze-string($input, $regex)/fn:match/fn:group/text())

but this is not as nice.

benibela commented 2 months ago

I have an extract function in Xidel that only returns the matched text and the third parameter lets one choose the returned capture groups (I always thought it to be faster to only return the data one needs)

extract(
"It was in the January of 1836 that she set out.",
"(January|February|March|April...).*(\d\d\d\d)"
)

returns "January of 1836"

extract(
"It was in the January of 1836 that she set out.",
"(January|February|March|April...).*(\d\d\d\d)"
, 1 to 2)

returns ("January", "1836")

extract(
"It was in the January of 1836 that she set out.",
"(January|February|March|April...).*(\d\d\d\d)"
, (0,1,2,2,2,2))

returns ("January of 1836", "January", "1836", "1836", "1836", "1836")

extract(
"It was in the January of 1836 that she set out.",
"(\w+)"
, 1)

returns "It"

And my function can also return all matches together in a sequence:

extract(
"It was in the January of 1836 that she set out.",
"(\w+)"
, 1, "*")

returns ("It", "was", "in", "the", "January", "of", "1836", "that", "she", "set", "out")

extract(
"It was in the January of 1836 that she set out.",
"(\w)(\w+)"
, (1,2), "*")

returns ("I", "t", "w", "as", "i", "n", "t", "he", "J", "anuary", "o", "f", "1", "836", "t", "hat", "s", "he", "s", "et", "o", "ut")

liamquin commented 2 months ago

@benibela great to see you here. I think your function would be fine, although maybe 0 or -1 would be better than "*", so that the argument could be specified as xs:integer

benibela commented 1 month ago

think your function would be fine, although maybe 0 or -1 would be better than "*", so that the argument could be specified as xs:integer

there is more

All regex functions have a flags parameter, e.g. "i" for case insensitive.

That is where I put the * option. Like "i*" and it returns all matches case insensitively.
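The behaviour described in these comments can be modelled in a few lines of Python. This is a rough sketch, not Xidel's implementation: the parameter names are invented, only the "i" and "*" flags are handled, and the result is always a list (in XPath a one-item sequence is just the item).

```python
import re

def extract(value, pattern, groups=0, flags=""):
    """Rough model of the extract() behaviour described above (not Xidel's code)."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    # Accept a single group number or any iterable of group numbers.
    wanted = [groups] if isinstance(groups, int) else list(groups)
    matches = re.finditer(pattern, value, re_flags)
    if "*" not in flags:
        first = next(matches, None)          # without "*", keep only the first match
        matches = [] if first is None else [first]
    out = []
    for m in matches:
        # Groups are emitted in the order requested, repeats included.
        out.extend(m.group(i) or "" for i in wanted)
    return out
```

Group 0 stands for the whole match, so the default call returns just the matched text, and a flags value such as "i*" combines case-insensitive matching with returning all matches.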

qt4cg / qtspecs

add fn:match-groups() function #1310