Open liamquin opened 2 months ago
When you compare a string against a regex there are in general multiple matches, each one of which has multiple captured groups. These days we could represent the full complexity of the result using maps and arrays, but back in the day we chose to represent it using XML (see fn:analyze-string). It's not clear to me that the result using maps and arrays would be more usable than the result represented as XML, and in fact delivering the result in the form of the original string augmented with markup has some real benefits.
Maybe fn:analyze-string() is enough. It’s true that the XML markup could be extended in the ways i mentioned, if needed, and also that being able to use *:match makes using analyze-string a little easier. However, in practice i’ll continue to use a wrapper for getting at just the matched groups.
It would be nice indeed to have a more lightweight alternative to fn:analyze-string (which people struggle with). The function provides much more functionality than what most users need for everyday tasks. We could keep it simple, do what (I believe) most other languages do and return a flat sequence, for example ("ab", "b") for match-groups('xabx', '(a(b))')?
If #37 is not dropped, we could then do:
let $date := "2010-10-10"
let $pattern := "^(\d{4})-(\d{2})-(\d{2})$"
let ($year, $month, $day) := match-groups($date, $pattern)
Currently, it would e.g. be:
let $groups := data(analyze-string($date, $pattern)/fn:match/fn:group)
let $year := $groups[1]
let $month := $groups[2]
let $day := $groups[3]
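For comparison, a minimal sketch of the kind of wrapper mentioned above, written against the existing fn:analyze-string. The name local:match-groups is hypothetical and is not the proposed function; the sketch only considers the first match and returns a zero-length string for a group that did not take part in it, both of which are assumptions rather than agreed behaviour.

```xquery
xquery version "3.1";

(: Hypothetical helper, not part of any specification. :)
declare function local:match-groups(
  $input   as xs:string,
  $pattern as xs:string
) as xs:string* {
  (: only the first match is considered in this sketch :)
  let $match := analyze-string($input, $pattern)/fn:match[1]
  (: highest group number that analyze-string reported for this match :)
  let $last  := max($match//fn:group/@nr ! xs:integer(.))
  for $nr in 1 to $last
  (: a group that did not participate yields "" so positions stay aligned :)
  return string($match//fn:group[@nr = $nr])
};

let $date := "2010-10-10"
let $pattern := "^(\d{4})-(\d{2})-(\d{2})$"
return local:match-groups($date, $pattern)
(: returns ("2010", "10", "10") :)
```

Because analyze-string only reports groups that matched, trailing groups that did not participate are simply missing from the result; a built-in function could do better here since it knows how many groups the pattern declares.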
@ChristianGruen very much so. Although maybe we ended up with a "with prefix" expression (i lost track, i know i raised it as an issue!) in which case, in environments without fn predeclared,
let ($year, $month, $day) := (with prefix "fn" := "http://www.w3.org/2005/xpath-functions" return analyze-string($input, $regex)/fn:match/fn:group/string())
but this is not as nice.
I have an extract function in Xidel that only returns the matched text and the third parameter lets one choose the returned capture groups (I always thought it to be faster to only return the data one needs)
extract(
"It was in the January of 1836 that she set out.",
"(January|February|March|April...).*(\d\d\d\d)"
)
returns January of 1836
extract(
"It was in the January of 1836 that she set out.",
"(January|February|March|April...).*(\d\d\d\d)"
, 1 to 2)
returns ("January", "1836")
extract(
"It was in the January of 1836 that she set out.",
"(January|February|March|April...).*(\d\d\d\d)"
, (0,1,2,2,2,2))
returns ("January of 1836", "January", "1836", "1836", "1836", "1836")
extract(
"It was in the January of 1836 that she set out.",
"(\w+)"
, 1)
returns "It"
And my function can also return all matches together in a sequence:
extract(
"It was in the January of 1836 that she set out.",
"(\w+)"
, 1, "*")
returns ("It", "was", "in", "the", "January", "of", "1836", "that", "she", "set", "out")
extract(
"It was in the January of 1836 that she set out.",
"(\w)(\w+)"
, (1,2), "*")
returns ("I", "t", "w", "as", "i", "n", "t", "he", "J", "anuary", "o", "f", "1", "836", "t", "hat", "s", "he", "s", "et", "o", "ut")
@benibela great to see you here. I think your function would be fine, although maybe 0 or -1 would be better than "*", so that the argument could be specified as xs:integer
There is more.
All regex functions have a flags parameter, e.g. "i" for case insensitive. That is where I put the * option. Like "i*", and it returns all matches case insensitively.
Summary:
Please add a function that works like fn:matches() except that it returns a sequence corresponding to the capture groups in the regular expression.
Rationale:
One use of regular expressions is to pick items out of text. The current way to do that is to use a regular expression like ^.*(stuff i want).*$ and to replace this with "$1", and repeat for each capture group in the expression ($2, $3, and so on). This is hard to maintain and read, and of course for N capture groups it's N times slower than one might want.
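To make the cost concrete, here is a small sketch of that replace-based workaround using only the existing fn:replace. The date value and pattern are borrowed from the example earlier in the thread; nothing here is new API.

```xquery
xquery version "3.1";

let $date := "2010-10-10"
let $pattern := "^(\d{4})-(\d{2})-(\d{2})$"
return (
  (: one full regex evaluation per capture group :)
  replace($date, $pattern, "$1"),
  replace($date, $pattern, "$2"),
  replace($date, $pattern, "$3")
)
(: returns ("2010", "10", "10") :)
```

Note that if the pattern does not match at all, replace() returns the input unchanged, so each call silently yields the whole string instead of signalling a failed match.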
Other possible designs:
(1) return an array, to allow for the possibility of repeated capture groups, e.g. (abc)+, each producing a sequence
(2) return a map with integer keys, where 0 is the matched text, 1 is the first capture group, and so on. This has the advantage that we could add named capture groups, e.g. (?P<year>\d\d\d\d) (but named capture groups in most libraries/languages seem to use angle brackets), and have the corresponding entries in the returned map. It can also cope with repeated capture groups by having the map value be a sequence.
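Design (2) could look roughly like the following sketch, again built on the existing fn:analyze-string. The function name local:match-groups-map is hypothetical; the sketch covers only integer keys, since XPath regular expressions currently have no named groups, and it returns the empty sequence when there is no match (an assumption, not a decided behaviour).

```xquery
xquery version "3.1";

(: Hypothetical helper illustrating design (2); not part of any specification. :)
declare function local:match-groups-map(
  $input   as xs:string,
  $pattern as xs:string
) as map(xs:integer, xs:string)? {
  let $match := analyze-string($input, $pattern)/fn:match[1]
  where exists($match)
  return map:merge((
    (: key 0 holds the whole matched text :)
    map:entry(0, string($match)),
    (: keys 1..n hold the groups that analyze-string reports :)
    for $group in $match//fn:group
    return map:entry(xs:integer($group/@nr), string($group))
  ))
};

let $m := local:match-groups-map("2010-10-10", "^(\d{4})-(\d{2})-(\d{2})$")
return ($m?0, $m?1)
(: returns ("2010-10-10", "2010") :)
```

With a map, a group that did not participate is simply an absent key, which avoids the position-shifting problem a flat sequence has.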
In other languages:
In Python you can save the result of a match in a variable, amanda, say, and use amanda.group(3) to get the third captured match. You can also write amanda.group('year') for a named capture buffer.
In Perl you can write my @items = ($input =~ m/$regex/); to get an array. You can refer to named capture groups through an oddly named hash, %+, so $+{year} and so on.
In .NET, like Python, you can use the Groups property of a Match.
So the .NET and Python approaches lead me to prefer the idea of a map, but here is a proposal with just a sequence:
Question:
Why is there no collation argument for regular expression functions?
Proposal:
5.6.7 fn:match-groups
Summary
Signature
Properties
Rules
Error Conditions
Examples