obdurodon / dh_course

Digital Humanities course site
GNU General Public License v3.0

Using Regular Expressions in XQuery #190

Closed gabikeane closed 5 years ago

gabikeane commented 5 years ago

I'm working on Homestuck for the third XQuery assignment, and I'm in a spot where something like regex-group() would be very useful:

I'm trying to extract the substring captured by a regular expression group from caption elements that match '^"([A-Z][a-z]+): .*"' (in other words, write an XPath expression that finds a caption like '"John: I'm talking right now!"' and returns the substring "John").

Right now, what I've got is

for $capts in doc('/db/course/homestuck/2012-10-31_homestuck.xml')//caption[matches(., '^"([A-Z][a-z]+): .*"')]/tokenize(., '"|:')[not(contains(., ' '))][matches(., '^[A-Z][a-z]+?')]
return $capts

But what I would really like is something like regex-group(1) so that I can use just the ([A-Z][a-z]+) group in my regular expression, which locates the string very precisely, instead of everything where the first letter is capitalized and isolated by quotes and colons (for example, '"John pushes the button marked "Mix"."' will allow "Mix" to make it through all of my above filters). Is there a way to use regex-group() or otherwise use regular expressions in string-formation?

XQuery is not very happy with me when I try, e.g.:

for $capts in distinct-values(doc('/db/course/homestuck/2012-10-31_homestuck.xml')//caption[matches(., '^"([A-Z][a-z]+): .*"')]/regex(1)
return $capts

This does not surprise me at all: the predicate has already done its job of filtering, and there's no reason its regular expression should be accessible on the other side of the forward slash, where the context is once again the selected captions, not the regular expression.

To cut an overly long and unnecessarily code-filled post short, I wonder if there's a way to get to a regular expression--or anything else--that will let me be precise in the way that I want, here.

gabikeane commented 5 years ago

Instructor: Part of your problem is a missing close paren at the end of your for line. As for the rest of your problem, I think you'll need to use replace() instead of (or possibly in combination with) matches(), but I couldn't get the captured pattern to work.

I used http://en.wikibooks.org/wiki/XQuery/Regular_Expressions#Examples_of_repl... as a reference, but no dice so far...

gabikeane commented 5 years ago

Student: Thank you for the insight, Janis. I replaced the tokenize() function with a replace() function, and it did exactly what I was asking for help with.

for $capts in distinct-values(doc('/db/course/homestuck/2012-10-31_homestuck.xml')//caption[matches(., '^"([A-Z][a-z]+): .*"')]/replace(., '^"([A-Z][a-z]+): .*"', '$1'))
return $capts

In the end, having to work around this resulted in identifying the better environmental pattern for "PCs," but I think regular expressions will make this code look less gross. Thanks!
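
For anyone landing here later, here is a minimal self-contained sketch of the replace() approach above. The inline caption elements and the $captions and $pattern variables are made up for illustration and only stand in for the Homestuck file; the pattern is the one from this thread.

```xquery
xquery version "3.1";

(: Hypothetical sample data standing in for the captions in the course file :)
let $captions := (
  <caption>"John: I'm talking right now!"</caption>,
  <caption>"John pushes the button marked "Mix"."</caption>,
  <caption>"Rose: Hello."</caption>
)
(: Same pattern as in the thread; group 1 captures the speaker's name :)
let $pattern := '^"([A-Z][a-z]+): .*"'
return distinct-values(
  (: The predicate keeps only captions that match the pattern... :)
  for $c in $captions[matches(., $pattern)]
  (: ...and replace() swaps the matched text for the contents of group 1 :)
  return replace($c, $pattern, '$1')
)
```

With this sample data the "Mix" caption should be filtered out by the predicate, leaving only John and Rose. Note that replace() only rewrites the part of the string the pattern matches, so if a caption had text after its final quotation mark, that text would survive; ending the pattern with $ would be one way to guard against that.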

gabikeane commented 5 years ago

Instructor: Pffff, as usual, it's the littlest errors that keep our code from working. I was missing the quotes around $1 when I was experimenting. Go figure.

Glad you got it to work! Thanks for sharing your answer. :D

gabikeane commented 5 years ago

Student: As an update, I did resolve my problem just by using two convenience variables.

for $intros in distinct-values(doc('/db/course/homestuck/2012-10-31_homestuck.xml')//p[contains(., 'Your name is')]/substring-before(., '.'))
for $chars in tokenize($intros, 'Your name is ')
return $chars

This gets me the names of each of the characters isolated. But I'd still be interested in the possibility of using regular expressions out of predicates and parameters...
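
One possible answer to that last question, assuming the processor supports XQuery 3.0 or later (which may not hold for every installation): the standard analyze-string() function returns the match as an XML fragment in which each captured group is an fn:group element with an nr attribute, which gets close to what regex-group() does in XSLT. A minimal sketch, with a made-up $caption string in place of the real data:

```xquery
xquery version "3.1";

(: Hypothetical caption text; the doubled '' is an escaped apostrophe :)
let $caption := '"John: I''m talking right now!"'
(: analyze-string() wraps matches in fn:match and captured groups in fn:group :)
let $result := analyze-string($caption, '^"([A-Z][a-z]+): .*"')
(: Pull out the text of capture group 1 :)
return $result//fn:group[@nr = 1]/string()
```

This should return the string John. Because the groups come back as ordinary nodes, they can be selected with normal path steps and predicates rather than only through the replacement-string argument of replace().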