moodymudskipper / unglue

Extract matched substrings using a pattern, similar to what package glue does in reverse
GNU General Public License v3.0

Feature Request: Pattern functions #25

Open wdkrnls opened 4 years ago

wdkrnls commented 4 years ago

Great package! This saves me from having to leave R for many tasks. I'm curious whether you think it would be reasonable to support pattern functions similar to those provided by the TXR pattern-munging language (https://www.nongnu.org/txr/). This would be in addition to regular expressions. For example, I might want to ensure that a captured value matches one of a vector of known options.

Imagine I have the data:

The green lawn. The red chair. The blue box. The grey cat.

I define the pattern function:

known_color = function(x) x %in% c("green", "red", "blue", "grey")

Then I can extract like:

unglue_data(input, "The {color=known_color} {object}.")

Thanks for your consideration and great work!

moodymudskipper commented 4 years ago

Thanks!

I didn't know about txr. It would be nice to be able to use it as is, but I didn't find any R interface to it.

You say unglue saves you from leaving R. When you did have to leave R, was it to use txr?

A link for future reference: https://www.nongnu.org/txr/txr-pattern-language.html

Your proposed syntax can't work as is, because {color=known_color} would be read as a regex matching the exact string "known_color". Also, as I believe you allude to, the function would work on top of regular expressions, so there needs to be a place to specify the regex.

Given that the function should return a boolean, we could use the / character to mean "if", like in probability theory. So we'd have:

unglue_data(input, "The {color/known_color} {object}.")

Or with an explicit regex:

unglue_data(input, "The {color/known_color=.*?} {object}.")

Would that meet your needs? Do you think it's intuitive?

Note: I can't do:

unglue_data(input, "The {color=.*?/known_color} {object}.")

Because it is ambiguous: nothing tells me the regex isn't the full ".*?/known_color".
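
Until something like the / syntax exists, a similar effect can be approximated by extracting first and filtering afterwards. The following is only a sketch under assumptions, not part of unglue: apply_predicates() is a hypothetical helper, and the extraction step is written with base R's regexec() and regmatches() so that it runs without the package. With unglue installed, the data frame would instead come from unglue_data(input, "The {color} {object}.").

```r
# Hypothetical helper (not part of unglue): blank out rows whose extracted
# values fail a predicate, mimicking the proposed {color/known_color} syntax.
apply_predicates <- function(df, predicates) {
  keep <- rep(TRUE, nrow(df))
  for (nm in names(predicates)) {
    ok <- predicates[[nm]](df[[nm]])
    keep <- keep & !is.na(ok) & ok
  }
  # A failed predicate is treated like "no match": the whole row becomes NA.
  for (nm in names(df)) df[[nm]][!keep] <- NA
  df
}

known_color <- function(x) x %in% c("green", "red", "blue", "grey")

input <- c("The green lawn.", "The red chair.", "The purple box.")

# Base R stand-in for unglue_data(input, "The {color} {object}.").
m <- regmatches(input, regexec("^The (.*?) (.*?)\\.$", input, perl = TRUE))
extracted <- data.frame(
  color  = vapply(m, function(g) g[2], character(1)),
  object = vapply(m, function(g) g[3], character(1)),
  stringsAsFactors = FALSE
)

result <- apply_predicates(extracted, list(color = known_color))
print(result)
```

In this sketch the "purple" row ends up as NA in both columns, which is one possible reading of "the pattern did not match"; a built-in syntax could equally choose to try the next pattern instead.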

moodymudskipper commented 4 years ago

Note that this example can be solved with:

unglue_data(input, "The {color=(green)|(red)|(blue)|(grey)} {object}.")

Or, if we want to define it separately:

known_color_pattern <- "(green)|(red)|(blue)|(grey)" 
unglue_data(input, sprintf("The {color=%s} {object}.", known_color_pattern)) 

Can you think of a use case where the above wouldn't be satisfactory? I'd prefer not to complicate unglue if the added value is not clear.
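
When the options already live in a character vector, the alternation does not have to be typed by hand. A minimal sketch, assuming none of the options contains regex metacharacters (known_colors is an illustrative name, not something from the package):

```r
# Build the alternation from a character vector of allowed values.
known_colors <- c("green", "red", "blue", "grey")
known_color_pattern <- paste(known_colors, collapse = "|")

# Same sprintf() approach as above; the resulting string is what would be
# passed as the pattern argument of unglue_data().
pattern <- sprintf("The {color=%s} {object}.", known_color_pattern)
print(pattern)
```

This keeps the list of options in one place, so the same vector can be reused for validation elsewhere.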

wdkrnls commented 4 years ago

I gave a poor example. Enumerating known cases is pretty convenient to do in R as you have shown. However, the TXR pattern function approach is way more powerful when you cannot enumerate the options and they cannot be described by a regular expression. I really liked your conditional syntax for boolean functions with /. That would be getting far closer to the power of the TXR approach.
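
As one illustration of a check that is awkward to express as a regex, consider accepting only real calendar dates. The sketch below is an assumption for the sake of example: is_valid_date() is a hypothetical predicate of the kind that could sit behind the / syntax, and the candidate strings stand for what a loose regex such as \d{4}-\d{2}-\d{2} would capture.

```r
# Hypothetical predicate: TRUE only for strings that parse as real dates.
# as.Date() with an explicit format returns NA for impossible dates.
is_valid_date <- function(x) !is.na(as.Date(x, format = "%Y-%m-%d"))

# All three have the right shape, but only the first is a real date.
candidates <- c("2021-02-28", "2021-02-30", "2021-13-01")
valid <- is_valid_date(candidates)
print(valid)
```

Here the shape of the string is easy to describe with a regex, while the validity rule (month lengths, leap years) is far easier to delegate to a function.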