Feature request: regex (or minimally, startswith/endswith) string tests

gojomo commented 11 years ago

String contains() is nice, but full regular-expression matching would be great. Or perhaps as smaller step: startswith/endswith or positional string-matching tests?

georgir commented 10 years ago

If we get substr() or characters in strings could be iterated, #192, we could easily implement startswith/endswith and the like ourselves. Regex functionality (both test and replace, first-match or global, with capturing subpattern matches and their positions) would be a great addition though.

jedisct1 commented 10 years ago

+1

nrk commented 10 years ago

I think full-blown regular expressions are probably a bit of an overkill but something like Lua's pattern matching would be fast and powerful enough for most tasks.

georgir commented 10 years ago

On first look that really seems to me the same as regexps with % instead of \ for escape. I do like their () solution for capturing position though.

stedolan commented 10 years ago

String processing in jq is not good at the moment. I really want to implement a PEG-style parsing system like LPeg or Parsec, which I find a bit more useful and readable than regexes. It's a bit of work though, and I haven't had a chance recently. In the meantime, a wrapper around libc's regex functions would probably be a useful stopgap.

amagura commented 10 years ago

+1 Support for GNU sed-like regex expression support.

nicowilliams commented 10 years ago

startswith/endswith and a few more goodies are now in master.

pkoppstein commented 10 years ago

As wonderful as jq is, I would argue that true greatness will require support for regular expressions with named captures. (*)

My impression is that the oniguruma open-source library (**) would be a very good match for jq.

Meanwhile, thank you stedolan !

(*) Ruby supports named captures very nicely, as illustrated by this interaction:

> /< *(?<email>[^> ]*)>/ =~ "Abe Lincoln <abe@whitehouse.gov>"
=> 12
> email
=> "abe@whitehouse.gov"

The creation and setting of local variables in this manner should be possible in jq, perhaps allowing jq expressions such as the following:

.people[].email | select( match( "(?

[^> ])>" )) | $address

Supporting regex with named captures in this manner would also allow parsing of dates in a way that would resolve the "sorting by non-numeric dates" issue mentioned above (#359).

(**) http://www.geocities.jp/kosako3/oniguruma is a stable, multi-platform open source regex library written in C that is used by ruby 1.9 and php.

wtlangford commented 10 years ago

I'm currently working on implementing this. Are we willing to add a hard dependency on an external library for regex support? Or would we rather make it more optional, such as a capability that has to be enabled at compile time? If we're okay with the hard dependency, I also like oniguruma. It seems to be readily available from package managers, both on various *nixes and brew for os x, and is small in size (the fedora package took up a whopping 671 KB.

nicowilliams commented 10 years ago

So far jq has been very portable largely because it has no external run-time dependencies -- just the C and math libraries. How will adding a hard dependency affect the Windows cross-compilation build, for example? I'm not sure, but I'd urge you to try a cross-compilation before doing a PR, just to make sure that I can build and release for Windows.

The way jq currently handles number parsing and printing is by including in the source tree what should really have been an external library (jv_dtoa.c). I was considering doing something like that for bignums: include libtommath (or libtomfloat, or whatever the name is that escapes me atm) in the jq source tree. You might do the same for regexp, but you need to be careful that the license of whatever library you choose is compatible with jq's.

Alternatives include:

using a library that is widely available, including Windows, so that for building a release one need only download the headers and DLL and it all just works (except now we'd have to distribute jq and the regexp DLL)
implement a plugin loading facility, using dlopen() on POSIX, LoadLibrary*() on Windows, ...

Having just done a release of jq, these issues are very much on my mind.

Thoughts?

nicowilliams commented 10 years ago

Also, since I do want regexp to be part of jq and don't want users to have to separately download a regexp DLL and plugin... I think the best option is to add a suitable regexp library to the source tree. If there's a suitable such library with a git repo, then maybe we could use git submodules.

wtlangford commented 10 years ago

Is there a particular flavor of regexp we are looking for? I'm working assuming we're looking to support PCRE-style regexes, but that could be wrong. And if we are, are we aiming to support things like PCRE modifiers (case-insensitive, extended mode (non-significant whitespace), etc)?

wtlangford commented 10 years ago

Of course, I can also just add support in directly for them, as well as for the other regexp styles, a la PHP's http://www.php.net/manual/en/function.mb-regex-set-options.php Maybe a match("/pattern/modifiers") and a match_ex("/pattern/modifiers","mode")

nicowilliams commented 10 years ago

I'm not able to search right now, but the issue has come up before and IIRC @stedolan wanted a PCRE. You can search the issues list yourself, or i will tomorrow.

pkoppstein commented 10 years ago

@wtlangford asked:

Is there a particular flavor of regexp we are looking for?

Since the "j" in "jq" ultimately refers to JavaScript, I believe that jq support for regular expressions should (at least in general) be guided by JavaScript. A useful reference is:

https://developer.mozilla.or/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

If in doubt, I would suggest checking (and maybe following the example of) Ruby 2.1:

http://www.ruby-doc.org/core-2.1.1/Regexp.html

One really important point I'd like to make, however, is that it should be possible in jq to work with regular expressions and flags (aka modifiers) as data. That is, one should be able to read regular expression specifications and flags as data, and manipulate them as data. Since "data" in the jq context means JSON, it follows that jq should support JSON specifications of regular expressions and flags.

For example, suppose we want a jq function match(REGEX) that takes as input a STRING. The simplest approach that avoids new syntax would be for match() to expect the regex to be supplied as a string, e.g. "/abc/i" (or perhaps as a JSON array or object). Indeed, given the wealth of possibilities, I cannot really see the point in introducing additional syntax to handle regular expressions.

Of course, if JSON had a regex type, then we'd certainly use it, but it doesn't :-(

wtlangford commented 10 years ago

With respect to how to include the regex library, the one I'm currently using doesn't have a git repository, though that isn't hard to correct. It does, however, have a non-trivial build and I'm not sure how to work that into the current build. I've currently modified the build to link against a system version for testing purposes, though I would not add that to a pull request without discussing.

wtlangford commented 10 years ago

@nicowilliams: A search through the issues shows that @stedolan was considering using PCRE, so I'll use that syntax, which is nice, since the JavaScript regular expressions seem to have been based off of Perl's anyways. @pkoppstein I agree that the specification and flags should come in as data and not require a new syntax. I'm looking at ways to keep it simple, both in code and in usage. I can't decide if it seems more likely/useful that they come in as a single piece of data or as two, though. Thoughts on whether the function should be something like match("/abc/i") or match("/abc/"; "i") or match("abc"; "i") ? If we do one of the first two, then any backslashes in the actual regex pattern will need to be escaped and we will need to unescape them during processing. Of course, we could also define it to be match(["abc","i"]) Also: thoughts on a handful of regex-related functions, similar to javascript's exec, test, match, search? (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Working_with_Regular_Expressions)

pkoppstein commented 10 years ago

@wtlangford contrasts:

match("/abc/i")
match("/abc/"; "i")
match("abc"; "i")
match(["abc","i"])

The key to resolving this issue is, I believe, that whereas jq requires that functions have a specific arity, it does support polymorphism. That is, in a sense, we can have our cake (match("abc")) and eat it (match(["abc", "i"])).

In fact, we could easily support: 5. match(STRING), match([STRING]), and match([STRING, STRING])

So unless @nicowilliams will allow us to have both match(STRING) and match(STRING;STRING), we can say:

(1) has the advantage of familiarity but at double the peskiness (because we have both '"' and '/' to trick everyone up) (2) and (3) don't fly.

((2) seems pointless anyway.)

Another (very tiny) advantage of (5) is that it could be further extended, e.g. to support {"regex": "abc", "flags": "i"}.

wtlangford commented 10 years ago

@pkoppstein I agree with your analysis. A polymorphic approach like that is very JavaScript and seems to be the best option that exposes additional functionality without too much clutter (since arity is strictly enforced). I'll work towards 5. match(STRING), match([STRING]), and match([STRING, STRING]) and will throw in the match(OBJECT) implementation as well. It won't take much additional work.

georgir commented 10 years ago

I just want to add that it would be best if the result from match() is very detailed, like an array of objects with offsets and lengths of all subpatterns, not just the matched strings. And if we support named subpatterns, then the result should probably be an object with separate object subkey for named subpatterns and array subkey for unnamed subpatterns, etc... From that and the brand new string slicing we can implement replace functions in jq itself... I guess... are string slices assignable?

wtlangford commented 10 years ago

@georgir said:

I just want to add that it would be best if the result from match() is very detailed, like an array of objects with offsets and lengths of all subpatterns, not just the matched strings. And if we support named subpatterns, then the result should probably be an object with separate object subkey for named subpatterns and array subkey for unnamed subpatterns, etc...

But for consistency at that point if no matches are found the result should be []. And then that wouldn't work inside of select(match("abc")), which is where I see this getting a lot of use. I like having the ability to return the matches, though. I'll likely add a search() builtin that does this, maybe returns an object like this:

{
    "matches" : [ {
        "offset" : #,
        "length" : #,
        "string" : "matchString"
    }, ... ],
    "captures" : [ "captureString", ... ],
    "namedCaptures" : { "name" : "captureString", ... }
}

nicowilliams commented 10 years ago

Multiple arity could be added: one def per arity. Adding bigger arity than six requires changes to parser.y; I would prefer not to go that way. Varargs for jq functions would necessitate new syntax because jq function arguments are lambdas, not values (e.g., $1 and so on for each lambda). Varargs for C-coded functions would be OK, but in general C-coded functions should be wrapped with jq ones, and anways, it's easy enough to pass/take arrays in those cases.

Bottom line: the easiest thing to do is to pass an array when you want multiple argument values rather than multiple lambdas, and always be careful not to accidentally cause cartesian products where no one expects them. But it's true that multiple arguments look better than arrays.

As to polymorphism, yes, but no automatic type conversions -- that's the key.

match(...) and match(..., options) and so on would be OK. But so would match(...) and match([..., options]). If you are up for contributing multiple arity support in builtin"c and compile.c, go for the first, else the second.

nicowilliams commented 10 years ago

@pkoppstein The 'j' in jq does not stand for JavaScript. Nowhere in the docs (nor source) is "jq" explained as standing for anything. But the docs talk a lot about JSON right on the homepage while nothing is said about JavaScript. The reader is subliminally invited to think of "jq" as being a "JSON query" language. But still, nothing is said about what "jq" stands for. Feel free to propose importing language features from any language, but not bad language features (of which JavaScript is full) :)

georgir commented 10 years ago

Will, captures and namedCaptures should have the structure of matches in your example, wanting offset for them and not just the strings is exactly what I ment originally. Perhaps me calling them subpatterns instead of captures was confusing. And I did overlook a regex with global-match flag, I was just thinking of single match with multiple captures. In global match mode, each full match result object should have its own captures/namedCaptures subkeys.

Whether you return an empty array or null or false for no match is fine by me... .[]? lets us iterate all possibilities without boilerplate checks, and length can be added to your select example. You might want to add a boolean test alternative to match though, if it will help performance to run with no captures and quit on the first match with the specific regex lib.

EDIT: Actually, a most jq-ish way would be to return empty on no match, one object on one match, and multiple objects on multiple matches for global-flagged regexes, as separate outputs rather than in an array.

wtlangford commented 10 years ago

I do like test as a name for that better. I'll probably implement it anyways, if nothing more than to make select with a regex simpler, though I will see if I can get the lib to stop early. Regarding the result format, that does make more sense; I misunderstood you the first time.

pkoppstein commented 10 years ago

@nicowilliams wrote:

The 'j' in jq does not stand for JavaScript.

True enough, but please note that I wrote: "the "j" in "jq" ultimately refers to JavaScript ..."; that is, the j in jq connotes "JSON", which is derived from "JavaScript Object Notation". In any case, jq expects its input to be valid JSON, which is derived from JavaScript. Since there is a decent standard for JavaScript, and since JS does have its Good Parts, it would in my opinion be better for jq to err on the side of consistency with these Good Parts rather than becoming YAJ (yet another jumble).

Btw, since JS does not (so far as I'm aware) provide boolean regex-match operators (often "=~" and "!~"), I would like to propose that those operators be included in the Grand Plan for jq.

pkoppstein commented 10 years ago

@wtlangford wrote (in connection with match):

but for consistency at that point if no matches are found the result should be []

Once again, one has only to turn to JavaScript (TGP) for illumination:

match    A String method that executes a search for a match in a string. 
         It returns an array of information or null on a mismatch.

nicowilliams commented 10 years ago

No, RFC7159 and its predecessors also aren't about JS. Really, JSON is independent of JS, and let's keep it that way.

wtlangford commented 10 years ago

So, not to belabor the topic, but I'd like to establish an interface we all agree that we like for regexes. After the discussions above, I'm not a fan of =~ or !~ just for the additional syntaxes they add (and the potential need to process the regexp string to separate mode flags).

I'd like to propose the following interface:

match(regex)         Runs the regex on its input and returns an array of matches
match(regex, mode)   Runs the regex on its input and returns an array of matches,
                     using the mode flags given
test(regex)          Runs the regex on its input and returns true if a match is found
test(regex, mode)    Runs the regex on its input and returns true if a match is found,
                     using the mode flags given

The return value from match will look something like this:

[ {
    "offset" : #,
    "length" : #,
    "string" : "matchString"
    "captures" : [ {
        "name" : "name", // For non-named groups, is this missing or empty or null or a number?
        "offset" : #,
        "length" : #,
        "string" : "unnamedCapture1"
    }, ... ]
}, ... ]

Which is an array of objects representing matches of the entire regexp. For a successful match without the global modifier, there will be only one in the array. If the global modifier was given, it will contain all non-overlapping matches, in the order they were found. All captures have been placed into the same array, since named captures still take up backreference numbers for the non-named captures. /(abc) (?<name>def) (ghi)/ has 3 capture groups, 1, 2 (which is also name) and 3.

If match fails, do we want to return null or false or [] ?

Thoughts?

nicowilliams commented 10 years ago

@wtlangford I like that interface design. The only alternative I can think of would be:

match(regexp_string)
match([regexp_string, mode)
test(regexp_string)
...

As for match's outputs, ideally it should be a generator, generating one item per match instead of an array. The C-coded interface should be _match() and it should output just an array (no match -> []), but its jq wrapper should be def match(re): _match(re; null)|.[] and def match(re; mode): _match(re; mode)|.[]. I'm open to arguments to the contrary, of course.

wtlangford commented 10 years ago

Is that meant to be match([regexp_string, mode]) ? I played with that, but it made the checks and unpacking the values look really ugly.

nicowilliams commented 10 years ago

@wtlangford Oops, yes. And you're right, make it your original proposal, + my answer as to outputs.

wtlangford commented 10 years ago

Done. I'll wrap that up in a bit. Next question is how do we want to handle building and distribution, now that there's an external dependency?

nicowilliams commented 10 years ago

Did you get multiple arity patched in? Good. As for build and release issues:

for now we could just integrate, kick that can down the road (but this must be resolved before a 1.5 release, and we should be confident that we can do so)
if you can, try cross-compiling for Windows with mingw (http://mingw-w64.sourceforge.net/download.php; download, unzip, add the bindir to PATH when running scripts/crosscompile; git grep crosscompile to find out more)
if you can, test on Windows
does the library you chose build on Windows? are there pre-built DLLs available for download?
if need be we can exclude regexp where the regexp library is not available; tough beans
if the license is friendly enough we can include a copy of the library in jq's git repo itself, much like Heimdal (https://github.com/heimdal/heimdal) does for things like SQLite3 and libedit

nicowilliams commented 10 years ago

@wtlangford Another option regarding mode is to make it a required argument; I wouldn't mind.

wtlangford commented 10 years ago

@nicowilliams Not sure I'm going to be able to get multiple arity support with my current understanding of jq's internals. I can't quite figure out where that needs to be done.

nicowilliams commented 10 years ago

@wtlangford

Making the mode argument required is the expedient solution. Making multiple arities work would require changing jq_parse*() and block_bind_referenced() and surrounding code support discriminating blocks for functions with the same names but different argument list sizes. It might not be difficult, but it does require swapping in a bunch of state.

wtlangford commented 10 years ago

That's where I was looking. I'll continue to poke at how that works. In the meantime, I'll just make the mode argument required. Working out a few UTF-8 related bugs and then it should be time to write tests for it and look at windows deployment.

pkoppstein commented 10 years ago

@wtlangford wrote:

I'll just make the mode argument required.

I hope you are only referring to some helper function rather than "match" itself. That is, I hope you are not abandoning the trio: match(STRING), match([STRING]), and match([STRING, STRING]).

By the way, when I wrote about introducing infix operators such as =~ and !~, I did NOT mean to imply support for the "/abc/i" style of regex. On the contrary, I had in mind exactly the same polymorphism we've already discussed, e.g.

select( . =~ [ "[abc]", "i"] )

nicowilliams commented 10 years ago

@wtlangford Multiple arities should be doable in block_bind_subblock(), but it requires code to count the expected arguments of each candidate function found whose block could be used for the binding.

nicowilliams commented 10 years ago

@wtlangford Yes, that's where we'd make multiple arities work. All that has to be done is check that the arg count from the caller (body) and function (binder) match. However, we'll likely end up with a situation where incorrect argument count causes "error: is not defined" -- we'd have mark up argument list mismatches in this same function (block_bind_subblock()) so that we can signal the correct error later on. It's not quite that trivial an improvement, but I think it'd be a great improvement.

nicowilliams commented 10 years ago

@wtlangford We'd have to detect argument number mismatch in block_bind_subblock() and record it in the body's block, allowing some other binder to get bound in a subsequent call. Then modify expand_call_arglist() to make use of that information to produce the correct error. It's probably easy, so I may get to it soon.

wtlangford commented 10 years ago

Playing with valgrind, the library I'm using (oniguruma) leaks... ~86K of memory when using case-insensitive mode. At a guess, it seems to be some state information used for case insensitivity in UTF-8 mode. I can patch together a fix for that, but then we'll have to look into a way to use our patched version of oniguruma in builds. Oniguruma uses the BSD license.

Thoughts?

wtlangford commented 10 years ago

@pkoppstein Doing the polymorphic version made the interface feel a bit weird, not to mention the code to extract the values from the inputs was unpleasant. Once multiple arities gets implemented, it might be nice to look into it again, but I think I like the match("regex"; "modifiers") style best. Thoughts, @nicowilliams?

nicowilliams commented 10 years ago

Regarding the leak, of it's one-time it's OK, otherwise it has to get patched.

nicowilliams commented 10 years ago

Don't count on multiple arities just yet...

wtlangford commented 10 years ago

I chained two match() calls together and the leaked amount didn't increase, so it seems to only be one time, and even then, only if the match is in case-insensitive mode. I wasn't counting on the multiple arities existing. I was just saying I'd rather have the match("regex"; "modifier") implementation than the weird polymorphic one, if we're restricted to only one. When/If multiple arities occurs, I'd put the polymorphism in the single-argument flavor.

nicowilliams commented 10 years ago

valgrind will should tell you where the leak came from, and you can then check that library's docs/source to see if that's expected. (A lot of libraries leak one-time at load-time. Can't begrudge them. It's OK. It's even OK if you just show that it's one-time without inspecting the library.)

You will have to setup tests/run to use the --suppressions=<filename> option to valgrind so as to pass the tests.

pkoppstein commented 10 years ago

@wtlangford wrote;

@pkoppstein Doing the polymorphic version made the interface feel a bit weird, not to mention the code to extract the values from the inputs was unpleasant ... Thoughts, @nicowilliams?

Why not define the C functions in the simplest possible way and then provide the user-oriented polymorphic functions as jq functions (at the bottom of builtin.c)? Polymorphism is trivial there, as in:

def match(a): if type == "string" then match_string(a)
  else if type == "array" then match_array(a)
          else error ("whoops")
  end
end;

nicowilliams commented 10 years ago

@pkoppstein Yes, that's what I had in mind; perhaps @wtlangford is still not happy with it for some reason?

BTW, back to multiple arities, today when you define a function twice, the second is used for all references to it defined after it, but the first is used for all references in between the first and the second define. Adding multiple arities might cause some suprisies. I wonder if it's too late for multiple arities, if maybe we need a new syntax to permit them.

jqlang / jq

Feature request: regex (or minimally, startswith/endswith) string tests #164