mudge / re2

Ruby bindings to RE2, a "fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python".
http://mudge.name/re2/
BSD 3-Clause "New" or "Revised" License

Inconsistent scanning results compared to Ruby #23

Closed: ljharb closed this issue 9 years ago

ljharb commented 9 years ago

If I do 'abca'.scan(/a/).to_a I get ["a", "a"] which is what I expect. However, if I do RE2::Regexp.new('a').scan('abca').to_a, I get [[], []].

Are my expectations wrong here? Or is this a bug?

mudge commented 9 years ago

Hi Jordan,

RE2's Scanner has a slightly different interface to Ruby's String#scan: in order to capture any matches, you need to use capturing groups in your regular expression:

RE2('(a)').scan('abca').to_a
#=> [["a"], ["a"]]
RE2('(ab?)').scan('abca').to_a
#=> [["ab"], ["a"]]

This is because the Scanner actually wraps re2's FindAndConsumeN under the hood.
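For comparison, Ruby's own String#scan also returns one array per match as soon as the pattern contains a capturing group, so the Scanner output above has the same shape as the grouped form of String#scan. A minimal sketch of flattening such results back into a list of matches; it uses only Ruby's built-in Regexp (not the re2 gem), and first_groups is a hypothetical helper name:

```ruby
# Built-in Ruby Regexp only; the re2 gem is not required here.

# Hypothetical helper: reduce one-group scan results to a flat list
# by taking the first (and only) captured group of each match.
def first_groups(rows)
  rows.map(&:first)
end

plain   = 'abca'.scan(/a/)    # no groups: one string per match
grouped = 'abca'.scan(/(a)/)  # one group: one array per match

p plain                  # ["a", "a"]
p grouped                # [["a"], ["a"]]
p first_groups(grouped)  # ["a", "a"]
```

The same map(&:first) step should apply to any array of one-element arrays, such as the Scanner results shown above, once the whole pattern is wrapped in a single capturing group.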

ljharb commented 9 years ago

@mudge OK, so then how can I use re2 to replicate the .gsub interface, which returns an enumerator, takes a block, or takes a hash of replacements?

ljharb commented 9 years ago

i.e., there appears to be no way with re2 to enumerate everything that RE2.GlobalReplace would replace, only the explicit capturing groups.

mudge commented 9 years ago

Unfortunately, I can't find an obvious analogue to Ruby's String#gsub when given a block in re2's API.

The "Scanning text incrementally" section only covers Consume and FindAndConsume (which is implemented as the Scanner in this gem) and the only replacement options seem to be Replace and GlobalReplace which operate on the whole input in one go.

Maybe we could find an alternative based on your use case? Do you need to do an incremental replacement on a large input?

ljharb commented 9 years ago

What does RE2.GlobalReplace use under the hood with my RE2::Regexp to locate matches for replacement? Could that be exposed at all?

I think that would be sufficient for me to implement all of hash-based, block-based, and enumerator-based substitution.

mudge commented 9 years ago

GlobalReplace just uses the underlying re2 library's RE2::GlobalReplace function. The underlying C++ API doesn't yield matches in any way: it just performs the replacement internally.

However, the source shows that it just uses Match and Rewrite internally, so perhaps there is a way to piece this together?
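As a rough illustration of what "piecing this together" could look like, here is a sketch of a gsub-style loop built from a single "find the next match at or after this offset" primitive. It uses Ruby's built-in Regexp#match(str, pos) as a stand-in for that primitive; whether the re2 gem exposes an equivalent offset-based match is an assumption that this thread does not confirm. GsubSketch is a hypothetical name.

```ruby
# Stand-in sketch using Ruby's built-in Regexp, NOT the re2 gem.
module GsubSketch
  # Replace every match of `pattern` in `text`. The replacement comes
  # from a hash lookup or from the block; with neither, return an
  # enumerator over the matched strings.
  def self.gsub(text, pattern, replacements = nil)
    unless replacements || block_given?
      return enum_for(:gsub, text, pattern)
    end

    result = +''
    pos = 0
    while pos <= text.length && (m = pattern.match(text, pos))
      start = m.begin(0)
      finish = m.end(0)
      result << text[pos...start] # copy the unmatched gap
      result << (replacements ? replacements[m[0]] : yield(m[0])).to_s
      if finish == start
        # Empty match: copy one character and step past it
        # so the loop always makes progress.
        result << text[start, 1].to_s
        pos = start + 1
      else
        pos = finish
      end
    end
    result << text[pos..-1].to_s # copy whatever follows the last match
  end
end

p GsubSketch.gsub('abca', /a/) { |s| s.upcase } # "AbcA"
p GsubSketch.gsub('abca', /a/, 'a' => '1')      # "1bc1"
p GsubSketch.gsub('abca', /a/).to_a             # ["a", "a"]
```

The block, hash, and enumerator forms all share the same loop, so only the match primitive would need to change to run this against a different engine.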

ljharb commented 9 years ago

That would be awesome if there is a way :-) C++ isn't my strong suit though, unfortunately.