mudge / re2

Ruby bindings to RE2, a "fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python".
http://mudge.name/re2/
BSD 3-Clause "New" or "Revised" License
130 stars 13 forks source link

Implement begin and end methods on the MatchData #22

Closed mudge closed 9 years ago

mudge commented 9 years ago

An updated version of @driskell's #21 (rebased against 0.6.1).

Unfortunately, rebasing seems to have broken the begin and end functionality: both are correct relative to each other but seem to contain random junk which is throwing off their numbers. It smells like a pointer arithmetic problem, any chance you could take a look, @driskell?

driskell commented 9 years ago

There's a change missing at around 1050: https://github.com/mudge/re2/pull/21/files#diff-e23c54ff6a986f6e4e97f67de7344d98R1122

Basically, the text string that gets stored in the structure needs to be the same one that gets given to RE2. RE2 doesn't return new strings and just returns offsets and lengths within the string passed to it.

With this patch, as is, RE2 receives a different string to the one we store - thus the offsets will include an arbitrary offset equivalent to the distance in memory between the two strings.

mudge commented 9 years ago

Ah, of course; great spot and apologies for missing that.

I've amended the commit so this looks good to go. I'll probably collapse this into one commit (to add begin and end and its specs in one go), check the generated documentation and note your contribution in the README if that's OK?

driskell commented 9 years ago

That's perfect, thanks. I haven't had much time to look at the rb_str_len thing but it should be fine as it is for now and maybe I'll get to it at some point, or someone else will I'm sure!

mudge commented 9 years ago

Frustratingly, looks like rb_str_length is only available in Ruby 2+; using the more compatible RSTRING_LEN seems to return the byte size again (presumably it is unaware of encoding).

We also need to add the magic comment # encoding: utf-8 to spec/re2/match_data_spec.rb for Ruby 1.9 (which will throw an exception due to the multibyte character otherwise).

Any other ideas re getting the string length more consistently? Perhaps we could call out to length rather than using a C function directly?

mudge commented 9 years ago

There's some advice in James Edward Gray's "Bytes and Characters in Ruby 1.8" which we might be able to use: namely, using "some string".scan(/./mu).size to get the number of characters in a string across Ruby versions.

As this is probably best done in pure Ruby, I'm tempted to implement begin and end in Ruby by returning the offset strings from the C API and then scanning them.

driskell commented 9 years ago

Sorry I meant looking at rb_str_sublen (not rb_str_len - that doesn't actually exist.)

At the moment, begin() and end() actually allocate a Ruby string that contains the characters from BEFORE the match. It then uses rb_str_length to do a multibyte-aware count.

Using rb_str_sublen would allow us to do a multibyte-aware count without allocating a Ruby object, saving that allocation since we're only returning an integer and it will garbage immediately.

Now, I was pretty sure both of those are available in Ruby 1.9, could be wrong though. Could you push the .rb encoding shebang and see what the Travis says?

driskell commented 9 years ago

Regarding the rb_str_sublen and rb_str_length - both should be reliable (assuming available in the version of Ruby.) Just the former would avoid the object allocation and immediate garbage.

mudge commented 9 years ago

Just pushed a passing build with begin and end in Ruby (but a new until_begin and until_end in C): the magic comment fixed 1.9 but Ruby Enterprise Edition and Ruby 1.8 both lack rb_str_length and RSTRING_LEN just returns the number of bytes (looks like the former was introduced in 1.9: http://rxr.whitequark.org/mri/ident?v=1.8.7-p374&i=rb_str_sublen).

Even if 1.8 had rb_str_sublen, I suspect we'd have to use the Regexp trick as string size was simply the number of bytes rather than characters prior to 1.9.

I'd rather keep compatibility with 1.8 if we can but we could drop it if need be, what do you think?

mudge commented 9 years ago

Actually, perhaps it wouldn't be a bad idea to use rb_str_sublen if available (much like the ENCODED_STR_NEW macro) and fall back to RSTRING_LEN for encoding-unaware Rubies. I could make an argument that the offset should be by bytes for Ruby 1.8 and by character for >= 1.9 (so that string[offset..-1] behaves consistently).

driskell commented 9 years ago

Yes I agree with that. For Ruby 1.8 if length is byte then begin() end() should return byte.

mudge commented 9 years ago

Almost there: looks like Rubinius lacks rb_str_sublen despite setting HAVE_RUBY_ENCODING_H. It might possible to put the missing implementation in if the rest of the string functions are in place. Otherwise, please feel free to take a look at the changes and let me know what you think.

driskell commented 9 years ago

LGTM, thanks! Good to see the rb_str_sublen working. Must have been something wrong/odd in my original working copy that broke it for me.

On Rubinius, not sure. I guess worst case it does what my original did which is allocate a ruby string, count it and garbage it. But if rb_str_sublen can be reconstituted that would work too. Maybe worth seeing if such code should be pushed to Rubinius too.

mudge commented 9 years ago

Finally! All passing. I've squashed the history into something more linear, updated the README, merged into master and pushed 0.7.0.

Thanks for being so patient.