Closed mudge closed 9 years ago
There's a change missing at around 1050: https://github.com/mudge/re2/pull/21/files#diff-e23c54ff6a986f6e4e97f67de7344d98R1122
Basically, the text string that gets stored in the structure needs to be the same one that gets given to RE2. RE2 doesn't return new strings and just returns offsets and lengths within the string passed to it.
With this patch, as is, RE2 receives a different string to the one we store - thus the offsets will include an arbitrary offset equivalent to the distance in memory between the two strings.
Ah, of course; great spot and apologies for missing that.
I've amended the commit so this looks good to go. I'll probably collapse this into one commit (to add begin
and end
and its specs in one go), check the generated documentation and note your contribution in the README if that's OK?
That's perfect, thanks. I haven't had much time to look at the rb_str_len thing but it should be fine as it is for now and maybe I'll get to it at some point, or someone else will I'm sure!
Frustratingly, looks like rb_str_length
is only available in Ruby 2+; using the more compatible RSTRING_LEN
seems to return the byte size again (presumably it is unaware of encoding).
We also need to add the magic comment # encoding: utf-8
to spec/re2/match_data_spec.rb
for Ruby 1.9 (which will throw an exception due to the multibyte character otherwise).
Any other ideas re getting the string length more consistently? Perhaps we could call out to length
rather than using a C function directly?
There's some advice in James Edward Gray's "Bytes and Characters in Ruby 1.8" which we might be able to use: namely, using "some string".scan(/./mu).size
to get the number of characters in a string across Ruby versions.
As this is probably best done in pure Ruby, I'm tempted to implement begin
and end
in Ruby by returning the offset strings from the C API and then scan
ning them.
Sorry I meant looking at rb_str_sublen
(not rb_str_len
- that doesn't actually exist.)
At the moment, begin() and end() actually allocate a Ruby string that contains the characters from BEFORE the match. It then uses rb_str_length
to do a multibyte-aware count.
Using rb_str_sublen
would allow us to do a multibyte-aware count without allocating a Ruby object, saving that allocation since we're only returning an integer and it will garbage immediately.
Now, I was pretty sure both of those are available in Ruby 1.9, could be wrong though. Could you push the .rb encoding shebang and see what the Travis says?
Regarding the rb_str_sublen
and rb_str_length
- both should be reliable (assuming available in the version of Ruby.) Just the former would avoid the object allocation and immediate garbage.
Just pushed a passing build with begin
and end
in Ruby (but a new until_begin
and until_end
in C): the magic comment fixed 1.9 but Ruby Enterprise Edition and Ruby 1.8 both lack rb_str_length
and RSTRING_LEN
just returns the number of bytes (looks like the former was introduced in 1.9: http://rxr.whitequark.org/mri/ident?v=1.8.7-p374&i=rb_str_sublen).
Even if 1.8 had rb_str_sublen
, I suspect we'd have to use the Regexp trick as string size was simply the number of bytes rather than characters prior to 1.9.
I'd rather keep compatibility with 1.8 if we can but we could drop it if need be, what do you think?
Actually, perhaps it wouldn't be a bad idea to use rb_str_sublen
if available (much like the ENCODED_STR_NEW
macro) and fall back to RSTRING_LEN
for encoding-unaware Rubies. I could make an argument that the offset should be by bytes for Ruby 1.8 and by character for >= 1.9 (so that string[offset..-1]
behaves consistently).
Yes I agree with that. For Ruby 1.8 if length is byte then begin() end() should return byte.
Almost there: looks like Rubinius lacks rb_str_sublen
despite setting HAVE_RUBY_ENCODING_H
. It might possible to put the missing implementation in if the rest of the string functions are in place. Otherwise, please feel free to take a look at the changes and let me know what you think.
LGTM, thanks! Good to see the rb_str_sublen working. Must have been something wrong/odd in my original working copy that broke it for me.
On Rubinius, not sure. I guess worst case it does what my original did which is allocate a ruby string, count it and garbage it. But if rb_str_sublen can be reconstituted that would work too. Maybe worth seeing if such code should be pushed to Rubinius too.
Finally! All passing. I've squashed the history into something more linear, updated the README, merged into master and pushed 0.7.0.
Thanks for being so patient.
An updated version of @driskell's #21 (rebased against 0.6.1).
Unfortunately, rebasing seems to have broken the
begin
andend
functionality: both are correct relative to each other but seem to contain random junk which is throwing off their numbers. It smells like a pointer arithmetic problem, any chance you could take a look, @driskell?