Closed DavisVaughan closed 1 year ago
I think Lionel and Kevin convinced me that R doesn't actually protect the global string pool (because that could lead to strings never being released), so this won't work. It does protect symbols, such as those created by `Rf_install()`.
Right now, `r_string` has a private data member called `sexp data_`, which holds the CHARSXP it wraps: https://github.com/r-lib/cpp11/blob/2ddc12f6ab268542c88f4aa24a500682ed260b72/inst/include/cpp11/r_string.hpp#L50

I propose that we consider changing that from `sexp` to `SEXP`. The benefit of `sexp` is that it provides automatic protection of that object, but it doesn't need it! CHARSXPs stored in R's global string pool are automatically protected for the life of the R session, so protecting them again is extraneous and, more importantly, pretty expensive.

There are two obvious and extremely common cases that would have significant performance improvements if we did this:
- `cpp11::strings` through `x[i]`, which creates and returns an `r_string` (in this particular case, the `data_` is also protected by `x`)
- `cpp11::writable::strings` through `x[i] =`, for example, when assigning a `std::string`, which has to be converted to `r_string` first.

The change would involve moving from `sexp` to `SEXP` here: https://github.com/r-lib/cpp11/blob/2ddc12f6ab268542c88f4aa24a500682ed260b72/inst/include/cpp11/r_string.hpp#L50

And then making the `==` comparison operators compare `data_` rather than `data_.data()` here: https://github.com/r-lib/cpp11/blob/2ddc12f6ab268542c88f4aa24a500682ed260b72/inst/include/cpp11/r_string.hpp#L35-L37

These changes in combination with #299 would have made cpp11 fast enough that I wouldn't have needed to drop down to the R API here: https://github.com/EmilHvitfeldt/cpp11ngram/pull/1/files#diff-aa381cabaf80f28b6f14233769693f3faf2d817395969f33e64e014284a51130R20-R33
Performance examples follow below
For element extraction, let's look at using the raw R API to see how fast it could be:
Then with cpp11:
This is slow because `x[i]` does `STRING_ELT()` to get the `SEXP`, but then casts that to `r_string`, which wraps the `SEXP` into a `sexp`, forcing it to be protected.

Now that example again, but with `sexp` swapped for `SEXP`, avoiding the extra protection:

Just as fast as the raw R API!
Now let's look at element assignment. We are going to assign the string `"foo"` 1 million times to an output vector.

With the raw R API:
Now with a naive cpp11 approach:
Wow, painfully slow. Now, the biggest issue here is actually that `unwind_protect()` is running twice at each loop iteration:

- `unwind_protect()`s through `safe[Rf_mkCharLenCE]` when turning `x` into a `CHARSXP` and then into an `r_string`: https://github.com/r-lib/cpp11/blob/2ddc12f6ab268542c88f4aa24a500682ed260b72/inst/include/cpp11/r_string.hpp#L19-L20
- `unwind_protect()`s through the proxy `=` assignment when calling `SET_STRING_ELT()`: https://github.com/r-lib/cpp11/blob/2ddc12f6ab268542c88f4aa24a500682ed260b72/inst/include/cpp11/strings.hpp#L55-L60

So let's assume we know we can wrap the loop in `unwind_protect()` to limit the number of unwind-protects, AND let's assume we have #299 installed so that technique actually works.

Better, but still 4x slower than base R.
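The effect of hoisting the guard out of the loop can be sketched with another toy model. Nothing here is cpp11 API: `toy_guard`, `toy_make_string`, and `toy_set_element` are hypothetical stand-ins for `unwind_protect()`, the `std::string` to `r_string` conversion, and the proxy assignment. The model simply assumes that a guard entered while another guard is already active is a cheap pass-through; whether and how cpp11 achieves that is outside this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

static std::size_t g_guard_setups = 0;  // counts "expensive" outermost setups
static bool g_inside_guard = false;     // true while a guard is active

// Resets the active flag even if the guarded body throws.
struct GuardScope {
  GuardScope() { g_inside_guard = true; }
  ~GuardScope() { g_inside_guard = false; }
};

// Toy guard: only the outermost call pays the setup cost (assumption).
template <typename F>
void toy_guard(F&& body) {
  if (g_inside_guard) {
    body();  // nested: assumed to be a cheap pass-through
    return;
  }
  ++g_guard_setups;
  GuardScope scope;
  body();
}

// Stand-in for converting a std::string into a string element.
void toy_make_string(const std::string& in, std::string& out) {
  toy_guard([&] { out = in; });
}

// Stand-in for the proxy assignment x[i] = value.
void toy_set_element(std::vector<std::string>& v, std::size_t i,
                     const std::string& s) {
  assert(i < v.size());
  toy_guard([&] { v[i] = s; });
}

// Naive loop: each iteration enters the guard twice.
std::size_t setups_naive(std::size_t n) {
  std::vector<std::string> out(n);
  const std::size_t before = g_guard_setups;
  for (std::size_t i = 0; i < n; ++i) {
    std::string tmp;
    toy_make_string("foo", tmp);
    toy_set_element(out, i, tmp);
  }
  return g_guard_setups - before;
}

// Hoisted loop: one outer guard, inner guards become pass-throughs.
std::size_t setups_hoisted(std::size_t n) {
  std::vector<std::string> out(n);
  const std::size_t before = g_guard_setups;
  toy_guard([&] {
    for (std::size_t i = 0; i < n; ++i) {
      std::string tmp;
      toy_make_string("foo", tmp);
      toy_set_element(out, i, tmp);
    }
  });
  return g_guard_setups - before;
}
```

Under these assumptions the naive loop performs two guard setups per element and the hoisted loop performs one in total, which is the reduction the wrapped-loop version above is after.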
Again, the problem here is that `x` goes from `std::string` to CHARSXP to `r_string`, and when it gets put in the `r_string` it gets wrapped in `sexp` and is "extra" protected.

Now that `unwind_protect()` example again, but with `sexp` swapped for `SEXP`, avoiding the extra protection:

Again, essentially as fast as the R API!