markvanderloo / stringdist

String distance functions for R
318 stars 36 forks source link

Exposing the pkg C API #67

Closed ChrisMuir closed 6 years ago

ChrisMuir commented 6 years ago

Have you given any thought to exposing some of the internal functions that do the heavy lifting? I ask because I maintain a package that uses stringdistmatrix() like so:

# Create a stringdistmatrix for every element of initial_clust.
distmatrices <- lapply(initial_clust, function(x) {
  as_matrix(stringdistmatrix(x, weight = weight, ...))
})

Obj initial_clust is a list of char vectors of various lengths, and when input is large this code chunk becomes a bottleneck. Just for testing purposes, I tried swapping in stringdist:::lower_tri() and saw a 23% speed boost, and I don't need to push all of the inputs through all of the validations at the top of stringdistmatrix() every time it's called.

I can't use ::: to import lower_tri() because CRAN will complain. If it were just a single C function I could simply copy the function (with credit to the authors obv), but I'm currently giving users the option to choose any of the ten methods via ellipses.

I'd love to continue importing stringdist and be able to use lower_tri().

ChrisMuir commented 6 years ago

An alternative (and maybe better solution) could be to create header files in inst/include that provide access to the stringdist C functions, and would allow for direct integration by another package using LinkingTo in the DESCRIPTION file of the other package. This would give package authors access to the stringdist C functions without exporting any additional R functions. Taking this approach, I could use the C function R_lower_tri() directly, and write my own wrapper for use in my pkg functions.

Check out this SO post, and the answers by Romain and Dirk for more info.

And per Dirk's answer in that thread, here's an example in the wild, rcppxts linking to and using C functions from the xts pkg.

If you're open to this idea, I'd be happy to work on implementation. Let me know if there's any interest. Thanks!

markvanderloo commented 6 years ago

Yeah exposing the C API is definitely on my list.

ChrisMuir commented 6 years ago

Cool! I can start working on implenenting changes next week, and can submit a PR. Let me know if you're okay with that, or if you have any thoughts/issues/concerns. Thanks Mark!

markvanderloo commented 6 years ago

That would be great! Couple of things/ideas/pointers

Using C99 and avoiding C++ may seem a bit spartan to some but my philosophy is that we're writing a library, and not an application. For a library, stability, reliability and portability are the most important features, so investing in clean and simple code with no dependencies is worth the effort.

I also have a doubt:

I assume that many users of linkingto will be Rcpp users so we could gear it towards that kind of use I suppose (while avoiding C++ ourselves)

ChrisMuir commented 6 years ago

Hi Mark,

Thanks for the follow up and feedback. I actually worked on this some over the weekend, and last night pushed an initial first pass at making this work, check out the diffs here. I also created a new branch in my package refinr that is LinkingTo stringdist and is using two of your C functions. Couple of things to mention (and keep in mind, this was just meant to be a rough proof concept to get the ball rolling):

To briefly map out how refinr is using the stringdist C API, see the last two functions in my utils.R file. Func lower_tri() is calling sd_lower_tri(). Func get_list_lengths() is calling a C function I added to refinr in file refinr/src/stringdist.c, which is calling sd_lengths(). Both of the sd_ functions are pointing back to your C functions (the R_ versions) via file stringdist/pkg/inst/include/stringdist_api.h. I'm not using get_list_lengths() in my pkg, but added the function just to test, and it's working:

refinr:::get_list_lengths(list(c(1, 2, 3), c("cats", "dogs")))
#> [1] 3 2

I worked on these edits on a PC, and everything works so far. I plan to test it all on a Mac today. I also set up my forked stringdist repo on Travis, and the c_api branch passed, so that's a good sign.

ChrisMuir commented 6 years ago

Quick update, just tested on a Mac, installing the c_api branch of both ChrisMuir/stringdist/pkg and refinr works fine, and the edited functions in refinr work as expected.

Also, regarding the whole Imports issue for refinr, if I include @import stringdist in my pkg man docs, I can then remove stringdist from the Imports field of the DESCRIPTION file....here's the commit with those edits. I'm not really sure if it's best to set it up this way, or to leave stringdist on the Imports list.

ChrisMuir commented 6 years ago

I think we can close this issue now.....C API has been built out and is exposed as of version 0.9.5.1.