paleolimbot / narrow

An R interface to the 'Apache Arrow' C API
https://paleolimbot.github.io/narrow/
Other
30 stars 3 forks source link

Could the C API be exported ? #5

Open eddelbuettel opened 2 years ago

eddelbuettel commented 2 years ago

narrow already automates creating the needed instantiations via src/init.c. That registers the object code.

How do you think about going one step further and provide a file, say, narrowAPI.h with a bunch of declarations such as (untested, adapting from prior work)

SEXP attribute_hidden narrow_c_array_info(SEXP x) {
  static SEXP(*fun)(SEXP) = (SEXP(*)(SEXP)) R_GetCCallable("narrow","narrow_c_array_info");
  return fun(x);
}

and so on.

A number of packages do this (e.g. nloptr, I have two small packages exporting some of R's own (unexported) API functions (via copies in the package), also added it in small scope other packages). The canonical example (also in WRE) is always lme4 and Matrix.

Would you have any interest in supporting such use?

paleolimbot commented 2 years ago

There is already one that provides the utilities to get a struct ArrowSchema, struct ArrowArray, and structArrowArrayStream. For example:

https://github.com/paleolimbot/narrow/blob/8ff21dd002107eb092ff6068a0e63254e4486e65/inst/include/narrow.h#L102-L117

Used in an example: https://paleolimbot.github.io/geoarrow/articles/development.html#points

Exposing narrow_c_array_info() seems a bit odd to me because it returns a list which is harder to deal with than the C structs.

Which bits are you interested in having exported?

eddelbuettel commented 2 years ago

Thanks for the very prompt reply.

The C API of Arrow is well understood and widely used. It seemingly motivated this package too. Being able to handle Arrow data structures without having to bring in the (full) Arrow infrastructure can be beneficial. You argue that same point well in your draft vignette.

Beyond that C API, narrow contains a number of useful helper functions at the C level that are already registered in init.c meaning R 'knows them' when importing narrow. There is zero cost in providing an export header. It would permit other R package access without having to go through the R interface. No more, no less.

The alternative is to copy the code out, or "vendor" it. Which is what I just did for a proof of concept where I copied about fifty lines (to allocate array and schema objects and set up external pointer objects and finalizers). There are multiple possible views about whether vendoring is better or worse; I didn't really come here to debate this. I simply came to see whether you'd be amenable to (also) support (C level) use of the package.

Also, the interest is not in narrow_c_array_info(). That happens to be the second function listed in init.c, and a 'cheaper' one with one arg than the one that preceeds it in the pole position so I used it for the illustration above.

This really is all about a header with would affect only packages using LinkingTo: as the use of C level routines from other packages can be pretty useful (especially given generally stable interfaces as we have here).

paleolimbot commented 2 years ago

I see!

For the forseeable future, if there are C bits you would like to use, you will have to copy them. This is because I need the freedom to change them without breaking others' code. While the Arrow C API is stable, I wrote the R bits quickly and would like to rewrite them.

With respect to allocating structures and attaching finalizers, the preferred approach would be to use narrow:::narrow_allocate_schema() (which is currently unexported but should be exported). This allocates a struct ArrowSchema and attaches a finalizer (you can then pass the pointed-to object into your own compiled code and set the release callback as appropriate). It sounds like this is your main use-case?

eddelbuettel commented 2 years ago

Yeah, the 'once exported it's frozen part' hit me to once I typed that. I posted nloptr as an example, but it clearly is more API stable in its re-exporting of NLopt.

Yes, and what you describe is/was my use case, and was the 'first version' I wrote (with a few lines in R and a narrower call 'down'). I now simply wrote an alternate that does more at the C(++) level for which these routines came in handy. Especially given how a number of your R routines are (at least currently) really just wrappers to the .Call().

And agreed that :: would better for narrow_allocate_schema().

paleolimbot commented 2 years ago

Done! A few of those might change before a CRAN release but should be stable afterward. (And probably not narrow_allocate_schema()).

eddelbuettel commented 2 years ago

So for kicks I just set up a quick narrowABI.h to export some first five narrow_c_* functions, registered them as callable in src/init.c and can use that from a client package. Cleaner than copying code over.

eddelbuettel commented 2 years ago

That was completed recently. No other changes were made to to narrow but to

Now other packages can use the C level functions. This helpful for other packages aiming to test Arrow data structures. You can see the difference by comparing from the fork back here (there are trivial changes to .Rbuildignore, .gitignore and DESCRIPTION you can consider as 'local').

Would you entertain a PR on the changes to src/init.c and inst/include/narrowAPI.h ?

Thanks again for putting narrow together. It has been quite helpful.

paleolimbot commented 2 years ago

I appreciate your enthusiasm for the C functions that interface with R objects. I'm not necessarily against this proposal, but narrow is too early in the development process to entertain any promise of a stable ABI. Perhaps after a few initial versions of this package have been released on CRAN would be a good time.

In the meantime, I believe all the wrappers that you have implemented in narrowABI.h could be implemented via an Rcpp function call to R, since they all have corresponding R functions. If users complain about overhead, perhaps that is the right time to consider specific use cases and if exporting a specific function at the C level would be a good fit for narrow (or whether it would be a good fit for an extension package).

eddelbuettel commented 2 years ago

It may be that I explained myself poorly here. The mechanism I am referring to is the one described in Section 5.4.3 ("Linking to native routines in other packages") of Writing R Extensions (link to CRAN and link to same page rendered by RStudio). This is not a call from R, but directly at the C level. It is useful in testing packages with Arrow data structures using the C level interface.

So I might just fork the package, keeping of course all credits to your work as well as keeping the original license -- but adapting it as suggested here to be able to deploy this in package tests. I have that already set up (currently in a branch of another package) using Additional_repositories: and Suggests; it will be even easier if I just upload the fork to CRAN. All this will not constrain you in any way. I understand your focus may be on other uses, geospatial or other.

But the package is quite useful as it is, so I may just proceed and actually make even more use of it.

paleolimbot commented 2 years ago

I think I understand the mechanism used by narrowABI.h - I mentioned the Rcpp::Function mechanism because it would be a way to access the functions you are looking for from compiled code. I don't have all the details about your use-case but it seems like you're running tests in C++? If that's the case, I imagine calling the functions you are looking for using Rcpp::Function (or a wrapper around it) would have minimal overhead and let you create the test fixtures you are after from C++.

You're also welcome to copy the bits you find useful and expose an ABI from a different package, although the as_narrow_*() functions and the narrow_schema/array/array_data/array_stream S3 classes I think will serve the R users best living in one package.

Next week I am planning to spend a few days on 'narrow' to incorporate some of the things I've learned about Arrow since I wrote it. I think that the use-case you are working on and 'narrow' will both be better for being patient and making sure the details are tight before exposing the lower-level details as an API for others to use.

eddelbuettel commented 2 years ago

(You keep bringing up Rcpp, which is very kind of you, but it really has nothing to do with Rcpp. This inquiry is about dealing with Arrow buffers (at the C level) in another R package. For which the nice code you already wrote comes in handy so I am currently using it.)

I can only politely reiterate my offer to add the file narrowABI.h exporting only what you already have (and as a C-level export clearly more 'buyer beware' than as an R function, we can stick as bold an 'unsupported' on there too as we deem appropriate), along with the instantiation block in src/init.c. Apart from the few microseconds at package load to instantiate via the R_RegisterCCallable() calls it changes nothing in your package, but adds functionality. I find it useful, other people may too. I fully understand you may wish to limit your maintenance burden but it seems to me it only completes existing code and functionality in the package.