qsbase / qs

Quick serialization of R objects
400 stars 19 forks source link

ALTREP serialization and deserialization methods are ignored #87

Closed david-cortes closed 9 months ago

david-cortes commented 10 months ago

If one defines an ALTREP'd object for which there are serialization and unserialization methods (R_set_altrep_Serialized_state_method and R_set_altrep_Unserialize_method), but no DATAPTR method, qs will throw an error:

Error in qsave(rr, "/home/david/temp.qs") : 
  cannot access data pointer for this ALTVEC object [class: altrepped_ptr, pkg: somepkg]

In contrast, saveRDS and readRDS will both correctly serialize and deserialize the object.

Furthermore, if the ALTREP'd object does define DATAPTR, then qs will save whatever is in that DATAPTR but will not call the serialization or de-serialization methods, resulting in qread potentially returning an invalid object.

This usage of ALTREP is particularly useful when defining external pointers to non-R objects that have custom serialization methods (e.g. statistical models where the result is a C++ object which is not directly translatable to R objects, but which can be interfaced through externalptr).

traversc commented 10 months ago

Hi, you are 100% correct. Last time I checked, there was no public C API to make generic ALTREP serialization work with qs. However, that may have changed and I will look into it again.

I am a happily surprised that someone has made use out of alt-rep serialization. If you could point me to a reproducible example, that would help.

david-cortes commented 10 months ago

Hi, you are 100% correct. Last time I checked, there was no public C API to make generic ALTREP serialization work with qs. However, that may have changed and I will look into it again.

I am a happily surprised that someone has made use out of alt-rep serialization. If you could point me to a reproducible example, that would help.

Here I've created a small example: https://github.com/david-cortes/altreppedpointer

To try it out, install with:

remotes::install_github("david-cortes/altreppedpointer")

Then generate an object that won't be qs-serializable as follows:

library(altreppedpointer)
obj <- get_custom_object()

library(qs)
fname <- file.path(tempdir(), "custom_obj.qs")
qsave(obj, fname)
obj2 <- qread(fname)

And compare with:

library(altreppedpointer)
obj0 <- get_custom_object()

fname2 <- file.path(tempdir(), "custom_obj.Rds")
saveRDS(obj0, fname2)
obj3 <- readRDS(fname2)

If you check obj3 (from readRDS), it will have a non-null externalptr object, and will print a message when loading it, while obj2 will have null pointer and won't print any message when loading it.

If you modify the source code to not have an R_set_altlist_Elt_method call, it will lead to an error in qsave too.

traversc commented 10 months ago

Thanks for the example! Looking into this again:

Ideally, I'd like to get access to the serialized state directly but as far as I can tell there's now way to access it.

In R internally, that's done by calling SEXP state = ALTREP_SERIALIZED_STATE(s); which is not an exported function.

One other potential approach is if there is someway to tell if the serialize and unserialize ALTREP functions are defined for an ALTREP class. Then I could just call R_Serialize and treat the result like a black box data blob. However, I don't think there's a way to do this either.

Let me know your thoughts.

david-cortes commented 10 months ago

Thanks for the example! Looking into this again:

Ideally, I'd like to get access to the serialized state directly but as far as I can tell there's now way to access it.

In R internally, that's done by calling SEXP state = ALTREP_SERIALIZED_STATE(s); which is not an exported function.

One other potential approach is if there is someway to tell if the serialize and unserialize ALTREP functions are defined for an ALTREP class. Then I could just call R_Serialize and treat the result like a black box data blob. However, I don't think there's a way to do this either.

Let me know your thoughts.

I haven't looked very deeply into it, but following up on that ALTREP_SERIALIZED_STATE, it seems to be a long chain of macros which end up calling STDVEC_DATAPTR, and that seems to be a non-hidden function. Could that perhaps be used? No idea how it works though.

traversc commented 10 months ago

I really hate going through those macros, but the hint that it calls STDVEC_DATAPTR pushed me in the right direction.

#include <Rcpp.h>
#include "R_ext/Altrep.h"
using namespace Rcpp;

// [[Rcpp::export]]
RawVector get_serialized_state(SEXP x) {
  SEXP ARclass = ALTREP_CLASS(x);
  char * table_ptr = static_cast<char*>( STDVEC_DATAPTR(ARclass) );
  void * serialize_state_fn_void = 
    static_cast<void*>( table_ptr + 2*sizeof(void*) );
  R_altrep_Serialized_state_method_t * serialize_state_fn = 
    static_cast<R_altrep_Serialized_state_method_t*>(serialize_state_fn_void);
  return (*serialize_state_fn)(x);
}

/*** R
obj <- get_custom_object()
get_serialized_state(obj)
*/

So we can get serialized state now. It's pretty hack-y, but technically only uses exported API. CRAN might still complain, but I'll give it a shot!

traversc commented 10 months ago

I've worked on this during the weekend and have run into a road block.

Getting the serialized_state function from the function table and re-creating the state during unserialization works fine, but converting the state back to the original object does not seem to work. It looks like this:

Serialization: Altrep_object -> ALTREP_class/function table -> serialize -> Altrep_info (pkg name, class name) + serialized_state -> write to disk

Unserialization:

Read from disk -> Altrep_info + serialized_state -?> ALTREP_class/lookup table -> unserialize -> Altrep_object

The issue is during unserialization you don't have access to the ALTREP_class or function table (both are dynamically registered), so you can't find the proper unserialization function. Everything in altrep.c is statically linked, so I see no way of getting the required pointers.

david-cortes commented 10 months ago

I've worked on this during the weekend and have run into a road block.

Getting the serialized_state function from the function table and re-creating the state during unserialization works fine, but converting the state back to the original object does not seem to work. It looks like this:

Serialization: Altrep_object -> ALTREP_class/function table -> serialize -> Altrep_info (pkg name, class name) + serialized_state -> write to disk

Unserialization:

Read from disk -> Altrep_info + serialized_state -?> ALTREP_class/lookup table -> unserialize -> Altrep_object

The issue is during unserialization you don't have access to the ALTREP_class or function table (both are dynamically registered), so you can't find the proper unserialization function. Everything in altrep.c is statically linked, so I see no way of getting the required pointers.

I haven't looked into it, but aren't the classes and methods tied to a given package namespace? When registering them through a call to e.g. R_make_altraw_class and similar, the second argument is a namespace string, which I think indicates the library (and I guess shared object / dll ?) in which the methods should be found, and loading/attaching (not sure which) these namespaces should automatically trigger the calls to R_set_altrep_Unserialize_method(<the class>, <the function>), which I guess should make them available in some method table, assuming that one keeps track of the package name during serialization.

david-cortes commented 10 months ago

Actually it looks like it might not be so easy as some functions have static: https://github.com/wch/r-source/blob/0ec3087abd2b1aa1b83966306bf8c4d6762ef194/src/main/altrep.c#L279

traversc commented 10 months ago

Once you register methods from inside your package, they are stored to a static linked object in altrep.c.

From outside the package and from outside R internals, the only link to those methods I can find is from an existing Altrep object (which I wouldn't have during unserialization).

Anyway, it's possible I'm missing something and I'd appreciate if you could take a second look! Below is a simplified version of serialization:

#include <Rcpp.h>
#include "R_ext/Altrep.h"
using namespace Rcpp;

struct altrep_serialize_methods_t {
  R_altrep_UnserializeEX_method_t UnserializeEX;
  R_altrep_Unserialize_method_t Unserialize;
  R_altrep_Serialized_state_method_t Serialized_state;
};

#define ALTREP_CLASS_SERIALIZED_CLASS(x) ATTRIB(x) // altrep.c, line 37

// [[Rcpp::export]]
SEXP altrep_serialize_state(SEXP x) {
  SEXP ARclass = ALTREP_CLASS(x); // dynamically registered, not consistent between R instances
  altrep_serialize_methods_t * table = static_cast<altrep_serialize_methods_t*>(STDVEC_DATAPTR(ARclass));
  return (table->Serialized_state)(x);
}

// [[Rcpp::export]]
SEXP altrep_serialize_class(SEXP x) {
  SEXP ARclass = ALTREP_CLASS(x);
  SEXP info = ALTREP_CLASS_SERIALIZED_CLASS(ARclass);
  return info;
}

// [[Rcpp::export]]
SEXP altrep_unserialize(SEXP state, SEXP info) {
  // SEXP ARclass = ALTREP_UNSERIALIZE_CLASS(info); // No method available
  // altrep_serialize_methods_t * table = static_cast<altrep_serialize_methods_t*>(STDVEC_DATAPTR(ARclass));
  // return (table->Unserialize)(ARclass, state);
  return R_NilValue;
}

/*** R
library(altreppedpointer)
obj <- get_custom_object()
state <- altrep_serialize_state(obj)
info <- altrep_serialize_class(obj)
recovered_obj <- altrep_unserialize(state, info);
*/
traversc commented 10 months ago

I thought of a workaround, which would be to register serialization methods with qs on load. Something like this:

.onLoad <- function(libname, pkgname) {
  # register ALTREP serialization methods

  # altreppedpointer example
  if(requireNamespace("altreppedpointer", quietly=TRUE)) {
    obj <- altreppedpointer::get_custom_object()
    qs_register_altrep_methods(obj)
  }

  # compact int range
  qs_register_altrep_methods(1:2)

  # etc ...
}

But of course that would require knowing about ALTREP classes ahead of time, and I don't know of any real packages that make good use of ALTREP.

david-cortes commented 10 months ago

I thought of a workaround, which would be to register serialization methods with qs on load. Something like this:

.onLoad <- function(libname, pkgname) {
  # register ALTREP serialization methods

  # altreppedpointer example
  if(requireNamespace("altreppedpointer", quietly=TRUE)) {
    obj <- altreppedpointer::get_custom_object()
    qs_register_altrep_methods(obj)
  }

  # compact int range
  qs_register_altrep_methods(1:2)

  # etc ...
}

But of course that would require knowing about ALTREP classes ahead of time, and I don't know of any real packages that make good use of ALTREP.

But doesn't that still lead to the same problem of not being able to get SEXP ARclass in your example? Otherwise, it should just be a matter of extracting and saving the package namespace from STDVEC_DATAPTR and then loading that namespace upon deserialization. Or am I missing something?

david-cortes commented 10 months ago

@kalibera Apologies if this is not the right medium, but it looks like you have been involved quite a lot in this mechanism for ALTREP serialization in R. Would you be able to provide some pointers on how to de-serialize an ALTREP'd class in a package for serialization of R objects?

traversc commented 10 months ago

But doesn't that still lead to the same problem of not being able to get SEXP ARclass in your example? Otherwise, it should just be a matter of extracting and saving the package namespace from STDVEC_DATAPTR and then loading that namespace upon deserialization. Or am I missing something?

My understanding:

I can get "ARclass" (and serialization methods) from an existing ALTREP object. I can't get ARclass from just knowing the namespace and classname. I also can't save ARclass directly as it's a dynamic internal object.

During deserialization, since I don't have an existing ALTREP object, I can't get the serialization methods. Looking up ARclass from namespace and classname is not part of the public API (a chicken and egg problem).

david-cortes commented 10 months ago

Since it doesn't seem to be possible to get the ALTREP class from just the package name, how about defaulting to base::serialize and base::unserialize for ALTREP classes that have custom serialization methods?

traversc commented 10 months ago

Unfortunately there are cases where base::serialize on ALTREP classes that have serialization methods would be significantly slower (arrow and vroom).

Without any support from R core, the only solution is some sort of whitelist / onLoad (the idea above) or registration by the user. E.g.:

Serialization:

library(qs)
library(altreppedpointer)
big_obj<- altreppedpointer::get_custom_object( object_size = 100000 )
qsave(big_obj, file = "myfile.qs")

De-serialization (new R instance)

library(qs)
library(altreppedpointer)
default_obj <- altreppedpointer::get_custom_object()
qs_register_altrep_methods(default_obj)
big_obj <- qread(file = "myfile.qs")
david-cortes commented 10 months ago

I'm wondering how this sort of registration could work in practice.

  1. If done by adding a .onLoad function to qs where packages are individually registered:
    • It would miss cases in which the package that produces the objects to serialize is installed in an R session in which qs has already been loaded.
    • It might create issues if package updates then change the class names or similar.
    • It would require adding lots of packages under 'Suggests' along with tests, which could potentially complicate updates for those packages in the future.
  2. If done by adding .onLoad to the packages that produce these objects:
    • There's again an issue in which installing qs in a running R session after the package has been loaded would lead to missing the registration.
    • Some package authors might not be aware of the need to register classes for qs, or might not want to have it as a dependency in the DESCRIPTION file.

It feels like the only way this registration would avoid the problem of non-installed packages would be by adding the registration call in the serialization method, but that sort of mechanism feels like a very awkward workaround and much harder to implement correctly.

Do you think perhaps that packages could define some global variable like .QS_USE_BASE_SERILIZATION_<class> or similar, so that then qs would look for such a variable in the package namespace, and revert to base::serialize if it finds it; or something along those lines?

traversc commented 10 months ago

I'd like to avoid base::serialize, which after layering qs on top would be slower than just using saveRDS. I'm not sure I agree with all of the concerns you've raised. See below:

It might create issues if package updates then change the class names or similar.

It would require adding lots of packages under 'Suggests' along with tests, which could potentially complicate updates for those packages in the future.

Yes, in general, package authors really shouldn't be changing altrep serialization format. Regardless of using base::serialize or qs a change could crash R or worse produce wrong results. Changing to class name or format would break serialization, but that's also the case for base::serialize.

Some package authors might not be aware of the need to register classes for qs, or might not want to have it as a dependency in the DESCRIPTION file.

It could also be done by the user. If left entirely up to the user instead of package authors, would that also address the concern of having extra "Suggests" and tests? Regardless, I don't know of any packages that benefit from ALTREP serialization, so right now there's no concern of extra Suggests.

david-cortes commented 10 months ago

The thing with leaving it to users is that users might not be aware that some object will not serialize correctly with qs, especially if the object is some compounded class where the altrep class in question is only interfaced through a wrapper from a different package, such as machine learning frameworks like mlr3 that might wrap a modeling package whose objects are altrepped.

In terms of particular packages, you might see xgboost's v2 release using the same serialization logic as the earlier example: https://github.com/dmlc/xgboost/issues/9810

As for current packages, there's this one package from myself for example: https://cran.r-project.org/web/packages/outliertree/index.html

But it's not a popular package and I'm not aware of any other package on CRAN currently having a non-qs-compatible serialization mechanism.


If you think that falling back to base::serialize would not be a good idea, how about issuing a warning to the user when qs tries to serialize an altrepped object that defines its own serialization methods? Is that possible to know from just a quick look at STDVEC_DATAPTR?

traversc commented 9 months ago

I asked Luke Tierney by email on accessing ALTREP serialization methods, and he pretty firmly said don't. Given that, I've used base::serialize like you suggested. Id appreciate if you could test out the latest commit.

library(qs)
library(altreppedpointer)
obj <- get_custom_object() # qs::get_altrep_class_info(obj)
qs::register_altrep_class("altrepped_pointer_class", "altreppedpointer")
qsave(obj, file = "/tmp/temp.qs")

# new instance
library(qs)
library(altreppedpointer)
qread("/tmp/temp.qs")

Registration is needed for serialization and only requires the class and package names. Since I'm using base::unserialize, deserialization does not need anything special.

I'd be happy to add xgboost (I use it all the time) and outliertree classes to qs onLoad. Since I'm only registering strings (by adding them to a static unordered_set) there's no issues with adding extra Suggests or issues with package load order.

david-cortes commented 9 months ago

Thanks for looking into this.

Can confirm that serialization for these objects works as expected on the current master branch.