kalibera / rchk

102 stars 10 forks source link

Values protected by assignment #28

Closed lionel- closed 3 years ago

lionel- commented 3 years ago

Could rchk consider value protected after being assigned to a protected container?

PROTECT(shelter);

SEXP value = Rf_cons(R_NilValue, R_NilValue)
SET_VECTOR_ELT(shelter, 0, value);

This pattern of create-and-assign helps avoids distracting PROTECT/UNPROTECT boilerplate when initialising objects.

kalibera commented 3 years ago

Yes, but, this is supported, this should already be working. It may be that the LLVM IR in some cases looks differently from what the tool expects (or what it looked several LLVM versions ago). Would you have an example where it doesn't work?

lionel- commented 3 years ago

hmm it might be that this is supported:

assign(x, new());

But this isn't:

SEXP value = new();
assign(x, value);
do(value);

I'll need to set up rchk locally to confirm (I can't today but will try tomorrow).

kalibera commented 3 years ago

That it supported as well, at least in some situations (see freshvars.cpp:560). It is only a heuristic, value is assumed protected after being passed to a setter function linking it to a protected object, but it is not tracked if/when value becomes unprotected.

lionel- commented 3 years ago

Here is a reprex:

void assign(SEXP x, int i, SEXP y) {
  SET_VECTOR_ELT(x, 0, y);
}

SEXP foo(SEXP x) {
  SEXP value = Rf_allocVector(INTSXP, 1);

  assign(x, 0, value);
  Rf_allocVector(INTSXP, 1);

  return value;
}

This causes the error

Function foo
  [UP] unprotected variable value while calling allocating function Rf_allocVector

I find it unintuitive that allocation is detected deep in the stack but not protection. I guess this is about disallowing wrapper APIs again.

This also prevents patterns like:

struct my_obj {
  int data;
  SEXP obj;
};

void obj_set(struct my_obj* x, SEXP value) {
  SET_VECTOR_ELT(x->obj, 0, value);
}

SEXP foo(struct my_obj* x) {
  SEXP value = Rf_allocVector(INTSXP, 1);
  obj_set(x, value);

  Rf_allocVector(INTSXP, 1);

  return value;
}

I'd really appreciate if rchk allowed to generalise the protection-by-assignment pattern. This pattern is of great help when initialising large objects because it reduces boilerplate (fewer PROTECT() calls) and doesn't require bookkeeping of the protection count, making it easier to write and modify this sort of code.

kalibera commented 3 years ago

Ok, now I finally understand that assign was to mean a custom function doing some kind of a custom memory protection. This is not supported. What is supported is protecting locally by assigning (setting) a value to a protected object via R API, such as SET_VECTOR_ELT itself, to reduce performance overheads and excessive boilerplate, but the support in rchk is only a heuristic, it doesn't model the protection dependency in any way, it just assumes the value becomes protected forever.

Theoretically it might be nice if the tool could model the heap to the level of detail which could reason about reachability regardless of the API used, while still providing useful bug reports, but I don't think this is feasible. The per-function modelling and using specific protection API is a key part of the design, it is not anything that could be changed, it is something that I had considered when implementing the tool. Only protection functions in the R API (now or when added) will be supported.

The tool was tuned on the base R code and code of many CRAN packages. In principle, there will always be some false alarms as long as the tool is also doing something useful, this is a general property of bug finding tools, there are also false alarms for the base R code, and existence of an rchk warning is not a reason to reject a package from CRAN; one is just asked to check whether it is a false alarm. If there are still cases where it does not work well (too many false alarms for reasonable code, typically with C++), the package can also be black-listed from the rchk checks.

lionel- commented 3 years ago

Thank you for following up.

Another question, is there a reason why this works:

SEXP foo(SEXP non_local) {
  SEXP value = Rf_allocVector(INTSXP, 1);
  SET_VECTOR_ELT(non_local, 0, value);

  Rf_allocVector(INTSXP, 1);

  return value;
}

But this doesn't?

// Initialised on load
SEXP non_local = NULL;

SEXP foo() {
  SEXP value = Rf_allocVector(INTSXP, 1);
  SET_VECTOR_ELT(non_local, 0, value);

  Rf_allocVector(INTSXP, 1);

  return value;
}

I would expect protection status to be equivalent for arguments and global variables.

kalibera commented 3 years ago

Function arguments should be protected by convention, hence the tool assumes they are. The tool also detects some cases when they are not at call sites. There is no convention/nothing is known about global variables, hence they are not assumed protected.

lionel- commented 3 years ago

Isn't it reasonable to assume global variables to be suitably protected?

Related issue, it seems that defineVar() isn't part of the set of protect-by-assignment functions:

// Function foo
//   [UP] unprotected variable value while calling allocating function Rf_allocVector

SEXP foo7(SEXP x, SEXP sym) {
  SEXP value = PROTECT(Rf_allocVector(INTSXP, 1));
  Rf_defineVar(sym, value, x);
  UNPROTECT(1);

  Rf_allocVector(INTSXP, 1);

  return value;
}
kalibera commented 3 years ago

No, I don't think it is reasonable to assume that global variables are protected. It would be nice in theory to be able to model what is on the precious list, so to kind of know is protected globally, but I don't think that is feasible.

Not all functions in the R API which assign (sometimes/always) pointers to a known-protected object are understood as setter functions. This was tuned based on existing code, and the set should not be too large so that supporting them is really beneficial for the code being readable/clear. defineVar() among other things is not in the public API. Good/defensive code would only rely on that in a very small subset of cases, including SET_VECTOR_ELT, SET_STRING_ELT, etc. Note: to take advantage of such setter functions to reduce the boiler-plate, one would not protect the fresh argument, so also assuming the function is safe to call with an unprotected argument, which is a violation of the common rule that the caller process. It should be rare.

lionel- commented 3 years ago

In my API I'm trying to push the convention that all setters protect their arguments. I find it creates more readable and maintainable code. Unfortunately I'm struggling with rchk which imposes current practice as the only practice, even though it is (in my opinion) lacking.

Here is the macro in question:

#define r_env_poke(ENV, SYM, VALUE)             \
  (Rf_defineVar(SYM, KEEP(VALUE), ENV),         \
   FREE(1),                                     \
   (void) NULL) 

This of course supposes that SYM does not allocate, which is common in our code (e.g. https://github.com/r-lib/rlang/blob/0c7eb5d9/src/internal/eval-tidy.c#L184) because we cache symbols in a global struct (https://github.com/r-lib/rlang/blob/0c7eb5d9/src/rlang/globals.h#L42).

Rf_defineVar() may not officially be part of the API but it really should be. It is essential to several NSE packages.