r-devel / r-project-sprint-2023

Material for the R project sprint
https://contributor.r-project.org/r-project-sprint-2023/
17 stars 3 forks source link

Rationalize (unify if possible) retrieval of symbol names for error messages #65

Open gmbecker opened 1 year ago

gmbecker commented 1 year ago

Endorsed by Luke

This is done (at least) 3 different ways in base C code:

   2 translateChar(PRINTNAME(...))
  27 EncodeChar(PRINTNAME(...))
 122 CHAR(PRINTNAME(...))

 Can we standardize only one? If not, make it easier to see which
 why a particular one is needed?
aitap commented 1 year ago

I thought that symbols were required to be in the native encoding (making CHAR(PRINTNAME(...)) the most reasonable option), but I cannot find it documented anywhere. Is it really so?

gmbecker commented 1 year ago

I'm not sure. Thats the question. That version is thr most common as we see in the counts provided by Luke. The question is whether all others can be safely converted to that or of they are needed for some reason, and if so a) what those possible reasons are, and b) if all cases where thr alternate is needed it is actually being used.

On Mon, Aug 28, 2023, 2:47 PM aitap @.***> wrote:

I thought that symbols were required to be in the native encoding (making CHAR(PRINTNAME(...)) the most reasonable option), but I cannot find it documented anywhere. Is it really so?

— Reply to this email directly, view it on GitHub https://github.com/r-devel/r-project-sprint-2023/issues/65#issuecomment-1696473147, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG53MJMUMQHWPVJF3IH2BLXXUGWVANCNFSM6AAAAAA4AR3DFE . You are receiving this because you authored the thread.Message ID: @.***>

ltierney commented 1 year ago

That is part of the question, i.e. is translateChar ever needed and if it is, should it be used more often?

EncodeString does something different as described in printutils.c. But again one question is whether it should be used more or less.

aitap commented 1 year ago

I can work on checking this. I think it's possible using a combination of static call graph analysis and adding a check to SET_PRINTNAME() and seeing what breaks.

On the other hand, nothing currently prevents the C code from creating a symbol in non-native encoding:

#define R_NO_REMAP
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Rdynload.h>

LibExtern Rboolean utf8locale;

SEXP ohno(void) {
    SEXP sym = PROTECT(Rf_allocSExp(SYMSXP));
    SET_PRINTNAME(sym,
        utf8locale ? Rf_mkCharCE("\xff", CE_LATIN1)
        : Rf_mkCharCE("\xc3\xbf", CE_UTF8)
    );
    SET_SYMVALUE(sym, R_UnboundValue);
    SET_DDVAL(sym, 0);
    UNPROTECT(1);
    return sym;
}

Depending on whether the current locale is UTF-8 or not, it breaks in different ways.

aitap commented 1 year ago

R symbols begin their life by being allocated using allocSExp. The following Coccinelle script locates all uses of allocSExp where the type argument is the constant SYMSXP or a variable:

@@
SEXPTYPE x;
@@
*allocSExp(
(
  SYMSXP
|
  x
)
 )
spatch --recursive-includes allocSExp.cocci --dir R-devel/src

(Coccinelle speaks in unified diffs, and it indicates matches by marking them for removal.)

Let's follow the functions that allocate symbols:

allocSExp(SYMSXP)

SEXP installNoTrChar(SEXP charSXP) is also mentioned in Rinternals.h, so could conceivably be called by packages. Is this enough of a proof that symbol names are valid in the native encoding?

aitap commented 1 year ago

Symbols ought to be treated as immutable, but it's worth checking for uses of SET_PRINTNAME just in case:

@@
SEXP x, y;
@@
*SET_PRINTNAME(x,y)

The only two users are static SEXP mkSymMarker(SEXP pname) and attribute_hidden SEXP mkSYMSXP(SEXP name, SEXP value), which we've already discussed.

Searching for allocSExp on GitHub CRAN mirror gives a few hits, but almost all of them use a constant argument that is not SYMSXP and the remaining uses are obviously not creating a symbol either. mkSYMSXP gives 0 results. SET_PRINTNAME gives 1 result that is not about setting a non-native-encoding symbol. installNoTrChar gives a few Rust declarations. (All search links require logging in.)

This was about encoding safety. What about other operations on PRINTNAME?

The EncodeChar calls are there for a reason. They are used when formatting error messages, and they are needed even for native-encoding symbols in order to reproduce special characters like \n when printing Error: object '\n' not found.

The two uses of translateChar may need to be changed.

What about the majority of the CHAR(PRINTNAME(.)) uses? Many of these are used in error messages, so should probably be changed to EncodeChar() too. I can prepare a patch to do that.

ltierney commented 1 year ago

A patch would be great. Happy to review it.

aitap commented 1 year ago

Work in progress on R-devel, but involves a bit of yak shaving. The best way to present a symbol in an error message is EncodeChar, but it uses a statically-allocated buffer that may be overwritten any time R is called. The only safe way to use it with errorcall() is errorcall_cpy(). It's much safer to make errorcall() and warningcall() stringify their format arguments before calling R (like currently done by error() and warning() using a temporary string buffer). We can save a lot of stack space if we rework multiple functions in errors.c to perform the formatting only once and share the buffer while doing that. And so on...

Once this is done, I intend to implement the function EncodeSymbol(sym, buf, buf_size) suggested by Luke Tierney to make it possible to EncodeChar() multiple different arguments in a single call to error(...).