Rationalize (unify if possible) retrieval of symbol names for error messages

gmbecker commented 1 year ago

Endorsed by Luke

This is done (at least) 3 different ways in base C code:

   2 translateChar(PRINTNAME(...))
  27 EncodeChar(PRINTNAME(...))
 122 CHAR(PRINTNAME(...))

 Can we standardize only one? If not, make it easier to see which
 why a particular one is needed?

aitap commented 1 year ago

I thought that symbols were required to be in the native encoding (making CHAR(PRINTNAME(...)) the most reasonable option), but I cannot find it documented anywhere. Is it really so?

gmbecker commented 1 year ago

I'm not sure. Thats the question. That version is thr most common as we see in the counts provided by Luke. The question is whether all others can be safely converted to that or of they are needed for some reason, and if so a) what those possible reasons are, and b) if all cases where thr alternate is needed it is actually being used.

On Mon, Aug 28, 2023, 2:47 PM aitap @.***> wrote:

I thought that symbols were required to be in the native encoding (making CHAR(PRINTNAME(...)) the most reasonable option), but I cannot find it documented anywhere. Is it really so?

— Reply to this email directly, view it on GitHub https://github.com/r-devel/r-project-sprint-2023/issues/65#issuecomment-1696473147, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG53MJMUMQHWPVJF3IH2BLXXUGWVANCNFSM6AAAAAA4AR3DFE . You are receiving this because you authored the thread.Message ID: @.***>

ltierney commented 1 year ago

That is part of the question, i.e. is translateChar ever needed and if it is, should it be used more often?

EncodeString does something different as described in printutils.c. But again one question is whether it should be used more or less.

aitap commented 1 year ago

I can work on checking this. I think it's possible using a combination of static call graph analysis and adding a check to SET_PRINTNAME() and seeing what breaks.

On the other hand, nothing currently prevents the C code from creating a symbol in non-native encoding:

#define R_NO_REMAP
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Rdynload.h>

LibExtern Rboolean utf8locale;

SEXP ohno(void) {
    SEXP sym = PROTECT(Rf_allocSExp(SYMSXP));
    SET_PRINTNAME(sym,
        utf8locale ? Rf_mkCharCE("\xff", CE_LATIN1)
        : Rf_mkCharCE("\xc3\xbf", CE_UTF8)
    );
    SET_SYMVALUE(sym, R_UnboundValue);
    SET_DDVAL(sym, 0);
    UNPROTECT(1);
    return sym;
}

Depending on whether the current locale is UTF-8 or not, it breaks in different ways.

aitap commented 1 year ago

R symbols begin their life by being allocated using allocSExp. The following Coccinelle script locates all uses of allocSExp where the type argument is the constant SYMSXP or a variable:

@@
SEXPTYPE x;
@@
*allocSExp(
(
  SYMSXP
|
  x
)
 )

spatch --recursive-includes allocSExp.cocci --dir R-devel/src

(Coccinelle speaks in unified diffs, and it indicates matches by marking them for removal.)

Let's follow the functions that allocate symbols:

allocSExp(SYMSXP)

main/names.c
- static SEXP mkSymMarker(SEXP pname) This function sets the PRINTNAME of the symbol to pname. It is only called with R_NilValue or mkChar(...), so is guaranteed not to produce symbols with non-native encoding.
main/dstruct.c
- SEXP mkSYMSXP(SEXP name, SEXP value) This function sets the PRINTNAME of the symbol to name. Let's locate its callers:
```
@@
expression x, y;
@@
*mkSYMSXP(x,y)
```
- main/names.c
  - SEXP install(const char *name) Uses mkChar(name), so always returns a native-encoding symbol.
  - SEXP installNoTrChar(SEXP charSXP) Documented to be equivalent to install(CHAR(charSXP)). May create symbols with UTF-8 encoding when a UTF-8 locale is in effect or Latin-1 symbols when a Latin-1 locale is in effect. A CHARSXP in the wrong encoding will be reinterpreted as if it was in the native encoding, so any use of it with incompatible CHARSXP is a bug. A comment also says that installTrChar should be used instead in almost all cases.
```
@@
expression x;
@@
*installNoTrChar(x)
```
  - main/sysutils.c: SEXP installTrChar(SEXP x) This function delegates to installNoTrChar if no translation is needed, otherwise translates the CHARSXP and calls install.
There are a few uses of allocSExp in main/saveload.c and main/serialize.c, but they all allocate other types. In particular, the code in these two files always creates symbols using either install() or installTrChar().

SEXP installNoTrChar(SEXP charSXP) is also mentioned in Rinternals.h, so could conceivably be called by packages. Is this enough of a proof that symbol names are valid in the native encoding?

aitap commented 1 year ago

Symbols ought to be treated as immutable, but it's worth checking for uses of SET_PRINTNAME just in case:

@@
SEXP x, y;
@@
*SET_PRINTNAME(x,y)

The only two users are static SEXP mkSymMarker(SEXP pname) and attribute_hidden SEXP mkSYMSXP(SEXP name, SEXP value), which we've already discussed.

Searching for allocSExp on GitHub CRAN mirror gives a few hits, but almost all of them use a constant argument that is not SYMSXP and the remaining uses are obviously not creating a symbol either. mkSYMSXP gives 0 results. SET_PRINTNAME gives 1 result that is not about setting a non-native-encoding symbol. installNoTrChar gives a few Rust declarations. (All search links require logging in.)

This was about encoding safety. What about other operations on PRINTNAME?

The EncodeChar calls are there for a reason. They are used when formatting error messages, and they are needed even for native-encoding symbols in order to reproduce special characters like \n when printing Error: object '\n' not found.

The two uses of translateChar may need to be changed.

One is in src/main/subscript.c, OneIndex(SEXP x, SEXP s, R_xlen_t nx, int partial, SEXP *newname, int pos, SEXP call), used in e.g. cars[[as.name('dist')]] <- NA. Since symbols normally exist in the native encoding, translateChar is harmless but extraneous.
The other is in src/main/debug.c, static void memtrace_stack_dump(void), used to print the stack trace when a memory-traced object is duplicated. Since a function could concievably be named something weird like \r\n\t\v, this should probably be changed to EncodeChar().

What about the majority of the CHAR(PRINTNAME(.)) uses? Many of these are used in error messages, so should probably be changed to EncodeChar() too. I can prepare a patch to do that.

ltierney commented 1 year ago

A patch would be great. Happy to review it.

aitap commented 1 year ago

Work in progress on R-devel, but involves a bit of yak shaving. The best way to present a symbol in an error message is EncodeChar, but it uses a statically-allocated buffer that may be overwritten any time R is called. The only safe way to use it with errorcall() is errorcall_cpy(). It's much safer to make errorcall() and warningcall() stringify their format arguments before calling R (like currently done by error() and warning() using a temporary string buffer). We can save a lot of stack space if we rework multiple functions in errors.c to perform the formatting only once and share the buffer while doing that. And so on...

Once this is done, I intend to implement the function EncodeSymbol(sym, buf, buf_size) suggested by Luke Tierney to make it possible to EncodeChar() multiple different arguments in a single call to error(...).

r-devel / r-project-sprint-2023

Rationalize (unify if possible) retrieval of symbol names for error messages #65