Open gmbecker opened 1 year ago
I thought that symbols were required to be in the native encoding (making CHAR(PRINTNAME(...)) the most reasonable option), but I cannot find it documented anywhere. Is it really so?
I'm not sure. Thats the question. That version is thr most common as we see in the counts provided by Luke. The question is whether all others can be safely converted to that or of they are needed for some reason, and if so a) what those possible reasons are, and b) if all cases where thr alternate is needed it is actually being used.
On Mon, Aug 28, 2023, 2:47 PM aitap @.***> wrote:
I thought that symbols were required to be in the native encoding (making CHAR(PRINTNAME(...)) the most reasonable option), but I cannot find it documented anywhere. Is it really so?
— Reply to this email directly, view it on GitHub https://github.com/r-devel/r-project-sprint-2023/issues/65#issuecomment-1696473147, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG53MJMUMQHWPVJF3IH2BLXXUGWVANCNFSM6AAAAAA4AR3DFE . You are receiving this because you authored the thread.Message ID: @.***>
That is part of the question, i.e. is translateChar
ever needed and if it is, should it be used more often?
EncodeString
does something different as described in printutils.c
. But again one question is whether it should be used more or less.
I can work on checking this. I think it's possible using a combination of static call graph analysis and adding a check to SET_PRINTNAME()
and seeing what breaks.
On the other hand, nothing currently prevents the C code from creating a symbol in non-native encoding:
#define R_NO_REMAP
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Rdynload.h>
LibExtern Rboolean utf8locale;
SEXP ohno(void) {
SEXP sym = PROTECT(Rf_allocSExp(SYMSXP));
SET_PRINTNAME(sym,
utf8locale ? Rf_mkCharCE("\xff", CE_LATIN1)
: Rf_mkCharCE("\xc3\xbf", CE_UTF8)
);
SET_SYMVALUE(sym, R_UnboundValue);
SET_DDVAL(sym, 0);
UNPROTECT(1);
return sym;
}
Depending on whether the current locale is UTF-8 or not, it breaks in different ways.
R symbols begin their life by being allocated using allocSExp
. The
following Coccinelle script locates all uses of allocSExp
where the
type
argument is the constant SYMSXP
or a variable:
@@
SEXPTYPE x;
@@
*allocSExp(
(
SYMSXP
|
x
)
)
spatch --recursive-includes allocSExp.cocci --dir R-devel/src
(Coccinelle speaks in unified diffs, and it indicates matches by marking them for removal.)
Let's follow the functions that allocate symbols:
allocSExp(SYMSXP)
main/names.c
static SEXP mkSymMarker(SEXP pname)
This function sets the PRINTNAME
of the symbol to pname
. It is
only called with R_NilValue
or mkChar(...)
, so is guaranteed not
to produce symbols with non-native encoding.main/dstruct.c
SEXP mkSYMSXP(SEXP name, SEXP value)
This function sets the PRINTNAME
of the symbol to name
. Let's
locate its callers:
@@
expression x, y;
@@
*mkSYMSXP(x,y)
main/names.c
SEXP install(const char *name)
Uses mkChar(name)
, so always returns a native-encoding symbol.SEXP installNoTrChar(SEXP charSXP)
Documented to be equivalent to install(CHAR(charSXP))
. May
create symbols with UTF-8 encoding when a UTF-8 locale is in
effect or Latin-1 symbols when a Latin-1 locale is in effect. A
CHARSXP
in the wrong encoding will be reinterpreted as if it was
in the native encoding, so any use of it with incompatible
CHARSXP
is a bug. A comment also says that installTrChar
should be used instead in almost all cases.
@@
expression x;
@@
*installNoTrChar(x)
main/sysutils.c
: SEXP installTrChar(SEXP x)
This function delegates to installNoTrChar
if no translation is
needed, otherwise translates the CHARSXP
and calls install
.allocSExp
in main/saveload.c
and
main/serialize.c
, but they all allocate other types. In particular,
the code in these two files always creates symbols using either
install()
or installTrChar()
.SEXP installNoTrChar(SEXP charSXP)
is also mentioned in Rinternals.h
, so could conceivably be called by packages. Is this enough of a proof that symbol names are valid in the native encoding?
Symbols ought to be treated as immutable, but it's worth checking for uses of SET_PRINTNAME
just in case:
@@
SEXP x, y;
@@
*SET_PRINTNAME(x,y)
The only two users are static SEXP mkSymMarker(SEXP pname)
and attribute_hidden SEXP mkSYMSXP(SEXP name, SEXP value)
, which we've already discussed.
Searching for allocSExp
on GitHub CRAN mirror gives a few hits, but almost all of them use a constant argument that is not SYMSXP
and the remaining uses are obviously not creating a symbol either. mkSYMSXP
gives 0 results. SET_PRINTNAME
gives 1 result that is not about setting a non-native-encoding symbol. installNoTrChar
gives a few Rust declarations. (All search links require logging in.)
This was about encoding safety. What about other operations on PRINTNAME
?
The EncodeChar
calls are there for a reason. They are used when formatting error messages, and they are needed even for native-encoding symbols in order to reproduce special characters like \n
when printing Error: object '\n' not found
.
The two uses of translateChar
may need to be changed.
src/main/subscript.c
, OneIndex(SEXP x, SEXP s, R_xlen_t nx, int partial, SEXP *newname, int pos, SEXP call)
, used in e.g. cars[[as.name('dist')]] <- NA
. Since symbols normally exist in the native encoding, translateChar
is harmless but extraneous.src/main/debug.c
, static void memtrace_stack_dump(void)
, used to print the stack trace when a memory-traced object is duplicated. Since a function could concievably be named something weird like \r\n\t\v
, this should probably be changed to EncodeChar()
.What about the majority of the CHAR(PRINTNAME(.))
uses? Many of these are used in error messages, so should probably be changed to EncodeChar()
too. I can prepare a patch to do that.
A patch would be great. Happy to review it.
Work in progress on R-devel, but involves a bit of yak shaving. The best way to present a symbol in an error message is EncodeChar
, but it uses a statically-allocated buffer that may be overwritten any time R is called. The only safe way to use it with errorcall()
is errorcall_cpy()
. It's much safer to make errorcall()
and warningcall()
stringify their format arguments before calling R (like currently done by error()
and warning()
using a temporary string buffer). We can save a lot of stack space if we rework multiple functions in errors.c
to perform the formatting only once and share the buffer while doing that. And so on...
Once this is done, I intend to implement the function EncodeSymbol(sym, buf, buf_size)
suggested by Luke Tierney to make it possible to EncodeChar()
multiple different arguments in a single call to error(...)
.
Endorsed by Luke