SyntaxError on use in expression of symbol with leading decimal digits

willwray commented 1 year ago

Here's a reduced reproducer:

#define Ox 0x
#if Ox
#endif

then pcpp test.h gives

test.h:3 error: Could not evaluate expression due to SyntaxError("around token 'x' type CPP_ID") (passed to evaluator: '0x')

It looks like leading decimal digits are eagerly stripped when parsed for the expression.

willwray commented 1 year ago

debugpy/launcher 37201 -- -m pcmd test.h

PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
PyInt_FromLong not found.
test.h:3 error: Could not evaluate expression due to SyntaxError("around token 'x' type CPP_ID") (passed to evaluator: '0x')
PyInt_FromLong not found.

ned14 commented 1 year ago

That's invalid input, and it did give a fairly good hint as to what's invalid about it.

willwray commented 1 year ago

Oops, I was overzealous in reducing the reproducer to less than minimal. Here's a reproducer that actually preprocesses:

#define CAT_(A,B)A##B
#define CAT(A,B)CAT_(A,B)

#define Ox 0x
#if CAT(Ox,0)
#endif

willwray commented 1 year ago

It appears that '0x0' (the string reported as "passed to evaluator") is somehow lexed as CPP_INTEGER followed by CPP_ID, where it should remain a single preprocessing token.
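To illustrate how the paste could end up as CPP_INTEGER followed by CPP_ID, here is a minimal sketch. The two rule sets and the `lex` and `paste` helpers are hypothetical simplifications, not pcpp's actual lexer tables or macro expander; they only model "join the two adjacent spellings and re-lex".

```python
import re

# Hypothetical, simplified token rules (NOT pcpp's real lexer tables).
# INT_FIRST mimics an integer-literal rule; PP_NUM uses a pp-number rule.
INT_FIRST = re.compile(
    r"(?P<INT>(?:0[xX][0-9a-fA-F]+|\d+)(?:[uU][lL]|[lL][uU]|[uU]|[lL])?)"
    r"|(?P<ID>[A-Za-z_]\w*)"
)
PP_NUM = re.compile(
    r"(?P<NUM>\.?[0-9](?:[eEpP][-+]|'[0-9A-Za-z_]|[0-9A-Za-z_.])*)"
    r"|(?P<ID>[A-Za-z_]\w*)"
)

def lex(rules, text):
    """Split text into (kind, spelling) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = rules.match(text, pos)
        if m is None:
            raise ValueError(f"cannot lex {text[pos:]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def paste(rules, lhs, rhs):
    """Model A##B: join the last token of lhs to the first token of rhs,
    then re-lex the joined spelling."""
    joined = lhs[-1][1] + rhs[0][1]
    return lhs[:-1] + lex(rules, joined) + rhs[1:]

# '#define Ox 0x' followed by CAT(Ox,0), under each rule set.
print(paste(INT_FIRST, lex(INT_FIRST, "0x"), lex(INT_FIRST, "0")))
print(paste(PP_NUM, lex(PP_NUM, "0x"), lex(PP_NUM, "0")))
```

With the integer-first rules, `0x` is already split into `0` and `x` before the paste, so `##` only glues `x` to `0`. With a pp-number rule, `0x` stays one token and the paste yields the single token `0x0`.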

willwray commented 1 year ago

FYI, the error was hit using pcpp to do codegen with this preprocessing library https://github.com/willwray/IREPEAT in processing 'vertical' repetitions - here's one of the many problematic lines https://github.com/willwray/IREPEAT/blob/master/VREPEATx10.hpp#L11

(it works with gcc, clang, and the new conforming msvc preprocessor)

willwray commented 1 year ago

Also FYI, I'm looking at using pcpp to create an amalgamated header (convenient for use on Compiler Explorer via a single #include<url>)

I'm also evaluating if it can create nicer codegen than the native cpp's. It seems to create more empty lines than gcc and clang, but far fewer than msvc.

willwray commented 1 year ago

The "PyInt_FromLong not found." spam seems to be coming from the debugger - a red herring.

willwray commented 1 year ago

pcpp lacks a pp-number token (as defined in the C++ standard; same for C11 and C99), so the tokenization is wrongly choosing CPP_INTEGER

> ppint = r'(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)'
> match = re.search(ppint,"0x")
> match.group()
: '0'

when it should choose pp-number as the max-munch

> ppnum = r"\.?[0-9]([A-Za-z_][\w_]*|[eEpP][-+]|'[a-zA-Z0-9_])*"
> match = re.search(ppnum,"0x")
> match.group()
: '0x'
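As a slightly fuller sketch of the same idea, the regex below follows the pp-number grammar more closely: an optional `.`, a digit, then any run of signed exponents, digit separators, identifier characters, or `.`. The names `PP_NUMBER` and `munch_pp_number` are made up for illustration and are not part of pcpp.

```python
import re

# pp-number: optional '.', one digit, then any sequence of
#   e+/e-/p+/p- pairs, 'x digit separators, identifier characters, or '.'.
# The signed-exponent alternative comes first so that "1e-1" is one token.
PP_NUMBER = re.compile(r"\.?[0-9](?:[eEpP][-+]|'[0-9A-Za-z_]|[0-9A-Za-z_.])*")

def munch_pp_number(text):
    """Return the longest pp-number at the start of text, or None."""
    m = PP_NUMBER.match(text)
    return m.group() if m else None

for s in ("0x", "0x0", "10", "1e-1", ".5f", "1+2", "x0"):
    print(repr(s), "->", repr(munch_pp_number(s)))
```

Note that `match` (anchored at the start) is used rather than `search`, since a lexer asks what token begins at the current position.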

In translation phase 3 the input is decomposed into preprocessing tokens; phase 4 then executes # directives, and each #include recurses back through phases 1, 2, 3...

Only in phase 7 are preprocessing tokens converted into tokens for translation.

pcpp only has one set of tokens (I'm trying to hack in a CPP_NUMBER token, no luck yet)

willwray commented 1 year ago

Help! Can't work out how to hack it.

Do the lextab.py and parsetab.py tables have to be regenerated? If so, how?

There's a comment on the in_production variable:

in_production = 1  # Set to 0 if editing pcpp implementation!

Even when it is set to zero, my edits are still ignored - PLY introspects the new CPP_NUMBER token, then it seems to get lost at some point (maybe because the table files are used).

willwray commented 1 year ago

Related issue #71 also notes the incorrect parse as glued CPP_INTEGER and CPP_ID.

willwray commented 1 year ago

This could be a straightforward fix (still can't work out how to test it).

The current gcc lex.cc only processes CPP_NUMBER.

This 2001 bugfix commit to the C preprocessor, "c-lex.c (c_lex): Remove CPP_INT, CPP_FLOAT cases" ("Don't use CPP_INT, CPP_FLOAT; CPP_NUMBER is enough"), shows pp-number is sufficient for preprocessor lexing.

Then, for evaluator.py processing of #if conditionals, CPP_INTEGER is needed only "After all macro expansion and evaluation of ... .", when "Then the expression is evaluated as an integral constant expression".

The current evaluator should correctly interpret any CPP_INTEGER.

In other words, CPP_INTEGER should be needed only for the evaluator (and where the CPP_INTEGER##CPP_ID combo is a UDL user-defined literal)
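A rough sketch of what "only the evaluator needs integers" could look like: keep pp-number spellings as-is through macro expansion, and classify them as integer constants only when an #if expression is evaluated. `pp_number_to_int` is a hypothetical helper, not pcpp's evaluator API; it assumes C11-style hex, octal, and decimal literals and ignores digit separators and binary literals.

```python
import re

# Integer-constant spellings: hex, octal, or decimal, with an optional
# unsigned/long suffix. Anything else (e.g. '0x', '1e-1') is rejected.
INTEGER = re.compile(
    r"(?:0[xX](?P<hex>[0-9a-fA-F]+)|(?P<oct>0[0-7]*)|(?P<dec>[1-9][0-9]*))"
    r"(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?"
)

def pp_number_to_int(spelling):
    """Convert a pp-number spelling to an int, or raise ValueError."""
    m = INTEGER.fullmatch(spelling)
    if m is None:
        raise ValueError(f"pp-number {spelling!r} is not an integer constant")
    if m.group("hex") is not None:
        return int(m.group("hex"), 16)
    if m.group("oct") is not None:
        return int(m.group("oct"), 8)
    return int(m.group("dec"), 10)

print(pp_number_to_int("0x0"))     # valid only after the paste has completed
print(pp_number_to_int("0x10UL"))
try:
    pp_number_to_int("0x")         # the unpasted fragment is still an error
except ValueError as exc:
    print(exc)
```

Under this scheme `#if Ox` is still diagnosed as invalid, while `#if CAT(Ox,0)` evaluates the single pp-number `0x0`.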

Possible issues

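pp-number is a broad superset that also matches spellings which are not valid numeric literals.
When writing output, adjacent tokens may need separating; see lex.cc cpp_avoid_paste, "avoid an accidental token paste".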
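The accidental-paste concern can be sketched as follows: when writing tokens back out, insert a space wherever the concatenated spellings would re-lex as a different first token. This is an illustrative approximation only, not gcc's actual cpp_avoid_paste logic; `first_token`, `needs_space`, and `emit` are hypothetical names, and only pp-numbers and identifiers are handled (multi-character punctuators such as `++` are not).

```python
import re

PP_NUMBER = re.compile(r"\.?[0-9](?:[eEpP][-+]|'[0-9A-Za-z_]|[0-9A-Za-z_.])*")
IDENT = re.compile(r"[A-Za-z_]\w*")

def first_token(text):
    """Return the first token of text: a pp-number, an identifier,
    or else a single character."""
    for rule in (PP_NUMBER, IDENT):
        m = rule.match(text)
        if m:
            return m.group()
    return text[:1]

def needs_space(left, right):
    """True if writing left and right adjacently would change how left lexes."""
    return first_token(left + right) != left

def emit(spellings):
    """Join token spellings, adding a space only where a paste would occur."""
    out = []
    for s in spellings:
        if out and needs_space(out[-1], s):
            out.append(" ")
        out.append(s)
    return "".join(out)

print(emit(["0", "x0"]))      # two tokens must not merge into '0x0'
print(emit(["1e", "-", "1"])) # '1e' followed by '-' would become '1e-'
print(emit(["1", "+", "2"]))  # no space needed
```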

ned14 commented 1 year ago

You may find the ply parser docs at https://www.dabeaz.com/ply/ of use on how it works and generates the precalculated table files.

willwray commented 1 year ago

Related issue in Boost.Wave :wave: BOOST_PP_CAT(1e, -1) pp-token bug fixed early 2006

ned14 / pcpp

SyntaxError on use in expression of symbol with leading decimal digits #79