Open Rot127 opened 2 years ago
Confirmed with this repro:
class test29(unittest.TestCase, runner):
input = r"""
#define fCAST4_8s(A) ((int64_t)((int32_t)(A)))
#define fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE) (((SHAMT) < 0) ? ((fCAST##REGSTYPE(SRC) << ((-(SHAMT)) - 1)) << 1) : (fCAST##REGSTYPE(SRC) >> (SHAMT)))
#define fBIDIR_ASHIFTR(SRC, SHAMT, REGSTYPE) fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE##s)
#define fSXTN(N, M, VAL) (((N) != 0) ? sextract64((VAL), 0, (N)) : 0LL)
#define fHIDE(A) A
#define fECHO(A) (A)
#define DEF_SHORTCODE(TAG,SHORTCODE) insn(TAG, SHORTCODE)
DEF_SHORTCODE(S2_asr_r_r_acc, { fHIDE(size4s_t) shamt=fSXTN(7,32,RtV); RxV = fECHO(RxV + fBIDIR_ASHIFTR(RsV,shamt,4_8)); })
"""
# Getting instead:
# insn(S2_asr_r_r_acc, { size4s_t shamt=(((7) != 0) ? sextract64((RtV), 0, (7)) : 0LL); RxV = (RxV + (((shamt) < 0) ? ((fCAST4_8s(RsV) << ((-(shamt)) - 1)) << 1) : (fCAST4_8s(RsV) >> (shamt)))); })
# fCAST4_8s isn't being macro expanded
output = r"""#line 10
insn(S2_asr_r_r_acc, { size4s_t shamt=(((7) != 0) ? sextract64((RtV), 0, (7)) : 0LL); RxV = (RxV + (((shamt) < 0) ? ((((int64_t)((int32_t)(RsV))) << ((-(shamt)) - 1)) << 1) : (((int64_t)((int32_t)(RsV))) >> (shamt)))); })
"""
Thanks for the bug report!
@ned14 I distilled the repro further and tried it in pdb. By now I am convinced that the issue is the two levels of pasting tokens ~together and the way the prescan works~. However, I need to take a closer look and understand the code better to propose a fix.
Distilled version:
#define fCAST4_8s(A) A
#define SECOND_LVL_CONCAT(SRC, REGSTYPE) fCAST##REGSTYPE(SRC)
#define FIRST_LVL_CONCAT(SRC, REGSTYPE) SECOND_LVL_CONCAT(SRC, REGSTYPE##s)
FIRST_LVL_CONCAT(src,4_8)
I renamed the function-style macros for brevity.
When I run the C/C++ preprocessor on my system against it I get (the grep
invocations exist to filter out empty lines and those starting with #
followed by a number):
$ cpp smaller.h |grep -vP '^\s*$'|grep -vP '^# \d+'
src
and pcpp gives:
$ pcpp smaller.h |grep -vP '^\s*$'|grep -vP '^#line'
fCAST4_8s(src)
So we arrive at the name fCAST4_8s
only after two rounds of pasting tokens. With REGSTYPE
== 4_8
REGSTYPE##s
-> 4_8s
inside FIRST_LVL_CONCAT
fCAST##REGSTYPE
-> fCAST4_8s
inside SECOND_LVL_CONCAT
... and only then do we see that fCAST4_8s
needs expansion, too.
Alright, in pdb I can see that when we are in expand_macros()
we have the following tokens
:
[LexToken(CPP_WS,'\n',4,162), LexToken(CPP_ID,'FIRST_LVL_CONCAT',5,163), LexToken(CPP_LPAREN,'(',5,179), LexToken(CPP_ID,'src',5,180), LexToken(CPP_COMMA,',',5,183), LexToken(CPP_INTEGER,'4',5,184), LexToken(CPP_ID,'_8',5,185), LexToken(CPP_RPAREN,')',5,187), LexToken(CPP_WS,'\n',5,187)]
I think in particular the assessment of the lexer that 4_8
"dissolves" into an integer ('4'
) and an identifier ('_8'
) may be part of the problem: [..., LexToken(CPP_INTEGER,'4',5,184), LexToken(CPP_ID,'_8',5,185), ...]
...
Currently I think that this may cause the ##
-logic to fall apart, because instead of a single token preceded or succeeded by ##
now becomes two tokens in a row, preceded or succeeded by ##
.
And a short test confirms that this suspicion may indeed be true. Replacing 4_8
by r_8
thereby making it a single (identifier type) token, we can even convince pcpp
to behave nicely.
Temporary repro (to force the lexer to give us a single token):
#define fCASTr_8s(A) A
#define SECOND_LVL_CONCAT(SRC, REGSTYPE) fCAST##REGSTYPE(SRC)
#define FIRST_LVL_CONCAT(SRC, REGSTYPE) SECOND_LVL_CONCAT(SRC, REGSTYPE##s)
FIRST_LVL_CONCAT(src,r_8)
Output for cpp
remains correct (and the same), but for pcpp
we now get:
$ pcpp --debug smaller.h |grep -vP '^\s*$'|grep -vP '^#line'
src
Unfortunately the pcpp_debug.log
doesn't contain the output of how the last line is seen.
Just a thought: perhaps a notion like "debug levels" (
--debug=n
, defaulting to 1 when given) could help to turn this really verbose, which would be impractical for real-world preprocessing but invaluable for tracking down defects?!
Anyway, pdb to the rescue. This looks a lot better already ([..., LexToken(CPP_ID,'r_8',5,184), ...]
):
(Pdb) p tokens
[LexToken(CPP_WS,'\n',4,162), LexToken(CPP_ID,'FIRST_LVL_CONCAT',5,163), LexToken(CPP_LPAREN,'(',5,179), LexToken(CPP_ID,'src',5,180), LexToken(CPP_COMMA,',',5,183), LexToken(CPP_ID,'r_8',5,184), LexToken(CPP_RPAREN,')',5,187), LexToken(CPP_WS,'\n',5,187)]
So the ~lexer~ parser.py
is what needs fixing.
For reference this is the conditional breakpoint at preprocess.py:558 I set:
(Pdb) b
Num Type Disp Enb Where
1 breakpoint keep yes at /home/oliver/PCPP/pcpp/pcpp/preprocessor.py:558
stop only if len(tokens) > 0
breakpoint already hit 4 times
Side note: Looking at def CPP_INTEGER(t)
in parser.py
I think this regex:
(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)
can be vastly simplified and at the same time expanded to support 0b
prefixed binary integer literals and grouping with '
:
(?i)((?:0b[01]+|0x[0-9A-F]+|[\d']+)(?:UL{0,2}|L{0,2}U)?)
We only appear to use a single capture group, so make the two remaining groups non-capturing. Also set mode to case-insensitive (works both on Python 2.7 and 3.x).
What do you think, @ned14?
Seems plausible.
The hard part is coming up with a fix which doesn't break other things and is maintainable.
The hard part is coming up with a fix which doesn't break other things and is maintainable.
Indeed, which is the reason why I didn't rush to fixing it. The code I saw during debugging has to sink in first and I need to get more acquainted with the code in general. Also I think it helps to aim for highest possible test coverage before attempting to tackle this (which is why I did this tox thingamy).
FYI: The latest repro from this comment gets properly resolved with the changes from PR #80.
In the following example not all macros get resolved:
Running
pcpp test.h
returns:Note that the
fCAST4_8s
macro was not resolved.It seems to be a problem with the
##
operator.If
fCAST##REGSTYPE
is replaced withfCAST4_8s
in macrofBIDIR_SHIFTR
it works fine.pcpp version: 1.30