Incomplete macro reolving

Rot127 commented 2 years ago

In the following example not all macros get resolved:

#define fCAST4_8s(A) ((int64_t)((int32_t)(A)))
#define fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE)    (((SHAMT) < 0) ? ((fCAST##REGSTYPE(SRC) << ((-(SHAMT)) - 1)) << 1)                   : (fCAST##REGSTYPE(SRC) >> (SHAMT)))
#define fBIDIR_ASHIFTR(SRC, SHAMT, REGSTYPE)    fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE##s)
#define fSXTN(N, M, VAL) (((N) != 0) ? sextract64((VAL), 0, (N)) : 0LL)
#define fHIDE(A) A
#define fECHO(A) (A)
#define DEF_SHORTCODE(TAG,SHORTCODE)    insn(TAG, SHORTCODE)

DEF_SHORTCODE(S2_asr_r_r_acc, { fHIDE(size4s_t) shamt=fSXTN(7,32,RtV); RxV = fECHO(RxV + fBIDIR_ASHIFTR(RsV,shamt,4_8)); })

Running pcpp test.h returns:

insn(S2_asr_r_r_acc, { size4s_t shamt=(((7) != 0) ? sextract64((RtV), 0, (7)) : 0LL); RxV = (RxV + (((shamt) < 0) ? ((fCAST4_8s(RsV) << ((-(shamt)) - 1)) << 1)                   : (fCAST4_8s(RsV) >> (shamt)))); })

Note that the fCAST4_8s macro was not resolved.

It seems to be a problem with the ## operator.

If fCAST##REGSTYPE is replaced with fCAST4_8s in macro fBIDIR_SHIFTR it works fine.

pcpp version: 1.30

ned14 commented 2 years ago

Confirmed with this repro:

class test29(unittest.TestCase, runner):
    input = r"""
#define fCAST4_8s(A) ((int64_t)((int32_t)(A)))
#define fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE)    (((SHAMT) < 0) ? ((fCAST##REGSTYPE(SRC) << ((-(SHAMT)) - 1)) << 1)                   : (fCAST##REGSTYPE(SRC) >> (SHAMT)))
#define fBIDIR_ASHIFTR(SRC, SHAMT, REGSTYPE)    fBIDIR_SHIFTR(SRC, SHAMT, REGSTYPE##s)
#define fSXTN(N, M, VAL) (((N) != 0) ? sextract64((VAL), 0, (N)) : 0LL)
#define fHIDE(A) A
#define fECHO(A) (A)
#define DEF_SHORTCODE(TAG,SHORTCODE)    insn(TAG, SHORTCODE)

DEF_SHORTCODE(S2_asr_r_r_acc, { fHIDE(size4s_t) shamt=fSXTN(7,32,RtV); RxV = fECHO(RxV + fBIDIR_ASHIFTR(RsV,shamt,4_8)); })
"""
# Getting instead:
# insn(S2_asr_r_r_acc, { size4s_t shamt=(((7) != 0) ? sextract64((RtV), 0, (7)) : 0LL); RxV = (RxV + (((shamt) < 0) ? ((fCAST4_8s(RsV) << ((-(shamt)) - 1)) << 1)                   : (fCAST4_8s(RsV) >> (shamt)))); })
# fCAST4_8s isn't being macro expanded
    output = r"""#line 10
insn(S2_asr_r_r_acc, { size4s_t shamt=(((7) != 0) ? sextract64((RtV), 0, (7)) : 0LL); RxV = (RxV + (((shamt) < 0) ? ((((int64_t)((int32_t)(RsV))) << ((-(shamt)) - 1)) << 1) : (((int64_t)((int32_t)(RsV))) >> (shamt)))); })
"""

Thanks for the bug report!

assarbad commented 1 year ago

@ned14 I distilled the repro further and tried it in pdb. By now I am convinced that the issue is the two levels of pasting tokens ~together and the way the prescan works~. However, I need to take a closer look and understand the code better to propose a fix.

Distilled version:

#define fCAST4_8s(A) A
#define SECOND_LVL_CONCAT(SRC, REGSTYPE) fCAST##REGSTYPE(SRC)
#define FIRST_LVL_CONCAT(SRC, REGSTYPE)  SECOND_LVL_CONCAT(SRC, REGSTYPE##s)

FIRST_LVL_CONCAT(src,4_8)

I renamed the function-style macros for brevity.

When I run the C/C++ preprocessor on my system against it I get (the grep invocations exist to filter out empty lines and those starting with # followed by a number):

$ cpp smaller.h |grep -vP '^\s*$'|grep -vP '^# \d+'
src

and pcpp gives:

$ pcpp smaller.h |grep -vP '^\s*$'|grep -vP '^#line'
fCAST4_8s(src)

So we arrive at the name fCAST4_8s only after two rounds of pasting tokens. With REGSTYPE == 4_8

REGSTYPE##s -> 4_8s inside FIRST_LVL_CONCAT
fCAST##REGSTYPE -> fCAST4_8s inside SECOND_LVL_CONCAT

... and only then do we see that fCAST4_8s needs expansion, too.

assarbad commented 1 year ago

Alright, in pdb I can see that when we are in expand_macros() we have the following tokens:

[LexToken(CPP_WS,'\n',4,162), LexToken(CPP_ID,'FIRST_LVL_CONCAT',5,163), LexToken(CPP_LPAREN,'(',5,179), LexToken(CPP_ID,'src',5,180), LexToken(CPP_COMMA,',',5,183), LexToken(CPP_INTEGER,'4',5,184), LexToken(CPP_ID,'_8',5,185), LexToken(CPP_RPAREN,')',5,187), LexToken(CPP_WS,'\n',5,187)]

I think in particular the assessment of the lexer that 4_8 "dissolves" into an integer ('4') and an identifier ('_8') may be part of the problem: [..., LexToken(CPP_INTEGER,'4',5,184), LexToken(CPP_ID,'_8',5,185), ...] ...

Currently I think that this may cause the ##-logic to fall apart, because instead of a single token preceded or succeeded by ## now becomes two tokens in a row, preceded or succeeded by ##.

And a short test confirms that this suspicion may indeed be true. Replacing 4_8 by r_8 thereby making it a single (identifier type) token, we can even convince pcpp to behave nicely.

Temporary repro (to force the lexer to give us a single token):

#define fCASTr_8s(A) A
#define SECOND_LVL_CONCAT(SRC, REGSTYPE) fCAST##REGSTYPE(SRC)
#define FIRST_LVL_CONCAT(SRC, REGSTYPE)  SECOND_LVL_CONCAT(SRC, REGSTYPE##s)

FIRST_LVL_CONCAT(src,r_8)

Output for cpp remains correct (and the same), but for pcpp we now get:

$ pcpp --debug smaller.h |grep -vP '^\s*$'|grep -vP '^#line'
src

Unfortunately the pcpp_debug.log doesn't contain the output of how the last line is seen.

Just a thought: perhaps a notion like "debug levels" (--debug=n, defaulting to 1 when given) could help to turn this really verbose, which would be impractical for real-world preprocessing but invaluable for tracking down defects?!

Anyway, pdb to the rescue. This looks a lot better already ([..., LexToken(CPP_ID,'r_8',5,184), ...]):

(Pdb) p tokens
[LexToken(CPP_WS,'\n',4,162), LexToken(CPP_ID,'FIRST_LVL_CONCAT',5,163), LexToken(CPP_LPAREN,'(',5,179), LexToken(CPP_ID,'src',5,180), LexToken(CPP_COMMA,',',5,183), LexToken(CPP_ID,'r_8',5,184), LexToken(CPP_RPAREN,')',5,187), LexToken(CPP_WS,'\n',5,187)]

So the ~lexer~ parser.py is what needs fixing.

For reference this is the conditional breakpoint at preprocess.py:558 I set:

(Pdb) b
Num Type         Disp Enb   Where
1   breakpoint   keep yes   at /home/oliver/PCPP/pcpp/pcpp/preprocessor.py:558
        stop only if len(tokens) > 0
        breakpoint already hit 4 times

assarbad commented 1 year ago

Side note: Looking at def CPP_INTEGER(t) in parser.py I think this regex:

(((((0x)|(0X))[0-9a-fA-F]+)|(\d+))([uU][lL]|[lL][uU]|[uU]|[lL])?)

can be vastly simplified and at the same time expanded to support 0b prefixed binary integer literals and grouping with ':

(?i)((?:0b[01]+|0x[0-9A-F]+|[\d']+)(?:UL{0,2}|L{0,2}U)?)

We only appear to use a single capture group, so make the two remaining groups non-capturing. Also set mode to case-insensitive (works both on Python 2.7 and 3.x).

What do you think, @ned14?

ned14 commented 1 year ago

Seems plausible.

The hard part is coming up with a fix which doesn't break other things and is maintainable.

assarbad commented 1 year ago

The hard part is coming up with a fix which doesn't break other things and is maintainable.

Indeed, which is the reason why I didn't rush to fixing it. The code I saw during debugging has to sink in first and I need to get more acquainted with the code in general. Also I think it helps to aim for highest possible test coverage before attempting to tackle this (which is why I did this tox thingamy).

assarbad commented 1 year ago

FYI: The latest repro from this comment gets properly resolved with the changes from PR #80.

ned14 / pcpp

Incomplete macro reolving #71