Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT

willwray commented 2 years ago

A simple proof of concept change that fixes #79. With it, pcpp can do codegen using the IREPEAT library.

I believe it's conceptually correct, but my Python may not be; please test this against your suite and review the method (hack) carefully. There's not much code! Mostly deletions.

The change removes CPP_INTEGER, effectively replacing it with PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for preprocessing purposes.

pp-number is sufficient for preprocessing to stage 4

The pp-number regex in the issue is incorrect, lifted from unpublished WG21 https://isocpp.org/files/papers/D2180R0.html "pp-number makes cpp dumber" (best proposal title ever).

Instead, I crafted a regex based on the lastest C++ draft https://eel.is/c++draft/lex.ppnumber#ntref:pp-number which accepts character ' as digit separator:

regex string r'.?\d(.|[\w]|\'[\w]|[eEpP][-+])*'

(also admits binary literals, with digit separator, of course, so they can now be added to the Value parsing code)

Only the conditional evaluator is required to interpret the numbers as integer constant expressions.

This is achieved by hacky means:

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

The idea is that if the parsed string p[1] can be interpreted as an integer constant-expression Value(p[1]) then do so, otherwise simply pass through the string for possible further pasting and processing.

A robust method might check p[1] against the CPP_INTEGER regex (removed in this commit) for a full match, consuming all input. On the other hand, relying on Value to validate the input while parsing and to raise an exception on failure may be Pythonic.

It seems that pp-number itself is a hack in the standard; I see no way to incorporate pp-number alongside INTEGER and FLOAT tokens meaningful in C; but then there's no need to. Happy Thanksgiving!

willwray commented 2 years ago

Added tests/issue0079.py FAILs pre-patch as expected

FAIL: runTest (tests.issue0079.pp_number_pasting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/will/repos/pcpp/tests/issue0079.py", line 21, in runTest
    self.assertEqual(p.return_code, 0)
AssertionError: 1 != 0

All python setup.py test tests pass post-patch.

Are there other tests to be run?

Should I commit the new generated lextab.py and parsetab.py?

ned14 commented 2 years ago

Should I commit the new generated lextab.py and parsetab.py?

If you wish.

It'll be next Spring at least before I can look at this properly.

willwray commented 2 years ago

ok, here are the generated tables, to pick or drop if and when merged.

Maybe @assarbad can review as it's closely related to his https://github.com/ned14/pcpp/issues/71#issuecomment-1315961357 also suggests tweaking the CPP_INTEGER token that this PR eliminates.

I may be missing something obvious... I see no need for CPP_INTEGER or CPP_FLOAT in a pure preprocessor. Am I missing some use-case that requires phase-7 processing, perhaps. The test-c stuff, and all the c- tests there...

If you authorise the PR action then it should confirm that the tests pass or flag up any issue I may have missed.

willwray commented 2 years ago

Thanks for coding, publishing and maintaining this project. I'd come across it before, forgotten, then found again.

For my use case, I'm actually more interested in a fix for issue #53 ... How easy do you gauge that to be? Maybe I learn some Python file system handling

assarbad commented 2 years ago

Thanks for coding, publishing and maintaining this project. I'd come across it before, forgotten, then found again.

For my use case, I'm actually more interested in a fix for issue #53 ... How easy do you gauge that to be? Maybe I learn some Python file system handling

I am also looking at issue #53 myself. And I'll review your patch, but not tonight.

assarbad commented 2 years ago

FYI: I looked at your willwray/pcpp:pp_number branch, @willwray, and tested it against the repro from #71. And lo and behold the issue is gone.

I'll now take a closer look at the actual changes over what's on master here.

assarbad commented 2 years ago

So we have the following bits (correct me if I got it wrong):

pp-number:
- digit
- . digit [ == \.\d ]
- pp-number identifier-continue
- pp-number ' digit [ += '\d ]
- pp-number ' nondigit [ += [a-zA-Z0-9_] ]
- pp-number e sign
- pp-number E sign
- pp-number p sign
- pp-number P sign
- pp-number .

... where:

digit corresponds to \d
identifier-continue corresponds to [\w_], which would include non-ASCII letter code points (at least the description seems to include other Unicode letter code points, as per XID_Continue and this)
nondigit corresponds to pp-number + [a-zA-Z0-9_] NB: interestingly as per the description this does not include other letters than a-z in lower and upper case

The rest is relatively self-explanatory.

The description you referenced already gives us a lot of leeway. And it may be appropriate to keep it that greedy and that broad. However, with the above remarks in mind I'd like to propose changing it to:

\.?\d(?:\.|[\w_]|'[\w_]|[eEpP][-+])*

This:

makes it non-capturing
removes the escaping of ' (was unnecessary)

We should probably also define what we expect to be matched here. Admittedly my proposed regex from #71 may be too strict.

My change does not address:

the difference between nondigit and identifier-continue
it doesn't make it "crispier", i.e. more to the point, such as really matching '(?:\d|[a-zA-Z0-9_]) which in the above is merely '[\w_]

What say you, @ned14 and @willwray?

willwray commented 2 years ago

Thanks @assarbad for your thorough detailed review. I'll aim to resolve all comments at the weekend.

ned14 commented 1 year ago

For me, if GCC and Wave's preprocessor use the same technique to solve a conformance problem, then it seems appropriate that pcpp can do the same.

Whether the technique is implemented correctly is another matter, but I'm sure we'll get there.

willwray commented 1 year ago

[ ] is the pcpp/evaluator.py change to p_expression_number kosher or a horrid hack?

The hack is highlighted in the top PR message and it's right up top of the listed Files changed tab view.
Please take a quick look. It's a few lines diff but doing it differently might take much more careful coding.

It effectively relies on Python exceptions for validating the Value(p[1]) evaluation.

p_expression_number originally used CPP_INTEGER as its input match then delegated direct to Value():

def p_expression_number(p):
    'expression : CPP_INTEGER'
    p[0] = Value(p[1])

p_expression_number now uses CPP_NUMBER as input match instead:

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]

Is it right and Pythonic to rely on exceptions to do the validation like this?

By the fact that the tests all pass, it appears that Value() is quite robust in parsing from CPP_NUMBER instead, and raising an exception if there's a failure to parse out the complete token character string. Perhaps there are more stress-test cases to run against this change.

I don't see a clean alternative way to do this that meshes properly with the grammar and parser., and still allows for pasting to form numeric literals from smaller pp-token parts.

willwray commented 1 year ago

[ ] the unicode question

The CPP_NUMBER regex definition invokes the specification of the pp-number regular expression that requires XID_Continue.

This raises further questions:

to what extent is pcpp already unicode-enabled?
what other token regex or code might call for unicodification?

Out of my depth, I asked a couple of questions on WG21 sg16 unicode study group mailing list.

I guess the regex should be extended to parse XID_Continue.
Does anyone here have a clue how to do that?

A regex engine that supports UTS#18 (Unicode Regular Expressions) rule RL2.7 would allow matching a character with the XID_Continue property with the syntax \p{XID_Continue}. Unicode Utilities example. Note that the example matches the identifier (but not the mathematical character nor the clown faces).
I don't think the Python regex engine supports that though, so you'll have to find another regex engine or another method to perform the match.
Clang is also going to allow the mathematical profile specified in L2/22-230 in identifiers. C++ may eventually allow them as well.

Any other pp-token that should be considered for unicodification?

Some other tokens incorporate an identifier user-defined-literal does.

The Python standard library re regular expression module notes: https://docs.python.org/3/library/re.html

See also The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.

assarbad commented 1 year ago

def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]
Is it right and Pythonic to rely on exceptions to do the validation like this?

The only un-Pythonic thing I see here is that the exception isn't caught by listing concrete exceptions or at least a base exception type. This is considered sort of an anti-pattern. As an example, this particular code should probably never raise a KeyError, but if it does it probably signals an actual logic error rather than something you may want to catch. I think in this case probably (but not 100% sure) you are looking for TypeError. If you have an example with this ambiguity a test could be used to validate your assumptions.

See https://docs.python.org/3/library/exceptions.html#base-classes

Otherwise I don't see much of an issue trying to treat something one way and, failing that, falling back to the alternative. AFAIK nothing un-Pythonic about that.

This raises further questions:

* to what extent is `pcpp` already unicode-enabled?
* what other token regex or code might call for unicodification?

As long as you stick to character classes such as \w/\W or \s/\S and so on, you're fairly safe with Python. It already considers Unicode to some extent. But probably not specifically to the extent of supporting Unicode character category XID_Continue. At least from what I found \p{categoryname} isn't supported by re. The case of t_CPP_ID for example seems to be approximated well enough.

See also The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.

I don't know. As an option this would probably be neat, but outright depending on it would not be too nice, given currently the set of external dependencies is quite limited. It's true that this may eventually become a concern, but as we have learned compilers already had to backpedal on allowed Unicode characters to some extent (changing direction and such), because otherwise editors would outright hide code. And as you noted yourself the debate is far from finished, so I think that bridge ought to be crossed when we get there, not necessarily right now.

Also keep in mind that other than for measuring header heft, pcpp probably is used for cases such as amalgamating libraries into single-header. This means latest if the compiler see the results of what pcpp spat out, you'll get the feedback whether something was wrong.

ned14 commented 1 year ago

I believe pcpp on Python 3 makes a reasonable attempt at Unicode. C++ has been evolving there though, as has C, and there are differences which are being reconciled. TBH I think a reasonable attempt is reasonable :)

willwray commented 1 year ago

I spent too long yesterday following up on Tom H's unicode links - my conclusion was that the security and stability concerns for languages, around unicode identifiers and exploits, are very unlikely to be concerns for pcpp as a pure preprocessor - it seems ok for pcpp to be lax in what it accepts as identifiers.

(p.s. It looks like that CI failure is on Python 2 due to the added StringIO context manager - I can revert that)

willwray commented 1 year ago

In particular, I like the idea that pcpp accept the mathematical profile identifiers that gcc currently accepts (rejects in -pedantic), clang 14 accepted and now rejects in 15 (but is adding an extension flag to support it) and that look likely to be incorporated in C++ once there's consensus.

ned14 commented 1 year ago

I go on annual vacation next week, I'm hoping some time will present itself to review all the open PRs everywhere across my open source. No promises, it's a hectic three weeks coming for me.

willwray commented 1 year ago

Thanks Niall; that'd be a nice surprise - don't stress it though.

I'll aim to understand more of the Value code and learn from @assarbad's review comment.

This PR I'm only really wanting in order to replace MSVC preprocessor with nicer output (fewer bogus empty lines).

It's the has_include feature I'm more interested in... :christmas_tree: :gift: :santa:

ned14 / pcpp

Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80