Open willwray opened 2 years ago
Added tests/issue0079.py
FAILs pre-patch as expected
FAIL: runTest (tests.issue0079.pp_number_pasting)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/will/repos/pcpp/tests/issue0079.py", line 21, in runTest
self.assertEqual(p.return_code, 0)
AssertionError: 1 != 0
All `python setup.py test` tests pass post-patch.
Are there other tests to be run?
Should I commit the new generated lextab.py and parsetab.py?
Should I commit the new generated lextab.py and parsetab.py?
If you wish.
It'll be next Spring at least before I can look at this properly.
ok, here are the generated tables, to pick or drop if and when merged.
Maybe @assarbad can review, as it's closely related to his https://github.com/ned14/pcpp/issues/71#issuecomment-1315961357 which also suggests tweaking the CPP_INTEGER token that this PR eliminates.
I may be missing something obvious... I see no need for CPP_INTEGER or CPP_FLOAT in a pure preprocessor. Am I missing some use-case that requires phase-7 processing, perhaps?
The test-c stuff, and all the c- tests there...
If you authorise the PR action then it should confirm that the tests pass or flag up any issue I may have missed.
Thanks for coding, publishing and maintaining this project. I'd come across it before, forgotten, then found again.
For my use case, I'm actually more interested in a fix for issue #53 ... How easy do you gauge that to be? Maybe I learn some Python file system handling
I am also looking at issue #53 myself. And I'll review your patch, but not tonight.
FYI: I looked at your willwray/pcpp:pp_number branch, @willwray, and tested it against the repro from #71. And lo and behold, the issue is gone. I'll now take a closer look at the actual changes over what's on master here.
So we have the following bits (correct me if I got it wrong):
* `.` digit == `\.\d`
* digit += `\d`
* `'` digit += `'\d`
* nondigit += `[a-zA-Z0-9_]`
* `e` sign
* `E` sign
* `p` sign
* `P` sign
* `.`
... where:
* `\d` matches a digit, which would include non-ASCII letter code points (at least the description seems to include other Unicode letter code points, as per XID_Continue and this)
* `[\w_]` likewise
* `[a-zA-Z0-9_]` NB: interestingly, as per the description this does not include letters other than a-z in lower and upper case

The rest is relatively self-explanatory.
The description you referenced already gives us a lot of leeway. And it may be appropriate to keep it that greedy and that broad. However, with the above remarks in mind I'd like to propose changing it to:
`\.?\d(?:\.|[\w_]|'[\w_]|[eEpP][-+])*`
This simplifies the `'` handling (the extra grouping was unnecessary). We should probably also define what we expect to be matched here. Admittedly, my proposed regex from #71 may be too strict.
My change does not address `'(?:\d|[a-zA-Z0-9_])`, which in the above is merely `'[\w_]`.
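The proposed pattern can be sanity-checked quickly with Python's `re` (the sample tokens are illustrative only; exactly which forms should be accepted is the open question above):

```python
import re

# The proposed pp-number pattern from the review comment above,
# tried against a handful of sample tokens.
PP_NUMBER = re.compile(r"\.?\d(?:\.|[\w_]|'[\w_]|[eEpP][-+])*")

# Tokens the pattern should accept in full:
for tok in ["123", ".5f", "1'000'000", "1.5e+10", "0x1p-3"]:
    assert PP_NUMBER.fullmatch(tok), tok

# Something that is not a pp-number at all:
assert PP_NUMBER.fullmatch("abc") is None
```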
What say you, @ned14 and @willwray?
Thanks @assarbad for your thorough detailed review. I'll aim to resolve all comments at the weekend.
For me, if GCC and Wave's preprocessor use the same technique to solve a conformance problem, then it seems appropriate that pcpp can do the same.
Whether the technique is implemented correctly is another matter, but I'm sure we'll get there.
pcpp/evaluator.py: is the change to p_expression_number kosher or a horrid hack? The hack is highlighted in the top PR message, and it's right at the top of the listed "Files changed" tab view. Please take a quick look. It's a few lines of diff, but doing it differently might take much more careful coding.
It effectively relies on Python exceptions for validating the Value(p[1]) evaluation.
p_expression_number originally used CPP_INTEGER as its input match, then delegated directly to Value():

```python
def p_expression_number(p):
    'expression : CPP_INTEGER'
    p[0] = Value(p[1])
```
p_expression_number now uses CPP_NUMBER as its input match instead:

```python
def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]
```
Is it right and Pythonic to rely on exceptions to do the validation like this?
By the fact that the tests all pass, it appears that Value() is quite robust in parsing from CPP_NUMBER instead, raising an exception if there's a failure to parse out the complete token character string. Perhaps there are more stress-test cases to run against this change.
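For illustration, the same EAFP fallback can be sketched outside the parser, with `int(x, 0)` as a stand-in for pcpp's Value() (a hypothetical simplification, not the real class):

```python
# EAFP fallback sketch: try to evaluate the pp-number; on failure,
# pass the raw token string through for possible later pasting.
# int(x, 0) here is a stand-in for pcpp's richer Value() class.
def evaluate_number(tok):
    try:
        return int(tok, 0)   # handles 10, 0x1F, 0o17, 0b101, ...
    except ValueError:
        return tok           # not (yet) a complete integer literal

print(evaluate_number("0x1F"))   # 31
print(evaluate_number("1e6f"))   # 1e6f (passed through)
```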
I don't see a clean alternative way to do this that meshes properly with the grammar and parser, and still allows for pasting to form numeric literals from smaller pp-token parts.
The CPP_NUMBER regex definition invokes the specification of the pp-number regular expression that requires XID_Continue.
This raises further questions:
* to what extent is `pcpp` already unicode-enabled?
* what other token regex or code might call for unicodification?

Out of my depth, I asked a couple of questions on the WG21 SG16 Unicode study group mailing list.
I guess the regex should be extended to parse XID_Continue. Does anyone here have a clue how to do that?

A regex engine that supports UTS#18 (Unicode Regular Expressions) rule RL2.7 would allow matching a character with the XID_Continue property with the syntax \p{XID_Continue} (see the Unicode Utilities example; note that the example matches the identifier, but not the mathematical character nor the clown faces). I don't think the Python regex engine supports that though, so you'll have to find another regex engine or another method to perform the match.
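One stdlib-only method worth noting: Python 3 identifiers themselves follow Unicode XID_Start/XID_Continue (per PEP 3131), so `str.isidentifier()` can approximate an XID_Continue membership test without a third-party regex engine. `is_xid_continue` below is a hypothetical helper, not pcpp code:

```python
# Approximate an XID_Continue membership test via Python's own
# identifier rules (PEP 3131: XID_Start then XID_Continue*).
def is_xid_continue(ch):
    # Prefix a known XID_Start character so that only ch's
    # XID_Continue membership decides the result.
    return ("a" + ch).isidentifier()

print(is_xid_continue("é"))   # a letter outside a-z
print(is_xid_continue("9"))   # digits continue identifiers
print(is_xid_continue(" "))   # whitespace does not
```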
Clang is also going to allow the mathematical profile specified in L2/22-230 in identifiers. C++ may eventually allow them as well.

Any other pp-token that should be considered for unicodification?
Some other tokens incorporate an identifier, as user-defined-literal does.
The Python standard library re regular expression module notes (https://docs.python.org/3/library/re.html):

See also The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.
```python
def p_expression_number(p):
    'expression : PP_NUMBER'
    try:
        p[0] = Value(p[1])
    except:
        p[0] = p[1]
```
Is it right and Pythonic to rely on exceptions to do the validation like this?
The only un-Pythonic thing I see here is that the exception isn't caught by listing concrete exceptions, or at least a base exception type. This is considered something of an anti-pattern. As an example, this particular code should probably never raise a KeyError, but if it does, that probably signals an actual logic error rather than something you may want to catch. I think in this case you are probably (but not 100% sure) looking for TypeError. If you have an example with this ambiguity, a test could be used to validate your assumptions.
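To illustrate the point with a hypothetical helper (not pcpp code): a bare `except:` would also swallow a KeyError that signals a genuine bug, whereas catching only the expected exception keeps such bugs loud:

```python
# Hypothetical helper: look up a token's text, then parse it as an
# integer constant, falling back to the raw text on parse failure.
# Catching only ValueError means a KeyError from a missing entry
# still surfaces as the logic error it is, instead of being hidden
# by a bare `except:`.
def lookup_then_parse(tok, table):
    text = table[tok]          # a KeyError here signals a real bug
    try:
        return int(text, 0)
    except ValueError:         # only the expected parse failure
        return text

print(lookup_then_parse("a", {"a": "0x10"}))  # 16
print(lookup_then_parse("b", {"b": "oops"}))  # oops
```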
See https://docs.python.org/3/library/exceptions.html#base-classes
Otherwise I don't see much of an issue trying to treat something one way and, failing that, falling back to the alternative. AFAIK nothing un-Pythonic about that.
This raises further questions:
* to what extent is `pcpp` already unicode-enabled?
* what other token regex or code might call for unicodification?
As long as you stick to character classes such as \w/\W or \s/\S and so on, you're fairly safe with Python. It already considers Unicode to some extent, but probably not specifically to the extent of supporting the Unicode character property XID_Continue. At least from what I found, \p{categoryname} isn't supported by re. The case of t_CPP_ID, for example, seems to be approximated well enough.
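A small check of that approximation (a sketch; the exact t_CPP_ID pattern in pcpp may differ):

```python
import re

# An identifier pattern in the spirit of pcpp's t_CPP_ID: a word
# character that is not a digit, followed by any word characters.
# Python 3's re treats \w as Unicode-aware by default.
IDENT = re.compile(r"[^\W\d]\w*")

for name in ["naïve", "größe2", "_π"]:
    assert IDENT.fullmatch(name), name

# An identifier may not start with a digit:
assert IDENT.fullmatch("2abc") is None
```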
See also The third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.
I don't know. As an option this would probably be neat, but outright depending on it would not be too nice, given currently the set of external dependencies is quite limited. It's true that this may eventually become a concern, but as we have learned compilers already had to backpedal on allowed Unicode characters to some extent (changing direction and such), because otherwise editors would outright hide code. And as you noted yourself the debate is far from finished, so I think that bridge ought to be crossed when we get there, not necessarily right now.
Also keep in mind that, other than for measuring header heft, pcpp is probably used for cases such as amalgamating libraries into a single header. This means that at the latest, when the compiler sees the results of what pcpp spat out, you'll get feedback on whether something was wrong.
I believe pcpp on Python 3 makes a reasonable attempt at Unicode. C++ has been evolving there though, as has C, and there are differences which are being reconciled. TBH I think a reasonable attempt is reasonable :)
I spent too long yesterday following up on Tom H's unicode links - my conclusion was that the security and stability concerns for languages, around unicode identifiers and exploits, are very unlikely to be concerns for pcpp as a pure preprocessor - it seems ok for pcpp to be lax in what it accepts as identifiers.
(p.s. It looks like that CI failure is on Python 2 due to the added StringIO context manager - I can revert that)
In particular, I like the idea that pcpp accept the mathematical profile identifiers that gcc currently accepts (rejects in -pedantic), clang 14 accepted and now rejects in 15 (but is adding an extension flag to support it) and that look likely to be incorporated in C++ once there's consensus.
I go on annual vacation next week, I'm hoping some time will present itself to review all the open PRs everywhere across my open source. No promises, it's a hectic three weeks coming for me.
Thanks Niall; that'd be a nice surprise - don't stress it though.
I'll aim to understand more of the Value code and learn from @assarbad's review comment.
This PR I'm only really wanting in order to replace MSVC preprocessor with nicer output (fewer bogus empty lines).
It's the has_include feature I'm more interested in... :christmas_tree: :gift: :santa:
A simple proof of concept change that fixes #79. With it, pcpp can do codegen using the IREPEAT library.
I believe it's conceptually correct, but my Python may not be; please test this against your suite and review the method (hack) carefully. There's not much code! Mostly deletions.
The change removes CPP_INTEGER, effectively replacing it with PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for preprocessing purposes.
pp-number is sufficient for preprocessing to stage 4
The pp-number regex in the issue is incorrect, lifted from unpublished WG21 https://isocpp.org/files/papers/D2180R0.html "pp-number makes cpp dumber" (best proposal title ever).
Instead, I crafted a regex based on the latest C++ draft https://eel.is/c++draft/lex.ppnumber#ntref:pp-number which accepts the character ' as digit separator:
regex string r'\.?\d(\.|[\w]|\'[\w]|[eEpP][-+])*'
(also admits binary literals, with digit separator, of course, so they can now be added to the Value parsing code)
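A quick check (illustrative only) that the pattern does admit binary literals with digit separators:

```python
import re

# The PR's pp-number regex (with the literal dots escaped):
PP_NUMBER = re.compile(r"\.?\d(\.|[\w]|\'[\w]|[eEpP][-+])*")

# Binary literals, with digit separators, are matched in full:
for tok in ["0b1010", "0b1100'1010", "0B1'0'1"]:
    assert PP_NUMBER.fullmatch(tok), tok
```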
Only the conditional evaluator is required to interpret the numbers as integer constant expressions.
This is achieved by hacky means:
The idea is that if the parsed string p[1] can be interpreted as an integer constant-expression Value(p[1]) then do so, otherwise simply pass through the string for possible further pasting and processing.
A robust method might check p[1] against the CPP_INTEGER regex (removed in this commit) for a full match, consuming all input. On the other hand, relying on Value to validate the input while parsing and to raise an exception on failure may be Pythonic.
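The full-match alternative could look roughly like this, with a hypothetical integer pattern standing in for the removed CPP_INTEGER regex (the real pcpp pattern differed in detail):

```python
import re

# Hypothetical stand-in for the removed CPP_INTEGER regex:
# hex, binary, or decimal digits plus optional u/U/l/L suffixes.
INTEGER = re.compile(r"(0[xX][0-9a-fA-F]+|0[bB][01]+|\d+)[uUlL]*")

def is_integer_token(tok):
    # LBYL: require the pattern to consume the whole token,
    # instead of relying on Value() raising on a partial parse.
    return INTEGER.fullmatch(tok) is not None

print(is_integer_token("0x1Ful"))  # True
print(is_integer_token("1e6f"))    # False
```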
It seems that pp-number itself is a hack in the standard; I see no way to incorporate pp-number alongside INTEGER and FLOAT tokens meaningful in C; but then there's no need to. Happy Thanksgiving!