ned14 / pcpp

A C99 preprocessor written in pure Python
Other
220 stars 41 forks source link

preprocessor: preserve line continuations #45

Closed MathieuDuponchelle closed 4 years ago

MathieuDuponchelle commented 4 years ago

Instead of concatenating lines ended with a backslash for further lexing, preserve those by considering them as just another type of whitespace. This causes yielded lex tokens for multiline macros to have an accurate reported line number.

The changes in the reference output for tests are legitimate.

MathieuDuponchelle commented 4 years ago

Actually, I don't know if the changes in the reference output are legitimate, or if line continuations should be stripped out at a later stage? Let me know if I should do that, for context I'm currently investigating whether pcpp could be used as a replacement for the macro / comment scanner in gobject-introspection (https://gitlab.gnome.org/GNOME/gobject-introspection/-/issues/346), and one feature that I consider a requirement is the ability to serialize object macros / function macros verbatim in the output XML, eg given:

#define foo() { \
  bar(); \
}

I would like the output xml to look similar to:

  <function-macro name="foo">{
  bar(); \
}</function-macro>
ned14 commented 4 years ago

Your PR as it currently is breaks conformance of output with clang's and GCC's preprocessors. The n_std-clang.i and nstd-gcc.i files are what those two preprocessors output. As you can see, n_std-pcpp.i is fairly close. You should be aware that I receive a lot of open issues and PRs to fix even the slightest deviation in whitespace from GCC or clang. Obviously a great number of people care a lot about whitespace being perfect, and not folding backslashes would be a showstopper for them.

That said, I'm not unsympathetic to your use case. I'd suggest not treating \ as whitespace, that could be problematic (imagine backslash character escapes). I'd suggest a new token type just for backslash + newline, making sure it can't collide with anything else. You'd then have to do a bunch of work to ensure that backslash + newline token acts mostly like whitespace, except where it doesn't, and then in the final rendition does collapse into whitespace without affecting collapse of neighbouring whitespace in any way. Then you'd add an option to inhibit that final collapse.

I'll be honest, I think that's a lot of work when you have an easier alternative: just apply clang-format to the extracted preprocessed output. That'll reformat your macros into the shape you had above i.e. backslashes after lines. It won't be one-one with the source, but it'll be probably even better from a docs generation perspective because it'll make bad original formatting pretty. I'll leave the choice up to you.

MathieuDuponchelle commented 4 years ago

Your PR as it currently is breaks conformance of output with clang's and GCC's preprocessors.

Yes that's what I suspected right after I posted it, hence my follow-up comment :)

Obviously a great number of people care a lot about whitespace being perfect, and not folding backslashes would be a showstopper for them.

Right, I don't want to break others' use cases of course.

That said, I'm not unsympathetic to your use case.

cool :)

You'd then have to do a bunch of work to ensure that backslash + newline token acts mostly like whitespace, except where it doesn't, and then in the final rendition does collapse into whitespace without affecting collapse of neighbouring whitespace in any way.

That sounds in line with what I had in mind (stripping line continuations at a later time), though my perhaps naive hope was that it wouldn't prove too difficult. I'll poke around some more.

Then you'd add an option to inhibit that final collapse.

For what it's worth, that's not actually an option I'd need, my current approach is to run Preprocessor.parse(), consume all the tokens, then iterate Preprocessor.macros, without calling Preprocessor.write().

I'll be honest, I think that's a lot of work when you have an easier alternative: just apply clang-format to the extracted preprocessed output. That'll reformat your macros into the shape you had above i.e. backslashes after lines. It won't be one-one with the source, but it'll be probably even better from a docs generation perspective because it'll make bad original formatting pretty. I'll leave the choice up to you.

That's an interesting point, but I think the correct approach at the gobject-introspection level is to serialize (as close as possible to) the original source, leaving formatting decisions up to tools further downstream. I'll think about it if the alternative proves too annoying / intrusive to implement, thanks :)

MathieuDuponchelle commented 4 years ago

I implemented your suggestion, the test suite now passes without modifications. I have also added a simple commit to propagate source / lineno information to the Macro instances, it is trivial so I figured I could keep it in that MR, let me know if you'd rather I make a separate one :)

MathieuDuponchelle commented 4 years ago

loks good to you @ned14 ?

ned14 commented 4 years ago

From a quick scan, it seems reasonable. But I'll need to find an hour somewhere to properly study it. That can take a while, alas working on pcpp cannot be claimed as work time in any capacity, so it needs to come from when I am not tired, in my free time.

MathieuDuponchelle commented 4 years ago

From a quick scan, it seems reasonable. But I'll need to find an hour somewhere to properly study it. That can take a while, alas working on pcpp cannot be claimed as work time in any capacity, so it needs to come from when I am not tired, in my free time.

Makes sense, I'm also doing this on my free time fwiw :)

MathieuDuponchelle commented 4 years ago

hey @ned14 , feeling like giving this a look again? :)

ned14 commented 4 years ago

hey @ned14 , feeling like giving this a look again? :)

It's not that I forgot you, it's more I have had other much higher important priorities. Basically my principle public server had a catastrophic hardware failure, all data lost. That was ten days ago, I have so little free time right now that I only got email fully functional again last night around 2am. But don't worry, I will get to this eventually, just be aware it may be when my children go back to school in September. After that, my free time situation will very dramatically improve.

MathieuDuponchelle commented 4 years ago

No problem, didn't mean to be pushy sorry, I hope things sort themselves out on your end eventually :) It's not a blocker for me as replacing the source scanner in gobject-introspection is a large undertaking anyway, and not something I get paid to do either, it is more of a long term project that should have many benefits in terms of maintenance, and a fun side project to hack on in my free time really.

I put up my WIP code at https://gitlab.gnome.org/GNOME/gobject-introspection/-/merge_requests/231 in case you're interested when you find free time again :)

ned14 commented 4 years ago

Honestly you've not been pushy at all. I've not been processing anything anywhere in the last few weeks. My wife takes her exams this week, so recently any time I am not at the day job I am on childcare taking them out of the house so she can study. It is what it is, just a bad time of year for "everything else". Once into September, I'll start clearing the long backlog of all open source PRs, issues, and email, including yours.

ned14 commented 4 years ago

I've substantially reworked this PR. Would it be possible for you to check if what I merged still works exactly for what you needs? Once again sorry for the very long processing time, my kids went back to school yesterday, so I finally got to clearing this amongst many PRs against my open source.

ned14 commented 4 years ago

No feedback after a month, closing.

MathieuDuponchelle commented 4 years ago

Erg @ned14 my turn to be sorry, I thought I'd answer later and never did, I will test your patch shortly