vickenty / lang-c

Lightweight C parser for Rust
Apache License 2.0
202 stars 30 forks source link

Optionally switch Identifiers form std::String to a different type in order to support string interning #30

Open nacaclanga opened 3 years ago

nacaclanga commented 3 years ago

Hello,

as you probably are aware of, many compilers use string interning on identifier and type name tokens in order to avoid heap allocation and to allow for faster equality comparison between identifier names. Generally Strings are interned during tolkinization. Unfortunately lang-c currently only allows for identifiers and type names to be stored in std::String. As I don't want to change the default behaviour, introduce dependencies on external crates or bless a particular string interning library, I came up with a solution where the ast and parser are converted into generic functions. The only issue with this solution is, that I somehow have to work around rust-peg's limitations. This is done by some string substitution on the parser.rs.raw. While I tried my best to modify the code in a fashion, that avoids breaking the public API, I cannot guarantee, whether I succeeded.

I am fully aware, that this use of rust-peg could be considered a hack and you might be reluctant to merge this. Please fell free to close this pull request in this case. Given that you considered merging https://github.com/vickenty/lang-c/pull/18 and I wasn't certain, how this would effect my string substitution, I also made a version, where this change is applied, https://github.com/nacaclanga/lang-c/tree/generic_ident_conditional, which you could consider merging instead.

Merry Christmas

vickenty commented 3 years ago

Hi. Thank you for making this pull request.

Just out of curiosity, did you make any benchmarks to compare performance with and without an interner? I wonder what kind of impact this would have.

Do you have an example how to plug an interner into this using the new trait? I think not having a receiver in Name::get_from_str might be too restrictive, and force the interner to be stored in a thread-local or a global static.

I would be also fine just vendoring a simple interner, given it is justified in terms of performance.

nacaclanga commented 3 years ago

Hi, yes I was using a lasy_static RwCell global for the reciver, which isn't a very flexible approach. I now reworked my commit in such a way that the the parse function takes a interner by mutable reference that is then used to to intern the symbols. I did not do any benchmarks so far.

vickenty commented 3 years ago

I did some quick tests with dump (see 28fca4b34e3f48b82b786ad695886e781eed6fd2) and on a randomly generated source file. I got a modest ~4% improvement in instruction counts (664M vs 690M IRs), and a slightly larger ~10% improvement in memory usage (27Mb vs 30Mb). This is on parsing only, there isn't any kind of AST traversal.

Feel free to do your own benchmarks and add results here. I suspect benefits may be larger for the code that actually makes use of the AST.

But as it is now, to be honest, this doesn't seem to me like a big enough improvement to justify all the effort and the added complexity of making AST fully generic, and the extra work necessary to merge this (e.g. printer support).