jeff-hykin / better-cpp-syntax

💾 The source of VS Code's C++ syntax highlighting
GNU General Public License v3.0

Misc Conversation #185

Closed jeff-hykin closed 5 years ago

jeff-hykin commented 5 years ago

This issue is just a place for chat that doesn't necessarily relate to an issue. I'm going to close it immediately so that it doesn't add to the normal issue count, but it can still be used for general conversation.

jeff-hykin commented 5 years ago

Sorry I'm not more active. I'm working on finishing up some other responsibilities before the next semester starts, so it might be another week.

jeff-hykin commented 5 years ago

Saturday will prob be the next time I get to review branches and look at the other issues.

Between Doxygen and peaceshi's recent request for more pre-defined Microsoft macros, I think we should come up with an official list of the standards and C++ versions we support, and their priority if they ever conflict, along with URLs to their specs (C++20, GNU, Microsoft, Doxygen, Emacs banners, etc.).

matter123 commented 5 years ago

So I'm thinking something like:

  1. Current C++ standard (C++17)
  2. Current draft standard
  3. Compiler vendor extensions, sorted by popularity
     • #367
  4. Documentation comment standards, sorted by popularity/ease of implementing
     • #353
  5. Conventions of popular C++ projects
     • #8, #331

I'll try to find example issues/references for each in a bit.

jeff-hykin commented 5 years ago

I'll try to catch up on #368 and #339 tomorrow (edit: make that #359 as well). I probably should've focused on #339 instead of the LaTeX branch

jeff-hykin commented 4 years ago

For the gem, do you @matter123 have any suggestions for how to incorporate the TokenHelper functionality?

I was thinking that the keywords and token names could simply be embedded in the pattern, but there is a challenge when a separate pattern needs to match anything except tokens with specific qualities. We can add the TokenHelper as-is to the gem/lib, but I wanted to see if you had any ideas for a better solution. I'm going to change the TokenHelper so that it doesn't pollute the global namespace (it currently changes the negation operator for Symbol).

matter123 commented 4 years ago

TokenHelper is certainly a good idea, but I'm not sure it's actually related to grammar generation. I'm thinking it should be a separate gem.

jeff-hykin commented 4 years ago

Well, I do think it would be good to have one gem as an all-in-one toolbox for making a TextMate grammar. I'm pretty sure most grammars have keywords that are going to need to be negatively searched for in multiple places. The regex helpers like pattern.or(pattern) technically aren't really part of the grammar object either, and I've thought about making them into a separate regex library (unrelated to TextMate), but I think they should stay bundled as well. We could make a modular system: a grammar gem, a pattern/regex gem, a token-helper gem, a testing gem, and then a textmate_toolbox bundle combining all four.

What would you think if the tokens were actually more integrated by saving them on the grammar object?

jeff-hykin commented 4 years ago

Also I thought I'd let you know, I've been working on some tutorial blog posts to publish with the release of the gem to make it easier for people to contribute. It's prompted a decent number of design decisions/problems, so I created a better-c-syntax repo as a WIP place for how an ideal repo would be set up using the library. Right now it is very barebones.

I've been playing with git submodules, and I think they might be the best way to handle multi-language patterns. For example, better-c-syntax has an imports folder where it has this (cpp-textmate-grammar) repo as a submodule. That way, it can import functions from a specific version of the C++ repo, and changes that happen on the C++ repo don't have to affect C. Whenever desired, the C repo can easily pull the latest changes from C++ if there are bugfixes. The imports don't have to be languages either; we could create a shared pattern repo that both C++ and C could pull from.

matter123 commented 4 years ago

I'm not actually sure how well the grammar can be separated from the patterns. The grammar and its plugins rely a lot on functions of the patterns. By testing gem, do you mean the integration or unit tests? The integration tests kind of have to be JavaScript since, as far as I know, the only open source TextMate grammar parsers are in JavaScript (vscode-textmate, first-mate). Making the integration tester a separate npm package sounds reasonable; there are only a few grammar testers, and iirc none are automated.

The grammar storing tokens sounds like a good idea.

I'll check out better-c-syntax.

jeff-hykin commented 4 years ago

Sounds good, I'll work on refining/simplifying the token helper and moving it into the grammar. It might be a good idea for the tokens to be created near their relevant code and imported by the other pieces that need them, so I'll consider that.

By testing I did mean integration, and yes, I agree it will have to be JavaScript. However, there are some options. The Ruby gem could prompt the user to install the JavaScript package through npm, but even simpler than that, there's a decent number of ways to package JavaScript into a cross-platform executable that could be shipped with the Ruby gem.

I realize I should maybe clarify: the pattern-matching of better-c-syntax is what is barebones, but the folder structure has most of the pieces (although I'm still very uncertain that it is the best format). I feel that whatever structure is decided upon might last for a long time, so I hope to make it future-proof and extendable. If languages are going to be importing other languages, it will be quite useful to have a standard folder structure.

jeff-hykin commented 4 years ago

Sorry I've been gone for a while, @matter123. I'll probably be back some this next week for merging and catching up on everything.

jeff-hykin commented 4 years ago

I came up with a way to integrate the TokenHelper last night. (Current code is broken but I'll push the local fix later today).

Instead of tokens being separate objects, I figured let's just add the functionality to Pattern objects. Patterns can have adjectives that allow them to be collected in groups.

grammar[:addition_operator] = Pattern.new(
    match: /\+/,
    adjectives: [ :operator, :arithmeticOperator ],
)

To retrieve/collect "tokens" I created an abbreviated syntax. I'm a little unsure that this is the best way, but I couldn't find a different concise way to do it without hacking into the existing built-in classes.

all_arithmetic_operators = grammar[/arithmeticOperator/]

all_other_operators = grammar[/operator && !arithmeticOperator/]

It uses placeholders so that the tokens are resolved at the very end, and because of that (and because it's global) it only searches through patterns that are directly named on the grammar repository. I'm not sure if that will be unintuitive though, so that behavior might need to be changed.
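
The use case I care about is feeding those collections into other patterns, e.g. something roughly like this (just a sketch; whether the selector placeholders will actually work inside helpers like lookAheadToAvoid is still TBD):

# sketch only: negatively search for every pattern tagged with the :operator adjective
all_operators = grammar[/operator/]
grammar[:identifier] = Pattern.new(
    match: lookAheadToAvoid(all_operators).then(/[a-zA-Z_]\w*/),
    tag_as: "variable.other",
)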

I'll probably post something later today going into more detail.

jeff-hykin commented 4 years ago

@matter123 I could use some help on the oniguruma engine.

The replacement system for tokens is almost done. I was able to use regex conditionals to implement an awesome dont_match: feature that will replace a lot of the negative lookbehinds. The regex conditionals work in VS Code itself, but they're failing with npm run gen.

I updated oniguruma and vscode-textmate but it still fails. I'm a bit confused since I thought that was the same engine VS Code itself was using.
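
The kind of trick I mean looks roughly like this in plain Ruby (a simplified illustration, not the actual dont_match: code, and the keyword list here is made up):

# capture the words we don't want into a named group (inside an atomic group so it
# can't be backtracked away), then use a conditional to force a failure whenever
# that group actually participated in the match
identifier_but_not_keyword = /\b(?>(?<reserved>(?:if|else|while)\b)|\w+\b)(?(<reserved>)(?!))/

p identifier_but_not_keyword.match("count")   # matches "count"
p identifier_but_not_keyword.match("while")   # => nil, the conditional forces a failure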

matter123 commented 4 years ago

So VS Code just switched (a few weeks ago) from oniguruma to vscode-oniguruma (a WASM build) for its regex engine, and this updated the version of the engine that is being used. I'm pushing the branch Misc/update/vscode-textmate with an initial update for that.

I'm going to dinner but I can work some more in about an hour.

jeff-hykin commented 4 years ago

Cool! I'm glad you knew what was up. Take your time with dinner 👍

matter123 commented 4 years ago

@jeff-hykin I just pushed the minimal amount of work needed to make npm run gen work.

Of note, npm run perf is no longer functional as the vscode-oniguruma module does not export OnigRegExp needed by test/source/report/onig_scanner.js. The solution to this is to pin the version of https://github.com/NeekSandhu/onigasm to be the same as what vscode-oniguruma uses internally.

jeff-hykin commented 4 years ago

Thanks! That did the trick for Misc/tokens, it all generates fine now. I'm glad you merged it too cause I had been attempting unsuccessfully to get it working based off that repo (https://github.com/NeekSandhu/onigasm).

Thanks for the note about perf too.

matter123 commented 4 years ago

Quickly looked at the Multi/tokens branch, Looks Good. I left a few comments, but I'll give a more thorough code review when the PR is made.

matter123 commented 4 years ago

The most recent push to Misc/update/vscode-textmate restores the ability to run npm run perf. I decided against using onigasm, as their embedded version of oniguruma is 6.8.2 while vscode-oniguruma's embedded version is 6.9.5. Instead I utilized the fact that OnigScanner with a single pattern acts very similarly to OnigRegExp.

The new version is, however, noticeably slower (~18.3s with the previous version, ~23s with the new one); I am unsure what the cause is.

jeff-hykin commented 4 years ago

Quickly looked at the Multi/tokens branch, Looks Good. I left a few comments, but I'll give a more thorough code review when the PR is made.

Sounds good, the whole branch is definitely a big WIP. You might want to look at the latest cpp/generate.rb to see the usage of the new selector system and whether there are any design concerns. I ran into some small design problems, but overall I think it's really helping clean up the code. Outside of that, I'll probably have --Jeff in the comments on anything that is notable.

As a side note: when there are grammar-usage bugs, some of them can be really hard to debug; they fail far away from their source with an unrelated error. So I might work on adding more checks, logging, and descriptive error messages before release. Examples:

./patterns/comments.rb

grammar.exports = [
    :comment, # NOTE: no "s" AKA not :comments
    :inline_comment,
]

generate.rb

grammar.import("./patterns/comments.rb")
# NOTE: there is an "s"
grammar[:comments] # returns a placeholder even though it shouldn't 
grammar[:comments] = grammar[:comments] 
matter123 commented 4 years ago

should the grammar complain if an imported grammar's exports end up not being used?

as far as infinite recursion in placeholders, do you think it is worthwhile to implement some sort of cycle detection algorithm?

I'll see if I can figure out what's going on with dont_match.

jeff-hykin commented 4 years ago

I'll see if I can figure out what's going on with dont_match.

I did make partial progress on this. /(?<blah_group>)()\g<1>/ throws a "numbered backref/call is not allowed. (use name)" error, and dont_match uses named capture groups. So in the start_match_empty.rb linter, when it goes from a string to a regex with Regexp.new(pattern.start_pattern.evaluate.gsub("\\G", '\uFFFF')), it throws that error.
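
For reference, that error reproduces directly in plain Ruby:

# raises RegexpError: numbered backref/call is not allowed. (use name)
Regexp.new('(?<blah_group>)()\g<1>')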

Idk if Oniguruma will also throw an error with numbered backrefs and named capture groups, but it probably will since Ruby's regex engine is based on it. Idk why that wouldn't be allowed in Ruby either.

jeff-hykin commented 4 years ago

The most recent push to Misc/update/vscode-textmate restores the ability to run npm run perf. I decided against using onigasm, as their embedded version of oniguruma is 6.8.2 while vscode-oniguruma's embedded version is 6.9.5. Instead I utilized the fact that OnigScanner with a single pattern acts very similarly to OnigRegExp.

Thanks 👍 I'll probably test it out in the next few days to see exactly what you mean with the versions

jeff-hykin commented 4 years ago

as far as infinite recursion in placeholders, do you think it is worthwhile to implement some sort of cycle detection algorithm?

Yeah, we can make a placeholder set {} that is empty at the top level. Then when a placeholder is being resolved it adds the PlaceholderPattern object to the set, and if it's already in the set, then it's an infinite recursion.
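
Something roughly like this (the helper names are just illustrative, not the actual API):

require 'set'

def resolve_placeholder(placeholder, repository, seen = Set.new)
    # the set starts empty at the top level and gets passed down as placeholders resolve
    raise "infinite recursion while resolving #{placeholder}" if seen.include?(placeholder)
    seen.add(placeholder)
    target = placeholder.lookup(repository) # hypothetical helper that finds the real pattern
    target.is_a?(PlaceholderPattern) ? resolve_placeholder(target, repository, seen) : target
end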

matter123 commented 4 years ago

Slightly simpler might be to borrow from DFS and mark each placeholder node as "not resolved", "resolved", or "resolving"; attempting to resolve a "resolving" placeholder is a loop. Otherwise I believe you would have to clear the set between resolves to prevent false positives on already resolved patterns.
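
Roughly (made-up helper names, not the gem's actual API):

def resolve(key, repository, state = {})
    return repository[key] if state[key] == :resolved
    raise "placeholder cycle detected at :#{key}" if state[key] == :resolving

    state[key] = :resolving
    # hypothetical helper that yields every placeholder key this pattern refers to
    repository[key].each_placeholder_key do |nested_key|
        resolve(nested_key, repository, state)
    end
    state[key] = :resolved
    repository[key]
end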

jeff-hykin commented 4 years ago

as far as infinite recursion in placeholders, do you think it is worthwhile to implement some sort of cycle detection algorithm?

There might also be a way to do it before any loop/recursion by changing the resolve strategy. Every pattern in the repository, like grammar[:thing] has to point to a real pattern by the end. If grammar.repository[:thing_being_resolved] isn't a real pattern by the time the grammar is being compiled, an error could be thrown then.

Something like

# Grammar#compile_to_json
for k, v in grammar.repository
    raise "placeholder :#{k} was never resolved" if v.is_a?(PlaceholderPattern)
end

matter123 commented 4 years ago

So looking at the new resolve_placeholder transformation, I am not sure how it is recursive.

if repository[arguments[:placeholder]] is itself a placeholder, wouldn't it be a placeholder by the end?

Imagine the following grammar:

g[:a] = Pattern.new(/a/).then(g[:b])
g[:b] = Pattern.new(/b/).then(g[:c])
g[:c] = Pattern.new(/c/)

Prior to calling save, the grammar has the following:

{
    a: Pattern.new(/a/).then(PlaceholderPattern(:b)),
    b: Pattern.new(/b/).then(PlaceholderPattern(:c)),
    c: Pattern.new(/c/),
}

save_to is then called, which results in the resolve_placeholder plugin running. It encounters :a and rewrites :a to Pattern.new(/a/).then(Pattern.new(/b/).then(PlaceholderPattern(:c))). When it encounters :b, the placeholder for :c is resolved, but that isn't propagated back up to :a.

Edit: does #map! end up traveling through the new :match and resolve them?

jeff-hykin commented 4 years ago

if repository[arguments[:placeholder]] is itself a placeholder, wouldn't it be a placeholder by the end?

I'm not 100% sure

repository[arguments[:placeholder]] = PlaceholderPattern.new(*args)

unless repository[arguments[:placeholder]].is_a? PatternBase
    raise ":#{arguments[:placeholder]} is not a pattern and cannot be substituted"
end
#
# infinite __deep_clone__ recursion I think
#
each_pattern_like.arguments[:match] = repository[arguments[:placeholder]].__deep_clone__
jeff-hykin commented 4 years ago

Edit: does #map! end up traveling through the new :match and resolve them?

This I'm not sure of either, which is partly why I worked on creating .recursive_pattern_chain so that a list could be generated (and basically frozen) first before starting to resolve/transform things

matter123 commented 4 years ago

I don't think it can be frozen, several transformations alter the chain fairly significantly.

jeff-hykin commented 4 years ago

I don't think it can be frozen, several transformations alter the chain fairly significantly.

Sorry I meant just top-level. Basically no elements are added/removed/re-ordered on the list, but each can be mutated. Is that what you meant or are you saying things do get added dynamically?

jeff-hykin commented 4 years ago

I pushed the recursive bug to master on the experimental better-c-syntax repo if you want to see the stack overflow bug. The grammar only has 1 pattern in it at the moment. I'm still not sure how it's being caused by repository[arguments[:placeholder]].__deep_clone__, so I imagine it might have more to do with #map!

matter123 commented 4 years ago

Freezing at the top level should cover any existing plugins, but ideally bailout would be part of the grammar, and that is going to add more names to the grammar. There could be two stages: in the first stage the pattern objects are immutable but the repository is mutable, and in the second stage it's the opposite.

jeff-hykin commented 4 years ago

Freezing at the top level should cover any existing plugins, but ideally bailout would be part of the grammar, and that is going to add more names to the grammar. There could be two stages: in the first stage the pattern objects are immutable but the repository is mutable, and in the second stage it's the opposite.

Yeah that might work since the bailout should be independent of any placeholders or other resolutions

matter123 commented 4 years ago

I pushed the recursive bug to master on the experimental better-c-syntax repo if you want to see the stack overflow bug. The grammar only has 1 pattern in it at the moment. I'm still not sure how it's being caused by repository[arguments[:placeholder]].__deep_clone__, so I imagine it might have more to do with #map!

Is grammar[:comments] = grammar[:comments] a simplification of a more complex recursion issue, or is it semantically meaningful?

jeff-hykin commented 4 years ago

Well a few things happened. (I pushed 1 more change (actually a revert))

  1. The code was further removed than that; it looked more like comment = grammar[:comments], followed later by grammar[:comments] = comment
  2. It was originally using imports: grammar[:comments] = Grammar.import("./patterns/comments.rb")[:comments]
  3. There is a misspelling; it was supposed to be grammar[:comments] = Grammar.import("./comments.rb")[:comment]

The misspelling is important because it tried to access a non-exported pattern.

matter123 commented 4 years ago

I pushed a change that eliminates the issue with the recursion in deep clone. With the change, this is the new error for better-c-syntax:

Traceback (most recent call last):
    13: from main/generator.rb:75:in `<main>'
    12: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/grammar.rb:495:in `save_to'
    11: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/grammar.rb:329:in `generate'
    10: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/grammar.rb:260:in `run_pre_transform_stage'
     9: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/grammar.rb:260:in `each'
     8: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/grammar.rb:261:in `block in run_pre_transform_stage'
     7: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/grammar.rb:261:in `transform_values'
     6: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/grammar.rb:278:in `block (2 levels) in run_pre_transform_stage'
     5: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/transforms/resolve_placeholders.rb:10:in `pre_transform'
     4: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/pattern_variations/base_pattern.rb:162:in `map!'
     3: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/transforms/resolve_placeholders.rb:21:in `block in pre_transform'
     2: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/pattern_variations/base_pattern.rb:645:in `=='
     1: from /Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/pattern_variations/base_pattern.rb:640:in `eql?'
/Users/fosdick/cpp-textmate-grammar/gem/lib/textmate_grammar/pattern_extensions/placeholder.rb:41:in `to_tag': Attempting to create a tag from an unresolved placeholder `:comments' (RuntimeError)
jeff-hykin commented 4 years ago

Based on that doc you sent earlier, should we be using \k for backreferences instead of \g?

matter123 commented 4 years ago

backreferences are \N where N is a number. \g is used for subexpressions.
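
e.g. in plain Ruby (toy patterns just to show the difference):

# \1 re-matches the exact text group 1 captured; \g<1> re-runs group 1's pattern
backref     = /(ab|cd)\1/
subexp_call = /(ab|cd)\g<1>/
p backref.match("abab")       # backreference: needs the same captured text again
p subexp_call.match("abcd")   # subexpression call: re-runs the pattern, so "cd" is accepted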

jeff-hykin commented 4 years ago

Does the cpp code have subexpressions?

jeff-hykin commented 4 years ago

I thought \N was problematic for values > 9 so \g<N> had to be used instead

matter123 commented 4 years ago

Just the template angle brackets, but that's planned to be removed, right?

matter123 commented 4 years ago

I thought \N was problematic for values > 9 so \g had to be used instead

the docs say n is valid when (n >= 1)

jeff-hykin commented 4 years ago

Oh yeah I forgot I used it there, maybe that's where I'm seeing the \g from.

As a side note, a lot later on, we could probably use subexpressions to compress and maybe optimize the regex for single patterns, based on how they reuse patterns inside themselves.
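
e.g. something like this (toy example, not taken from the actual grammar):

# \g<name> re-runs the named subexpression instead of copy/pasting the pattern
number = /(?<digits>\d+)(?:\.\g<digits>)?/
p number.match("3.14")   # the fractional part reuses the digits subexpression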

jeff-hykin commented 4 years ago

I thought \N was problematic for values > 9 so \g had to be used instead

the docs say n is valid when (n >= 1)

Hmm, I wonder how you're supposed to match a backreference followed by a digit literal. Non-capture groups?

\11 vs \1(?:1)

That could create some edge case bugs, depending on how we replace the backreferences:

$backref(1)1 converted to \11 instead of \1(?:1)

matter123 commented 4 years ago

Backreferences actually become (?:\1)1

RedCMD commented 2 years ago

I guess just for future reference: \N or \k<N> is a backreference, with N >= 1. It will consume all digits and cannot have leading 0's: \011 is code point 9 (tab), not capture group 11. https://github.com/kkos/oniguruma/blob/d484916f3ca2b6bbc81accba63054625dfe26b6b/doc/RE#L411

\N is octal, but only 1-3 digits, which must all be [0-7], and N must be at least 1 higher than the number of capture groups available. It does not consume any literal digits after it. https://github.com/kkos/oniguruma/blob/d484916f3ca2b6bbc81accba63054625dfe26b6b/doc/RE#L24

Has there been any development in the performance loss of applying a pattern inside a capture item in the C highlighter? https://github.com/microsoft/vscode-textmate/issues/167

jeff-hykin commented 2 years ago

\k<N> is a backreference

@RedCMD Yeah I probably should've known. I went ahead and updated the grammar builder library with that fix.

Has there been any development in the performance loss of applying a pattern inside a capture item in the C highlighter?

Sadly no