jeff-hykin / better-cpp-syntax

💾 The source of VS Code's C++ syntax highlighting
GNU General Public License v3.0
154 stars 29 forks source link

Publish a textmate gem #290

Open jeff-hykin opened 5 years ago

jeff-hykin commented 5 years ago

I'm creating this as a checklist for publishing the gem

matter123 commented 5 years ago

We may want to reevaluate generating *.tmLanguage.yaml files. With, afaik, no text editor or tool supporting yaml language, and with the JSON files being pretty-printed, I not sure what their purpose is.

jeff-hykin commented 5 years ago

yeah. I think its safe to remove them, I only added yaml because I didn't pretty print the json.

matter123 commented 5 years ago

Fairly minor concern but

create a Pattern class instead of having newPattern use regex

would prevent /foo/.maybe(bar) from working, correct?

If so, there are 33 instances of that that would need to be replaced.

jeff-hykin commented 5 years ago

Yeah, and that was the original reason that the pattern class was never created. If I made it a class the idea would be to clean everything up and let it be symbolic, and then only convert it to regex once to_tag was called. But I'm not sure it's worth all the trouble. It'd require a lot of library re writing just to make it a bit more straightforward under the hood

jeff-hykin commented 5 years ago

I've also been debating the snake case switch. I prefer static data and like-static-data to be snake case and functions to be camel, but Ruby uses snake for both. If I switched it would only be for formatting consistency.

matter123 commented 5 years ago

It seems the majority of the uses for starting with a regexp is looking for a literal (17), with anchors being the nxt largest (6).

matter123 commented 5 years ago

The warning for newlines is probably too overly broad to be useful. In particular, the new include pattern trips up my initial warning.

Output:

There is a pattern that likely tries to match characters after \n
textmate grammars only operate on a single line, \n is the last possible character that can be matched.
Here is the pattern:
((?:^)((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))((#)\s*((?:(?:include|include_next)|import))\b)\s*(?:(?:(?:((<)[^>\n]*(>?)((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))(?:(?:\n|$)|(?=\/\/)))|((\")[^\"\n]*(\"?)((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))(?:(?:\n|$)|(?=\/\/))))|((?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))(?:(?:\n|$)|(?=\/\/))))|((?:(?:(?>\s+)|(\/\*)((?>(?:[^\*]|(?>\*+)[^\/])*)((?>\*+)\/)))+?|(?:(?:(?:(?:\b|(?<=\W))|(?=\W))|\A)|\Z)))(?:(?:\n|$)|(?=\/\/))))

Newline detection:

# check for matching after \n
if /\\n(.*?)(?=!\||\\n)/ =~ regex_as_string
    if /[^\^$\[\]\(\)?:+*=!<>\\]/ =~ $1

        puts "\n\nThere is a pattern that likely tries to match characters after \\n\n"
        puts "textmate grammars only operate on a single line, \\n is the last possible character that can be matched.\n"
        puts "Here is the pattern:\n"
        puts regex_as_string
        # raise "Error: see printout above"
    end
end
matter123 commented 5 years ago

Does

check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()

Refer to the fact that the pattern below will only give the last foo the scope of test.foo? If so, I think adding a warning about that would harm usability, composing patterns should be able to work without having to call turn_of_numbered_capture_groups whenever you want to match the pattern multiple times.

grammar[:test] = newPattern(
    match: oneOrMoreOf(newPattern(match:/foo/, tag_as:"test.foo").maybe(@spaces)),
    tag_as: "test"
)
jeff-hykin commented 5 years ago

Does

check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()

Refer to the fact that the pattern below will only give the last foo the scope of test.foo? If so, I think adding a warning about that would harm usability, composing patterns should be able to work without having to call turn_of_numbered_capture_groups whenever you want to match the pattern multiple times.

grammar[:test] = newPattern(
    match: oneOrMoreOf(newPattern(match:/foo/, tag_as:"test.foo").maybe(@spaces)),
    tag_as: "test"
)

Yeah that was the problem I was referring to. I was thinking of automatically disabling the tags that are inside the oneOrMoreOf() or zeroOrMoreOf() and then automatically adding an includes: that has the repeated pattern one time, which I believe will then correctly tag all of the repetitions.

jeff-hykin commented 5 years ago
\\n(.*?)(?=!\||\\n)

I was thinking it would be a warning just like the zero-length match warning. Its given, but it can be disabled because sometimes it's valid.

That said I'm not sure why there's a [^>\n] inside the pattern. There's other \n's too but [^>\n] should just be [^>]

jeff-hykin commented 5 years ago

Fairly minor concern but

create a Pattern class instead of having newPattern use regex

would prevent /foo/.maybe(bar) from working, correct?

If so, there are 33 instances of that that would need to be replaced.

We could just do

class Pattern < Regexp
    def initialize(arguments)
        self.then(arguments)
    end
end

That we could use Pattern.new in the guides/explanations to be consistent with PatternRange.new()

jeff-hykin commented 5 years ago

I think theres no way to do the use the turnOffNumberedCaptureGroups on every regex pattern without having a refactored Pattern class. Right now there's no way to fully tell the difference between pure regex and pattern-combined regex.

The next closest thing would be to count the number of capture groups there should be, and the number of capture groups their actually are.

The to_s method (for printing out a pattern the same way it looked in the code) can't be done without a refactored Pattern class either.

matter123 commented 5 years ago

That we could use Pattern.new in the guides/explanations to be consistent with PatternRange.new()

That sounds good.

I was thinking it would be a warning just like the zero-length match warning. Its given, but it can be disabled because sometimes it's valid.

That said I'm not sure why there's a [^>\n] inside the pattern. There's other \n's too but [^>\n] should just be [^>]

I'll clean up the include pattern, add the ability to disable the warning and push that branch.

matter123 commented 5 years ago

Another check that might be worth considering is wrong ordering of patterns with a common prefix.

If a pattern set contains the following patterns:

...
[A-Z_]+       # pattern 1
[A-Z_]+\s+foo # pattern 2
...

Pattern 2 will never match because it has a common prefix with pattern 1 and pattern 1 is first.

jeff-hykin commented 5 years ago

Another check that might be worth considering is wrong ordering of patterns with a common prefix.

If a pattern set contains the following patterns:

...
[A-Z_]+       # pattern 1
[A-Z_]+\s+foo # pattern 2
...

Pattern 2 will never match because it has a common prefix with pattern 1 and pattern 1 is first.

Do you think this is detectable? I guess check for a perfect subset at the beginning of two patterns.

matter123 commented 5 years ago

Yeah,

for any pattern A of the pattern set 2..N:
    for each pattern B in the range 1..A-1:
        if B is a prefix of A, issue a warning/error

It would need some handling for extra parens, but even without that, it should catch the common case of [patternA, patternA.then(patternB)].

matter123 commented 5 years ago

the zeroLengthStart is a bit over-cautious, in particular, \G is usually intentional, and as such \G should not be enough to trigger the warning.

One possible solution is to do what vscode-textmate does when attempting to not evaluate \G, that is, replace \G with \uFFFF.

matter123 commented 5 years ago

Depending on whether you plan on supporting generating atom themes, you may want to split tag_as by the space character and generate multiple captures. See https://github.com/microsoft/vscode-textmate/issues/108#issuecomment-522197256

matter123 commented 5 years ago

It's somewhat of an odd case, but should the grammar generator detect when you are using a start of line anchor inside a sub match, and then rewrite it so that it works?.

Example: Orignal (non working):

Pattern.new(
  match: /\W+/,
  includes: [
    Pattern.new(match: /^bar/, tag_as: "bar")
  ]
)

Rewritten (working)

Pattern.new(
  match: /\W+/,
  includes: [
    PatternRange.new(
      start_pattern: lookAheadFor(/./),
      end_pattern: /$/,
      includes: [
        Pattern.new(match: /\Gbar/, tag_as: "bar")
      ]
    )
  ]
)

It is fairly unintuitive that /^/ doesn't work inside sub captures. And equally unituitive that the solution is to wrap it up in a PatteernRange.

jeff-hykin commented 5 years ago

Depending on whether you plan on supporting generating atom themes, you may want to split tag_as by the space character and generate multiple captures. See microsoft/vscode-textmate#108 (comment)

I think this is a perfect situation for the library to handle (to support Atom), but it is low priority so we should put the effort in after the gem is published.

jeff-hykin commented 5 years ago

It's somewhat of an odd case, but should the grammar generator detect when you are using a start of line anchor inside a sub match, and then rewrite it so that it works?.

Example: Orignal (non working):

Pattern.new(
  match: /\W+/,
  includes: [
    Pattern.new(match: /^bar/, tag_as: "bar")
  ]
)

Rewritten (working)

Pattern.new(
  match: /\W+/,
  includes: [
    PatternRange.new(
      start_pattern: lookAheadFor(/./),
      end_pattern: /$/,
      includes: [
        Pattern.new(match: /\Gbar/, tag_as: "bar")
      ]
    )
  ]
)

It is fairly unintuitive that /^/ doesn't work inside sub captures. And equally unituitive that the solution is to wrap it up in a PatteernRange.

Yeah I think this is another perfect case for the library to handle. Like with the Atom support though, handle it after the gem is published

matter123 commented 5 years ago

Since quite a few check items and potential additions perform transformations on the patterns, there should be some way of enabling logging whenever a transformation occurs.

jeff-hykin commented 5 years ago

I think I'm going to try to publish this as a pre-alpha gem sometime this week. Its the first week of the semester and I want to try to re-structure this repo (get each language in a separate repo, and coordinate the change with Alexr00) before the semester gets into full swing, even though the checklist is far from done

The only thing I want to change before that is the import/export method. I think I'm going to change it to something like

require 'textmate_tools'

grammar = Grammer.get_exportable_grammar
grammar.exports = [
    :repo_name_1,
    :repo_name_2,
]
grammar.external_repos = [
    :std_space,
    :variable,
]

grammar[:repo_1] = Pattern.new()

The .exports are the names of the repos that should be exported The .external_repos are the ones you expect to already be defined by something else Every time that grammar sees a repo name (even in the includes of a pattern), if its not in the .exports and not in the .external_repos, it will create a mapping from that repo name to a randomly generated number so that the name never conflicts with a name from another package.

matter123 commented 5 years ago

That sounds pretty nice. I have two concerns:

  1. The mapping should probably be stateless;
  2. Rerunning npm run build should not change the generated grammar(builds should be reproducible)

so I'm thinking:

jeff-hykin commented 5 years ago

Yeah salt based on file name/path would be ideal

matter123 commented 5 years ago

With a unified export, to another grammar or to a file, how does an import look like?

jeff-hykin commented 5 years ago

I was actually thinking they would be different (exporting a grammar to JSON vs exporting patterns)

matter123 commented 5 years ago

Ok so: Grammar::import(path) imports a json file to a grammar Grammar#save(path) saves a grammar to a json file Grammar#import(path_or_grammar_object) merge another grammar into this one Grammar::get_exportable_grammar returns a new ExportableGrammar

matter123 commented 5 years ago

Is importing a grammar from a path a useful feature? As it requires either converting a tag back into a pattern, or allow the grammar to store patterns as a tag.

jeff-hykin commented 5 years ago

These were the only ones I was planning on having

a_grammar.import(path)
a_grammar.save_to(tm_language_path)
Grammar.get_exportable_grammar()

I think importing from a file is ideal because it prevents namespace pollution and guessing.

For example with with Doxygen, I didn't know what variables/functions it contributed. I had to open up the file and actually read the whole thing to realize it was really just exporting the doxygen function. When external files contribute common names like line_comment, they run the risk of accidentally overriding or accidentally getting overridden by something in generate.rb.

matter123 commented 5 years ago

is it Grammar#import("/path/to/file.rb") or Grammar#import("/path/to/file.tmLanguage.json")?

jeff-hykin commented 5 years ago

Oh I see, yeah there will need to be some name differentiating. I think there should be a way to import a json-tmLanguage, and have all of the patterns it contributes be kind of constant/untouchable. The main reason being that people are probably going to be starting with a legacy TM grammar that they want to slowly replace patterns from.

jeff-hykin commented 5 years ago

maybe Grammar.fromTmLanguage() and it returns a fully initialized grammar object

matter123 commented 5 years ago

Okay, that sounds reasonable, The resultant Grammar should probably have linters/analysis disabled so that Grammar::fromTmLanguage doesn't have to actually convert a tag into a Pattern Objects.

matter123 commented 5 years ago

In theory, the rewrite has

matter123 commented 5 years ago

source/rewrite/example_grammar.rb contains an example import/export:

require_relative 'grammar'

ex = Grammar.new_exportable_grammar

ex.exports = [
    :abc,
]

ex.external_repos = [
    :p123,
]

ex[:abc] = Pattern.new(
    match: /abc/,
    includes: [
        :p123,
        :test_pat,
        :$initial_context,
    ]
)

ex[:test_pat] = /abcd/

g = Grammar.new(
    "test",
    "source.test",
)

ex.export # just to confirm that export is idempotent

g.import(ex.export)

g.debug
jeff-hykin commented 5 years ago

In theory, the rewrite has

  • Grammar::new_exportable_grammar returns a new ExportableGrammar which is suitable for importing
  • Grammar::fromTmLanguage(path) - loads a json file as an ImportableGrammar which has additional safeguards on accessing hashes
  • ExportableGrammar#export preps the ExportableGrammar for exporting (stub)
  • Grammar#import(path_or_export):

    • if path_or_export is an ExportableGrammar import the repository
    • if path_or_export is a path, eval the file and import the ExportableGrammar

Yup, this sounds great 👍

jeff-hykin commented 5 years ago

source/rewrite/example_grammar.rb contains an example import/export:

require_relative 'grammar'

ex = Grammar.new_exportable_grammar

ex.exports = [
    :abc,
]

ex.external_repos = [
    :p123,
]

ex[:abc] = Pattern.new(
    match: /abc/,
    includes: [
        :p123,
        :test_pat,
        :$initial_context,
    ]
)

ex[:test_pat] = /abcd/

g = Grammar.new(
    "test",
    "source.test",
)

ex.export # just to confirm that export is idempotent

g.import(ex.export)

g.debug

Thanks for full hashing this out. I think this is pretty ideal. The only thing I might change is the ex.export # just to confirm that export is idempotent. A file might want to dynamically contribute values, like the generateBlockFinder. This should possible because they'll still have access to the exportable grammar object. We can give the exportable grammar access to the real grammar and let it inject repos dynamically.

With that being the case, I think we can have a class-variable Grammar@@exportable_grammar_stack. Whenever a exportable grammar is constructed, it places itself on that stack. When using Grammar#import(path) we can see how many exportable grammars are created by a particular file, and then warn them if there is no exportable grammar, or if there are multiple (ideally there would only be one, kind of similar the the Javascript module.exports)

matter123 commented 5 years ago

Can you provide an example of what dynamically contributed values might look like?

matter123 commented 5 years ago

Would ExportGrammar#[] where a symbol in external_repos is provided, return a PlaceholderPattern that during import is resolved to the actual pattern work?

For that matter should Grammar#[] just return a placeholder pattern whenever the key doesn't exist?

jeff-hykin commented 5 years ago
require_relative 'grammar'

ex = Grammar.new_exportable_grammar

ex.exports = [
    :functions
]
ex.external_repos = [
    :numbers,
]

# start as empty context
ex[:functions] = []

def generateFunctionFor(name)
    # range version
    ex[name.to_sym] = PatternRange.new(
        start_pattern: /#{name}\(/,
        end_pattern: /\)/,
        includes: [
            :numbers
        ]
    )
    # inline version
    ex[(name+"_inline").to_sym] = Pattern.new(/#{name}\(.+\)/)
    # add them to the function context that is exported
    ex[:functions] = [
        *ex[:functions],
        (name.to_sym),
        (name+"_inline").to_sym
    ]
end

g = Grammar.new(
    "test",
    "source.test",
)
g.import(ex)
matter123 commented 5 years ago

To clarify the main generate file calls generateFunctionFor() right?

After ex is imported, g could assign itself as a parent Grammar, and at that point [] and []= call the parent grammars functions.

Alternatively, ex.reexport could be added which basically does g.import(ex) again.

jeff-hykin commented 5 years ago

Yeah, all that sounds correct.

I think the way to do it will be for the exportable grammar's []= to assign something like a PlaceholderPattern (maybe ImportedPattern) to the main grammar. It would be the same as a regular pattern, but it would contain a back-reference to the exportable grammar (so that it has access to the ex.exports and ex.external_repos).

Repo names ex[:repo_name] that are not in ex.exports or ex.external_repos, should be converted on the spot to their hashed/no-conflict versions ex[:repo_name]= becomes real_grammar[:some_hash]=.

However, the Pattern.includes Pattern.new(match: /thing/, includes:[ :repo_name ]) will need to be converted to Pattern.new(match: /thing/, includes:[ :some_hash ]) once the real grammar is being exported to json. This is because things can get dynamically inserted into the includes:. We can probably have ImportedPattern inherit from Pattern, then override it's to_tag to perform all of the replacements that need to be done on the includes.

matter123 commented 5 years ago

Currently export uses Pattern#transform_includes to fixup its includes (see fixupValue on line 147 of grammar.rb)

jeff-hykin commented 5 years ago

👍 perfect, then the override can be done there. I think the full-grammar tests (making sure all repos exist) are the only things that would depend on the transforms.

matter123 commented 5 years ago

To summarize then:

Thoughts on Grammar#[] returning a PlaceholderPattern rather than nil?

jeff-hykin commented 5 years ago

if the export grammar wants to inject dynamic patterns it should call ex.re_export

Actually I was thinking by default the exportable grammar would be dynamic. Something like this:

class Grammar
    @@exported_grammar_stack

    def import(filepath)
        before_size = @@exported_grammar_stack.size 
        require filepath
        after_size = @@exported_grammar_stack.size
        if before_size == after_size + 1
            exported_grammar = @@exported_grammar_stack.pop()
            # give the exported grammar access to the real grammar
            exported_grammar.attach_to_grammar(self)
        else
            raise "error: didn't export a grammar or exported more than 1 grammar in #{filepath}"
        end

    end
end

class ExportGrammar
    @real_grammar = nil
    @repo = {}

    def initilize
        # tell all grammars that an exportable grammar was just made
        Grammar.exported_grammar_stack.push(self)
    end

    def [](key)
        key = self.hash_key_if_needed(key)
        return @repo[key]
    end

    def []=(key, value)
        pattern = ImportedPattern.new(value)
        key = self.hash_key_if_needed(key)
        @repo[key] = pattern
        if @real_grammar != nil
            @real_grammar[key] = pattern
        end
    end

    def attach_to_grammar(real_grammar)
        @real_grammar = real_grammar
        # copy over all current stuff
        for each_key, each_value in @repo
            @real_grammar[each_key] = each_value
        end
    end
end
matter123 commented 5 years ago

Slight issue, If the generate.rb requires a file that exports a grammar first when import is called there is no change to the @@exported_grammar_stack, load would work though, right?

matter123 commented 5 years ago

The initialize for ExportGrammar could register with a key to its filename, that would allow for nested requires and still allow for an error with an unexpected number of grammars.

jeff-hykin commented 5 years ago

Slight issue, If the generate.rb requires a file that exports a grammar first when import is called there is no change to the @@exported_grammar_stack, load would work though, right?

load would work. We could also change it from a stack to a Hash and then we could make sure the same file never gets imported twice