jeff-hykin / fsharp-injections-demo

Discussion #1

Open jeff-hykin opened 2 years ago

jeff-hykin commented 2 years ago

Based on

But I've been stopped by weird behavior 

I think the ruby generator wasn't explained very well because it addresses a large part of exactly what you're bringing up.

It detects (static analysis) and produces warnings that explain why certain patterns won't behave how you expect. For example, if you have a pattern that uses a quantifier on a capture group and then you try to tag the capture group, the library catches that, tells you only the first match is going to be highlighted, and explains how to fix it. Another example is if you have a pattern that accidentally matches a zero length string: it catches that, tells you what's going on and which pattern is the issue, and allows you to override the warning if you decide "yes I actually meant to do that".
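As a rough illustration of the two checks described above, here is a naive sketch in plain Ruby. It is not the library's implementation or API: `lint_pattern` and `QUANTIFIED_GROUP` are hypothetical names, and the group check is a simple text heuristic that ignores nested and named groups.

```ruby
# Heuristic: an unnested capturing group "(...)" directly followed by a
# repeating quantifier (*, + or {n...}). Groups starting with "(?" are
# skipped, so named capture groups are (naively) not detected.
QUANTIFIED_GROUP = /\((?!\?)(?:[^()\\]|\\.)*\)(?:[*+]|\{\d)/

# Hypothetical helper: returns a list of warnings for a regex source string.
def lint_pattern(source)
  warnings = []
  # A pattern that matches the empty string can produce a zero-length match.
  if Regexp.new(source).match?("")
    warnings << "pattern can match a zero-length string"
  end
  if source.match?(QUANTIFIED_GROUP)
    warnings << "capture group is quantified"
  end
  warnings
end

p lint_pattern('\s*')        # zero-length warning only
p lint_pattern('(\w+,)+')    # quantified-group warning only
p lint_pattern('(?:\w+,)+')  # no warnings
```

A real linter would need to parse the regex rather than scan its text, but even this much shows why the check is easy to host in a general-purpose language.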

The yaml version isn't even close to addressing the very problems being complained about, but the ruby library is.

> As I mentioned there are a few tricks to do it much more succinctly than either JSON or Ruby with plain YAML

I think there's a misunderstanding here. There are plenty of decent arguments against using the ruby version, and I love yaml (it's my favorite serialization format by far). But, it is computationally provable that the ruby format can be more succinct than any of the static format methods. The grammar would essentially need to be incompressible (a binary file with maximum entropy) for a static file to be just as succinct. Here's a demo showing how much more succinct this F# injection could be (there are only 2 files in that branch and one of them is the tmLanguage itself). This is partly because the library was made to compress grammars and partly because Ruby has some nice syntax. Having the extra power of yaml-variables helps, but don't forget the regex is still being double -- and sometimes triple -- escaped. More importantly, the yaml version must explicitly mention capture group numbers, and it can't avoid duplication of tags because at minimum the variables representing the tag names have to be copied and pasted every time the pattern is used in multiple places. The ruby code maps 1-to-1 to the tmLanguage format, but at a much higher compression level because it automates repetition; for all practical purposes it can always be more succinct.

Just look at the example with the F# injection, there is tons of code duplication. I left the duplication in the master branch to make the 1-to-1 mapping more clear. Despite the example syntax being very small, the ruby version can be half the size with better readability for people who don't have 15 years of experience with regex.

texastoland commented 2 years ago

> it addresses a large part of exactly what you're bringing up.

I really respect your work. For me it doesn't. In no way does that diminish your project. But I'll explain my perspective since you shared yours.

> It detects (static analysis) and produces warnings

I agree that's useful. Better as a linter though. Of course it's simpler to implement as a Ruby DSL. But that also introduces the complexities and verbosity of an entire PL as opposed to a constrained DSL like Iro or the default YAML config. Regardless, nothing short of a really clever transpiler could resolve limitations like tokenizing func arg identically as func\narg.

> Another example is if you have a pattern that accidentally matches a zero length string

Did I mention I hate TextMate grammars? That specific case effectively prevents exhaustive parsing. What's worse is no one including Microsoft (or you or I) has documented it.

> The yaml version isn't even close to addressing the very problems being complained about

Linting would require an extension; duplication is already partially addressed. You alluded that YAML provides the ability to duplicate or even merge blocks although you can't modify existing keys. On the other hand TextMate itself provides the ability to embed scope names from match groups acting like CSS for grammars:

scopeName: inline.template-fsharp-highlight
injectionSelector: "L:- comment - (string - meta.embedded)"
patterns:
  - include: "#template"
repository:
  template:
    contentName: string.quoted.triple.fsharp.template.${1:/downcase}
    begin: |
      (?xi)     # Ignore whitespace and case
      (?<=      # After valid lang ($1 above) with any prefix
        (html|svg|sql)
      )
      \ *       # Arbitrary spaces
      (?=\$""") # Before interpolated string
    end: ^      # Before next line text
    patterns:
      - include: "#template_string"
  template_string:
    contentName: meta.embedded.block
    begin: \$"""
    end: |-
      """
    beginCaptures:
      0: { name: punctuation.definition.string.begin.fsharp }
    endCaptures:
      0: { name: punctuation.definition.string.end.fsharp }

Now it's trivial to embed HTML modularly for example:

scopeName: inline.template-fsharp-highlight.html
injectionSelector: "L:string.quoted.triple.fsharp.template.html meta.embedded.block"
patterns: [{ include: text.html.derivative }]  # patterns must be a list of rules
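
To make the `${1:/downcase}` substitution concrete, here is a small Ruby emulation of how a scope-name template could be expanded from a match. It is for illustration only and is not TextMate's or VS Code's actual code: `expand_scope` and `SCOPE_REF` are hypothetical names, only the `$N` and `${N:/downcase}` / `${N:/upcase}` forms are assumed, and the lookbehind from the grammar above is simplified to a plain capture group.

```ruby
# Matches "$1" style references and "${1:/downcase}" style references.
SCOPE_REF = /\$(\d+)|\$\{(\d+):\/(downcase|upcase)\}/

# Hypothetical helper: fills capture references in a scope name template.
def expand_scope(template, match)
  template.gsub(SCOPE_REF) do
    ref = Regexp.last_match
    text = match[(ref[1] || ref[2]).to_i].to_s
    case ref[3]
    when "downcase" then text.downcase
    when "upcase"   then text.upcase
    else text
    end
  end
end

# Case-insensitive match, so the captured language tag may be upper case.
begin_match = /(html|svg|sql)/i.match('HTML $"""')
puts expand_scope('string.quoted.triple.fsharp.template.${1:/downcase}', begin_match)
# prints: string.quoted.triple.fsharp.template.html
```

The resulting scope is what the second grammar's `injectionSelector` targets, which is how one generic rule fans out into per-language injections.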

> the ruby format can be more succinct than any of the static format methods.

Compared to YAML yes; more than Iro no.

> Here's a demo showing how much more succinct this F# injection could be

There's subjectively more syntactic clutter and mental overhead than my version.

> don't forget the regex is still being double -- and sometimes triple -- escaped.

Note: only regex escaping is necessary thanks to YAML block scalars.

> Just look at the example with the F# injection, there is tons of code duplication.

I'm not the original author of that version. I don't disagree tmLanguage in any format is fundamentally flawed. For my narrow use case a Ruby dependency would make it too abstruse though. I'd happily use Iro in the future. But in general I find the existing Tree-sitter extension the happiest compromise rather than supporting a technology I want to be deprecated (3 freaking years since Atom did it).

> better readability for people who don't have 15 years of experience with regex.

You can't completely abstract regex either and it clearly isn't the hard part writing TextMate grammars. I'm not trying to be contentious. I was impressed when I saw your Bash grammar 1 year ago. The Reason language originally did the same thing except in OCaml. I just prefer a different level of abstraction.

jeff-hykin commented 2 years ago

> But I'll explain my perspective since you shared yours.

Thank you 😀



> There's subjectively more syntactic clutter and mental overhead than my version.

I totally agree. Runtimes are more complicated than static files like yaml, and sometimes static files are a better fit. I could write several articles on the downsides and shortcomings of the existing ruby library.



> Linting would require an extension; duplication is already partially addressed. You alluded that YAML provides the ability to duplicate or even merge blocks although you can't modify existing keys. On the other hand TextMate itself provides the ability to embed scope names from match groups acting like CSS for grammars:

This sounds interesting, but I'm not sure that I understand the point being made. Is it pointing out the ${1:/downcase} inside of string.quoted.triple.fsharp.template.${1:/downcase}? That downcase feature is news to me! However, the ruby library already handles the .${1} case and allows using names instead of numbers.



> You can't completely abstract regex

Really? I'll propose a challenge: provide a single regex expression that I cannot abstract into a programmatic representation and I'll fully concede this point. Recursion, branch resets, conditionals, anything that oniguruma can handle is on the table.



> rather than supporting a technology I want to be deprecated

This is true, I do feel bad about that part. I've been working with the Atom community to get it on par with VS Code so I can ditch VS Code entirely.



> Note: only regex escaping is necessary thanks to YAML block scalars.

Sadly no, and here's a quick example: 5 years ago I was using Sublime, editing one of the grammars (which were yaml files). I had no idea what yaml was, or what textmate was, but I knew regex (at least python regex, not oniguruma regex). I was trying to get a pattern to match a space at the start. But, no matter how many spaces I added, it would fail. And adding + (space plus) to the beginning would give an error. It took me a week to figure out I needed to quote everything (and double escape everything) in order to get it to match a space at the beginning.
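
The leading-space trap can be reproduced with any spec-conforming YAML parser. Below is a minimal sketch using Ruby's bundled `yaml` library (Psych) as a stand-in for whatever parser the editor used; the `match` key and the `foo` pattern are placeholders, not taken from a real grammar.

```ruby
require "yaml"

# Plain scalar: leading whitespace is not part of the value.
plain = YAML.safe_load("match:    foo")

# Quoted scalar: leading whitespace is preserved.
quoted = YAML.safe_load("match: '   foo'")

# Block scalar: the leading spaces of the first line are treated as
# indentation, so they are stripped as well.
block = YAML.safe_load("match: |-\n     foo\n")

p plain["match"]   # "foo"
p quoted["match"]  # "   foo"
p block["match"]   # "foo"
```

Nothing in the first or third case signals an error, which is what makes the problem hard to diagnose without already knowing YAML's scalar rules.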

Seems rare? Well, think again: because of Textmate's awful matching priority, in complex languages like C/C++, often more than half of all patterns need to start with matching spaces. Depending on the language, even matching template strings can need to match leading spaces.

On top of this, the current yaml generators look for double curly brackets.

So, combining the two, if you try to match something like +\{{2,4} at the beginning of a one-line block scalar (or normal scalar), you'll hit all three of the escape levels with a mere 10 chars. And I'm not even sure myself how to escape it since I'm not familiar with how the generator wants the curly brackets escaped.

Now, Yaml IMO has the best escaping of any language (static or otherwise). However, it is nonetheless escaping, and its edge cases are very sharp, requiring in-depth knowledge of the yaml specification to handle them elegantly. If someone is unsure about textmate (basically everyone), is unsure about oniguruma regex, and isn't an expert on yaml, they will likely have no idea which of those things is the cause of their problem. They'll basically have to become an expert on each, or do an absurd amount of trial-and-error to figure out how to solve it.

Ruby is uniquely fitted to solve this because 1. Oniguruma/Textmate regex is based on Ruby regex, so googling ruby regex will actually answer the question at hand and has lots of results (which include examples showing regex escaping in ruby). 2. Ruby has a proper regex literal, so there is only ever a single level of escaping and there are no sharp edge cases like leading spaces or double curly braces.
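
For comparison, here is the same kind of pattern written as a Ruby regex literal, as a minimal sketch. The variable name `leading` and the sample strings are illustrative only; the point is that the only escaping present is the regex's own.

```ruby
# One or more spaces followed by two to four literal "{" characters.
# The leading space and the braces need no extra quoting layer.
leading = / +\{{2,4}/

puts leading.source             # prints the pattern exactly as typed
p leading.match?("  {{{ x")     # true: spaces, then three braces
p leading.match?(" { x")        # false: only one brace
```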



> > the ruby format can be more succinct than any of the static format methods.

> Compared to YAML yes; more than Iro no.

We can always just add another method to the ruby library to make it more succinct. In Iro you're left with what iro provides you, but with any full language, the world is your oyster because you can extend it. No matter what iro can do, you can write a helper that makes the grammar definition smaller than it would be in iro.
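
A sketch of what such a helper could look like in plain Ruby, independent of the existing library: `embedded_grammar` and `LANGUAGES` are hypothetical names, the scope and selector strings follow the injection example earlier in the thread, and `source.sql` is an assumed scope name for SQL.

```ruby
require "json"

# Hypothetical helper: builds one injection grammar per embedded language,
# so the shared structure is written once instead of once per language.
def embedded_grammar(lang, include_scope)
  {
    "scopeName" => "inline.template-fsharp-highlight.#{lang}",
    "injectionSelector" =>
      "L:string.quoted.triple.fsharp.template.#{lang} meta.embedded.block",
    "patterns" => [{ "include" => include_scope }],
  }
end

# Language tag => grammar scope to include ("source.sql" is an assumption).
LANGUAGES = {
  "html" => "text.html.derivative",
  "sql"  => "source.sql",
}

grammars = LANGUAGES.map { |lang, scope| embedded_grammar(lang, scope) }
puts JSON.pretty_generate(grammars)
```

Adding a language is then a one-line change to `LANGUAGES`, whereas a static file needs another copy of the whole block.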



> nothing short of a really clever transpiler could resolve limitations like tokenizing func arg identically as func\narg

It's my understanding that no transpiler can resolve the \n problem. Ex: int main\n = 10; vs int main\n(){ return 10}; there's no way to tag main as a variable in the first case and a function in the second case using a pure textmate parser, because the engine matches one line at a time and a pattern on the first line can't look ahead into the next.
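
The limitation can be imitated in a few lines of Ruby by running the same pattern once over the whole source and once per line, the way a TextMate tokenizer sees its input. This is a simplified model, not a tokenizer: `FUNCTION_NAME` and `function_names` are hypothetical names and the rule is deliberately naive.

```ruby
# Naive "function name" rule: an identifier followed by "(",
# with optional whitespace (including newlines) in between.
FUNCTION_NAME = /\b\w+(?=\s*\()/

# Hypothetical helper: applies the rule to the whole text or line by line.
def function_names(source, line_by_line:)
  chunks = line_by_line ? source.lines : [source]
  chunks.flat_map { |chunk| chunk.scan(FUNCTION_NAME) }
end

source = "int main\n(){ return 10 };"

p function_names(source, line_by_line: false)  # ["main"]
p function_names(source, line_by_line: true)   # []
```

With the whole text available, the lookahead crosses the newline and finds the parenthesis. Per line, the first line ends before the parenthesis appears, so no generated regex can tell the two cases apart.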



> it clearly isn't the hard part writing TextMate grammars

Maybe there's some other evidence you have. The evidence I have is that I'm a TA at a university and advisor for a large programming team in a club. For all my students that are interested in making grammars, the regex is very much the hard part. They regularly, and almost entirely, struggle with regex, especially reading/understanding existing regex.