Add drracket:comment-matches?

Although I'm not 100% sure yet, I think there's a case to be made for adding a new drracket:comment-matches, to complement drracket:paren-matches and drracket:quote-matches.

Although possibly too "Emacs-centric", I'd suggest it be like quote-matches except allow the strings to be 1 or 2 characters long. That is, it could handle comment pairs like #| and |#, as well as ; and \n. (Or /* and */ as well as // and \n. Or whatever.)

[I would suggest it not handle s-expression comments, because they are already a special case, handled specially by the new lexer hash-table attribs. i.e. Things you want to lex/indent/nav normally, but perhaps only display as comments. That's another kettle of fish.]

Why? Maybe it wouldn't be used/needed by DrRacket. But I think it could be useful in Emacs. Especially in the case where a lang defaults to s-expression indent and navigation --- all it really provides is a syntax-colorer, which is possibly different than the default.

In that case, Emacs would be using only the lang's syntax colorer. Although Emacs must wait for tokenization to know where comments are, it needn't wait (for s-expr langs) to know where quotes and parens are, given paren-matches and quote-matches.
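To make that concrete, here is a minimal Emacs Lisp sketch of deriving a char syntax table from paren and quote delimiters, assuming they arrive as single characters. The function names and the argument shapes are hypothetical (they are not what racket-mode does today, nor the exact shape a lang returns for `drracket:paren-matches`); only `make-syntax-table`, `modify-syntax-entry` and `scan-sexps` are real Emacs functions.

```elisp
(defun my-syntax-table-from-matches (parens quotes)
  "Build a syntax table from PARENS and QUOTES.
PARENS is a list of (OPEN . CLOSE) character pairs and QUOTES is a
list of string-delimiter characters.  Both shapes are assumptions
made for this sketch."
  (let ((st (make-syntax-table)))
    (dolist (pair parens)
      (let ((open (car pair))
            (close (cdr pair)))
        ;; Open-paren class plus its matching close char, and vice versa.
        (modify-syntax-entry open (string ?\( close) st)
        (modify-syntax-entry close (string ?\) open) st)))
    (dolist (q quotes)
      ;; String-quote class: the same char opens and closes a string.
      (modify-syntax-entry q "\"" st))
    st))

(defun my-end-of-first-sexp (text parens quotes)
  "Return the buffer position just after the first expression in TEXT.
No tokenizer is involved; only the syntax table is consulted."
  (with-temp-buffer
    (insert text)
    (set-syntax-table (my-syntax-table-from-matches parens quotes))
    (scan-sexps (point-min) 1)))
```

With a table like this in place, expression navigation can work on a freshly opened buffer before any tokens have come back from the lang's colorer.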

If there were also a new comment-matches, then the default Emacs "char-syntax table" mechanism could be used for all three -- quotes, parens, comments. This would enable some possible optimizations like having navigation, indent, and a basic level of font-lock (coloring) available "instantly", as well as using the normal lazy font-lock mechanism.

But I'm not 100% sure. I mean, when a lang supplies only a colorer, and is otherwise s-expressions? Someone might prefer to keep using normal racket-mode, not a special racket-hash-lang-mode. So I'm not entirely sure about the UX scenarios, yet.

p.s. A lang like rhombus that supplies everything, even drracket:grouping-positions, is in some way the easiest case: Assume nothing, use everything.

What motivates this issue is me thinking about langs that don't supply everything, and what's the most satisfactory experience there.

Is the idea that a language would say what the opening and closing characters are for nestable comments? If so, what would the IDE do with this information?

I'm asking partly because when I made drracket:quote-matches, I knew that I wanted it only for specific keybindings in DrRacket (that were wrong in rhombus) so that's why it is character based. I figured that if there was something interesting to do with multiple characters, then we could add strings to it and have characters just work the same as strings of length one with that character inside. So it isn't any inherent limitation of strings or same-open/close delimiters that made me pick characters; it was because I was going to bind those characters as keystrokes!

> Is the idea that a language would say what the opening and closing characters are for nestable comments?

Yes.

> If so, what would the IDE do with this information?

Possibly nothing. Possibly use it to enable recognition of comments by low-level classification mechanisms.

For example Emacs has "syntax tables", which are ~= hash-tables from characters to "syntax" types such as "I {start end} a {comment string}". These are very fast and enable a basic level of navigation and coloring, for comments and strings, to "just work".
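A minimal sketch of what that looks like in practice, assuming Racket-style `;` line comments and nestable `#| ... |#` block comments. The names prefixed `my-` are hypothetical; the syntax descriptors follow the pattern I recall from Emacs's own Scheme mode, so treat the exact flag strings as an assumption to verify.

```elisp
(defvar my-comment-demo-syntax-table
  (let ((st (make-syntax-table)))
    (modify-syntax-entry ?\; "<" st)        ; `;' starts a line comment
    (modify-syntax-entry ?\n ">" st)        ; newline ends it
    ;; `#' is the 1st char of "#|" and the 2nd char of "|#".
    (modify-syntax-entry ?# "' 14" st)
    ;; `|' is the 2nd char of "#|" and the 1st char of "|#";
    ;; `b' selects the alternative comment style, `n' makes it nestable.
    (modify-syntax-entry ?\| "\" 23bn" st)
    (modify-syntax-entry ?\" "\"" st)
    st)
  "Hypothetical syntax table that knows two kinds of comment delimiters.")

(defun my-comment-demo-in-comment-p (text pos)
  "Return t if buffer position POS (1-based) in TEXT is inside a comment.
Only the syntax table is consulted; no lexer runs."
  (with-temp-buffer
    (insert text)
    (set-syntax-table my-comment-demo-syntax-table)
    ;; Element 4 of the parse state is non-nil inside a comment
    ;; (a number for nestable comments), so normalize it to t or nil.
    (and (nth 4 (syntax-ppss pos)) t)))
```

The point of a comment-matches function would be to let a host fill in entries like these from data the lang supplies, rather than hard-coding them per mode.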

Now, Emacs also has the concept of a customizable "forward/backward expression function", which in theory could be backed by drracket:grouping-position. However:

1. Some Emacs things don't use it; they look directly at the low-level character syntax. They are probably wrong to do so but they exist and they are popular.
2. Many langs -- in fact all but rhombus, so far -- don't even supply drracket:grouping-position.

So for the broadest compatibility with low-level features and third-party packages, it would help if a lang can supply a notion of comment start and end characters.

(If some lang has some exotic notion of comments, that can't be expressed that way? It needn't supply this. Some things in Emacs won't work as well. The thing that can work via tokenization alone, will.)

Like I mentioned above, in some sense a #lang rhombus that supplies everything is the simplest case -- use all of it. The sometimes trickier cases are things like #lang scribble that supply indent-line and coloring, but nothing else. That's part of the motivation here.

I don't know if tossing a blob of Emacs Lisp at you will be helpful :smile: but the comment for the comment case below is also part of what I'm thinking about.

(defun racket--hash-lang-on-new-token (token)
  (with-silent-modifications
    (cl-flet ((put-face (beg end face) (put-text-property beg end 'face face))
              (put-stx  (beg end stx ) (put-text-property beg end 'syntax-table stx)))
      (pcase-let ((`(,beg ,end ,kinds) token))
        (racket--hash-lang-remove-text-properties beg end)
        ;; 'racket-token is just informational for me for debugging
        (put-text-property beg end 'racket-token kinds)
        (dolist (kind kinds)
          (pcase kind
            ('parenthesis
             (put-face beg end 'parenthesis))
            ('comment
             ;; This is super important because, unlike paren-matches or
             ;; quote-matches, where we can set up char syntax-table
             ;; entries (at least for single-character paren-matches),
             ;; there is nothing like a "drracket:comment-matches". As a
             ;; result, until the lexer applies these, various Emacs
             ;; commands/modes won't know about comments. This also is a
             ;; roadblock to changing what we're doing here to be just
             ;; lazy font-lock instead of eagerly putting face props on
             ;; the whole buffer.
             (put-stx beg (1+ beg) '(11)) ;comment-start
             (put-stx (1- end) end '(12)) ;comment-end
             (let ((beg (1+ beg))         ;comment _contents_ if any
                   (end (1- end)))
               (when (< beg end)
                 (put-stx beg end '(14)))) ;generic comment
             (put-face beg end 'font-lock-comment-face))
            ('sexp-comment
             ;; This is just the "#;" prefix not the following sexp.
             (put-stx beg end '(14))    ;generic comment
             (put-face beg end 'font-lock-comment-face))
            ('sexp-comment-body
             (put-face beg end 'font-lock-comment-face))
            ('string
             (put-face beg end 'font-lock-string-face))
            ('text
             (put-stx beg end (standard-syntax-table)))
            ('constant
             (put-stx beg end '(2))     ;word
             (put-face beg end 'font-lock-constant-face))
            ('error
             (put-face beg end 'error))
            ('symbol
             (put-stx beg end '(3))     ;symbol
             ;; TODO: Consider using default font here, because with
             ;; e.g. racket-lexer almost everything is "symbol", since
             ;; it is an identifier. Meanwhile, using a non-default
             ;; face here is helping me spot bugs.
             (put-face beg end 'font-lock-variable-name-face))
            ('keyword
             (put-stx beg end '(2))     ;word
             (put-face beg end 'font-lock-keyword-face))
            ('hash-colon-keyword
             (put-stx beg end '(2))     ;word
             (put-face beg end 'racket-keyword-argument-face))
            ('white-space
             ;;(put-stx beg end '(0))
             nil)
            ('other
             ;;(put-stx beg end (standard-syntax-table))
             nil)))))))

> > Is the idea that a language would say what the opening and closing characters are for nestable comments?
>
> Yes.
>
> > If so, what would the IDE do with this information?
>
> Possibly nothing. Possibly use it to enable recognition of comments by low-level classification mechanisms.

I'm not sure how that would work, or are you saying it would do it in a way that was possibly incorrect (if, eg, the open comment characters were in a string or a symbol)?

> For example Emacs has "syntax tables", which are ~= hash-tables from characters to "syntax" types such as "I {start end} a {comment string}". These are very fast and enable a basic level of navigation and coloring, for comments and strings, to "just work".
>
> Now, Emacs also has the concept of a customizable "forward/backward expression function", which in theory could be backed by drracket:grouping-position. However:
>
> 1. Some Emacs things don't use it; they look directly at the low-level character syntax. They are probably wrong to do so but they exist and they are popular.
> 2. Many langs -- in fact all but rhombus, so far -- don't even supply `drracket:grouping-position`.
>
> So for the broadest compatibility with low-level features and third-party packages, it would help if a lang can supply a notion of comment start and end characters.
>
> (If some lang has some exotic notion of comments, that can't be expressed that way? It needn't supply this. Some things in Emacs won't work as well. The thing that can work via tokenization alone, will.)
>
> Like I mentioned above, in some sense a #lang rhombus that supplies everything is the simplest case -- use all of it. The sometimes trickier cases are things like #lang scribble that supply indent-line and coloring, but nothing else. That's part of the motivation here.

I think the idea is that, if it isn't supplied, then they get the defaults (i.e. the setting for #lang racket). Maybe this isn't clear enough in the docs, tho.

> I'm not sure how that would work, or are you saying it would do it in a way that was possibly incorrect (if, eg, the open comment characters were in a string or a symbol)?

I'm not sure how that would not work? Strings and comments are mutually exclusive; quote-matches and comment-matches describe delimiters that begin and end a string or comment state.

[The sort of low-level parsing I'm referring to, in the case of Emacs, is described here. I haven't researched other editors and IDEs; I'm assuming they can do something similar.]

> I think the idea is that, if it isn't supplied, then they get the defaults (i.e. the setting for #lang racket). Maybe this isn't clear enough in the docs, tho.

That seems inconsistent with drracket:paren-matches and drracket:quote-matches.

If a lang doesn't supply one of those, for parens or for strings, it's a clear/specific signal "use default s-expression delimiters" for parens or for strings.

Otherwise, a lang can specify its delimiters for parens or strings. A lang can do so without necessarily needing to supply a whole drracket:grouping-position implementation, and, it can tell the host specifically what those delimiters are.

That's the status quo for parens and strings.

Comments are similar to strings (we just discussed how they're directly mutually exclusive). Why not have a similar, drracket:comment-matches function providing similar benefits to lang writers and to hosts?

> > I'm not sure how that would work, or are you saying it would do it in a way that was possibly incorrect (if, eg, the open comment characters were in a string or a symbol)?
>
> I'm not sure how that would not work? Strings and comments are mutually exclusive; quote-matches and comment-matches describe delimiters that begin and end a string or comment state.

It sounds like you're saying something to the effect of "if I know the sequence of bytes (or unicode code points or other such low-level thing but I'll stick with bytes for simplicity) for each variant on open paren, close paren, start a string, end a string, start comment, and end comment, then I can scan forward for those in the buffer ignoring the rest of the characters and reliably know where the strings are and where the parens are". I do not think that this property is true in general. I think that, in general, one has to actually parse the program to know whether a particular sequence of bytes is actually to be treated as an open parenthesis or if it is inside a string or something else that might make it not actually be a paren.

Taking racket's reader as an example, here's a valid racket program that doesn't work with this kind of plan (and what I see when I preview this message is incorrect coloring on github's part):

#lang racket
(define \; 7)
(define |"| 6)
(* |;| \")
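To see the failure mode concretely, here is a deliberately naive Emacs Lisp scan that treats every `;` as a comment start. It is a hypothetical helper written only for illustration, not something DrRacket or racket-mode does. On `(define \; 7)` and `(* |;| \")` it reports a comment start where Racket's reader sees part of an identifier.

```elisp
(defun my-naive-comment-starts (text)
  "Return the 0-based indices of every semicolon in TEXT.
Deliberately naive: escapes, strings and vertical-bar quoting are
all ignored, which is exactly the problem being illustrated."
  (let ((start 0)
        (hits '()))
    (while (string-match ";" text start)
      (push (match-beginning 0) hits)
      (setq start (match-end 0)))
    (nreverse hits)))

;; Both calls return a non-empty list, although neither line of the
;; Racket program contains a comment.
(my-naive-comment-starts "(define \\; 7)")
(my-naive-comment-starts "(* |;| \\\")")
```

Any scheme that classifies delimiters without tracking the surrounding reader state has to decide how much of that state (escapes, bar-quoted symbols, strings) it is willing to model.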

I don't know TeX well enough to come up with an example for it but I believe it has the property that you have to actually process the definitions you've seen before in order to know how to parse the things that come afterwards.

The C programming language also has this problem with types. Depending on the type declarations that have come before, what comes next might parse multiple different ways. Granted, this probably doesn't affect comments or parentheses. But I just don't know these languages as well as I know Racket's reader, so it wouldn't surprise me if there were weird gotchas that mean various reasonable properties about how to short-circuit parsing would break in these languages.

Stepping back, the approach we've taken with DrRacket is that you cannot really know anything without asking the language to do an actual parse (we even had to add the "backup delta" to correctly handle scribble coloring and do the optimization we were talking about earlier where a tokenization can "catch up" to a previous one, just offset by the amount of text that was inserted). Sadly, we've not restricted how #lang based language parsers work to only languages that have such nice properties.

> [The sort of low-level parsing I'm referring to, in the case of Emacs, is described here. I haven't researched other editors and IDEs; I'm assuming they can do something similar.]

> > I think the idea is that, if it isn't supplied, then they get the defaults (i.e. the setting for #lang racket). Maybe this isn't clear enough in the docs, tho.
>
> That seems inconsistent with drracket:paren-matches and drracket:quote-matches.
>
> If a lang doesn't supply one of those, for parens or for strings, it's a clear/specific signal "use default s-expression delimiters" for parens or for strings.
>
> Otherwise, a lang can specify its delimiters for parens or strings. A lang can do so without necessarily needing to supply a whole drracket:grouping-position implementation, and, it can tell the host specifically what those delimiters are.
>
> That's the status quo for parens and strings.

I think we might be saying the same thing. I'm saying that if a language does not supply drracket:paren-matches and drracket:quote-matches then those languages should be the same as if specific things were in fact supplied. These specific things are what I was calling the defaults. And yes, they are what makes sense for Racket's reader. Is that what you're saying?

I do see the documentation doesn't explicitly say that, but I would be happy to add it. Maybe we should even have a setup/API where the code that asks for the value of drracket:paren-matches doesn't get back the answer "there is no setting" but always gets back what the default is when there isn't actually an answer registered. That is, we can put the defaults into only one place in the code (instead of putting them in the docs and then multiple places).

I was exploring making it more like an automatic cross-fade of techniques... based on what the lang creator wanted to supply or not... based on how "fancy" (or non-s-expression-y) they wanted their lang to be.

But I'm over budget. Enough.

I'll have racket-hash-lang-mode work strictly in terms of tokenization.

If some people find it perfect but too slow, they can always disable racket-hash-lang-mode and use just traditional racket-mode, for s-expression langs. A lang creator can even make some bespoke Emacs mode that does perform well for their lang.

racket / drracket

Add drracket:comment-matches? #534