Rule name identifiers maybe shouldn't have source locations

jackfirth commented 2 years ago

So given a grammar like this:

#lang brag

program: statement*
statement: LITERAL-INTEGER

And an appropriate lexer-based tokenizer, using (parse path (make-tokenizer port)) produces syntax objects that look like this:

(program (statement 42) (statement 58) (statement 92))

All well and good. The source locations are even correct, assuming the lexer uses lexer-srcloc. Specifically, the following syntax objects have source locations:

- The whole (program ...) syntax object has a source location
- Each (statement ...) syntax object has a source location
- Each number literal syntax object has a source location
- Each occurrence of the program and statement identifiers has a source location

That last part seems off to me. The program identifier gets the same source location as the surrounding (program ...) syntax object. But the identifier itself is more of an implicitly-inserted thing from the user's perspective, like #%app or #%datum.

This matters to me because my resyntax tool uses the source locations of original syntax objects to decide how to copy their original source text into the refactored output. If one of those program or statement identifiers ends up in the tool's output syntax object, perhaps because the tool was rearranging pieces of the enclosing (program ...) expression, the tool duplicates the whole original expression when it tries to render that identifier in the refactored source code.

I think the rule name identifiers shouldn't have any source location information. Maybe they shouldn't even be syntax-original?, but I'm less sure about that.
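As a consumer-side illustration of what "no source location on rule names" could look like, here is a minimal sketch that post-processes a parse tree. It is not a change to brag itself. The helper name strip-head-srcloc is hypothetical, and the sketch assumes every list-shaped node starts with a rule name, as in the example output above. Note that rebuilding the identifier this way also drops its properties, so it would no longer be syntax-original?.

```racket
#lang racket/base

;; strip-head-srcloc : syntax? -> syntax?
;; Hypothetical helper: rebuilds each list-shaped form so that its leading
;; identifier keeps its lexical context but has no source location.
;; Assumption: the first element of every list form is a rule name.
(define (strip-head-srcloc stx)
  (define elems (syntax->list stx)) ; #f for atoms and improper lists
  (cond
    [(pair? elems)
     (define head (car elems))
     (define new-head
       (if (identifier? head)
           ;; srcloc argument #f: the new identifier gets no source location
           (datum->syntax head (syntax-e head) #f)
           (strip-head-srcloc head)))
     ;; Reuse the enclosing form's context, source location, and properties.
     (datum->syntax stx
                    (cons new-head (map strip-head-srcloc (cdr elems)))
                    stx
                    stx)]
    [else stx]))

;; Hand-built stand-in for one (statement 42) node; positions are made up.
(define (at datum pos span)
  (datum->syntax #f datum (vector 'example 1 (sub1 pos) pos span)))

(define before (at (list (at 'statement 1 2) (at 42 1 2)) 1 2))
(define after (strip-head-srcloc before))

;; Print the start position of the form, the rule name, and the literal.
(for ([s (in-list (cons after (syntax->list after)))])
  (printf "~s starts at ~s\n" (syntax->datum s) (syntax-position s)))
```

The enclosing form and the literal keep their locations; only the leading identifier loses its location.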

mbutterick commented 2 years ago

> I think the rule name identifiers shouldn't have any source location information

Does this happen with ragg too, or just brag?
If brag handles source locations in a way that’s contrary to the documentation or to syntax-object norms, I welcome supporting evidence. Otherwise I would invoke the existing Racket norm against changing a package's behavior in a backward-incompatible way.

jackfirth commented 2 years ago

Just checked, and yes this happens with ragg too.
Consider this code:
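```scheme
;; racket/port provides with-input-from-string (already available in #lang racket)
(require racket/port syntax/modread)

;; Read a one-line module the way the module reader would.
(with-module-reading-parameterization
  (λ ()
    (with-input-from-string "#lang racket/base 42"
      (λ () (read-syntax)))))
```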

It produces this syntax object:
```scheme
(module anonymous-module racket/base
  (#%module-begin 42))
```

Both the module and anonymous-module identifiers have a span of zero and are not original. The racket/base identifier and the 42 literal each have correct starts and spans, pointing to the racket/base and 42 substrings of #lang racket/base 42, and they're both original. The #%module-begin identifier is an odd one: it's not original, but it does have a source location, the same one as the enclosing (#%module-begin 42) form. Given how the module and anonymous-module identifiers are handled, I suspect that's just a bug.

The whole form has a start position of 7 and a span of 14, pointing to the racket/base 42 substring, and it is not original because it contains the unoriginal module, anonymous-module, and #%module-begin pieces. The (#%module-begin 42) form also isn't original, and it has the same start position and span. I suspect that's another bug: the form claims to represent the racket/base 42 substring of the program, but it doesn't actually contain the racket/base identifier. It should probably only claim to cover the 42 substring.
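To check these observations yourself, something like the following sketch should work. The describe helper is a hypothetical name introduced here; the sketch only prints what the reader reports and makes no claim about the values, since they may differ between Racket versions.

```racket
#lang racket/base
(require racket/port syntax/modread)

;; Read the one-line module as a syntax object, the same way as above.
(define stx
  (with-module-reading-parameterization
    (λ ()
      (with-input-from-string "#lang racket/base 42"
        (λ () (read-syntax))))))

;; describe : syntax? -> list?
;; Hypothetical helper: datum, start position, span, and originality
;; of a single syntax object.
(define (describe s)
  (list (syntax->datum s)
        (syntax-position s)
        (syntax-span s)
        (syntax-original? s)))

;; Assumes the shape (module name lang (#%module-begin body ...)).
(define parts (syntax->list stx))
(define body-form (list-ref parts 3))

;; Report the whole form, its four pieces, and the pieces of the body form.
(for ([s (in-list (append (list stx) parts (syntax->list body-form)))])
  (printf "~s\n" (describe s)))
```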

It's hard to say for sure what the "intent" is here, because source locations are tricky to produce and mistakes in them are rarely noticed. For syntax objects produced by a language's read-syntax function, I think these are good guidelines:

- A syntax object with a source location shouldn't contain syntax objects with source locations that are outside the container syntax object's location.
- Identifiers shouldn't have a source location unless the actual text of that identifier appears in the program's code at that location.
- Syntax objects shouldn't be original if they contain any unoriginal syntax objects.
- If the identifier after the #lang line is used for the module's initial bindings, it should be original and have a source location.
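The first three guidelines can be turned into a rough checker. This is only a sketch: the names within?, text-matches?, and violations are hypothetical, the sample tree is hand-built rather than produced by brag, and the text comparison assumes plain ASCII source and identifiers written without escapes. The #lang guideline is not covered.

```racket
#lang racket/base

;; within? : syntax? syntax? -> boolean?
;; True when child's source range lies inside parent's range. If either one
;; lacks a position or span there is nothing to compare, so the check passes.
(define (within? child parent)
  (define cp (syntax-position child))
  (define cs (syntax-span child))
  (define pp (syntax-position parent))
  (define ps (syntax-span parent))
  (or (not (and cp cs pp ps))
      (and (>= cp pp)
           (<= (+ cp cs) (+ pp ps)))))

;; text-matches? : identifier? string? -> boolean?
;; True when the identifier has no location, or when the source text at its
;; location spells the identifier's name.
(define (text-matches? id source)
  (define pos (syntax-position id))
  (define span (syntax-span id))
  (or (not (and pos span))
      (let ([start (sub1 pos)]) ; positions are 1-based
        (and (<= (+ start span) (string-length source))
             (string=? (substring source start (+ start span))
                       (symbol->string (syntax-e id)))))))

;; violations : syntax? string? -> (listof (list symbol? any/c))
;; Walks stx and collects every place where a guideline is broken.
(define (violations stx source)
  (define kids (or (syntax->list stx) '()))
  (append
   ;; Identifiers need their text to appear at their location.
   (if (and (identifier? stx) (not (text-matches? stx source)))
       (list (list 'identifier-text-missing (syntax-e stx)))
       '())
   ;; Children must stay inside the container's location.
   (for/list ([k (in-list kids)] #:unless (within? k stx))
     (list 'outside-container (syntax->datum k)))
   ;; An original container must have only original children.
   (if (and (syntax-original? stx)
            (not (andmap syntax-original? kids)))
       (list (list 'original-with-unoriginal-child (syntax->datum stx)))
       '())
   (apply append (for/list ([k (in-list kids)]) (violations k source)))))

;; Hand-built stand-in for a parse of the text "42 58", where the program
;; identifier shares the enclosing form's location.
(define (at datum pos span)
  (datum->syntax #f datum (vector 'example 1 (sub1 pos) pos span)))

(define tree
  (at (list (at 'program 1 5) (at 42 1 2) (at 58 4 2)) 1 5))

(violations tree "42 58")
```

On this sample the checker should flag only the program identifier, because the text at its claimed location is "42 58" rather than "program".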

mbutterick / brag

Rule name identifiers maybe shouldn't have source locations #34