Store positions of elements

jfmengels commented 10 months ago

One thing that has felt missing in elm-syntax's AST for a long time, and that has given me quite a bit of trouble in some places, is the location of keywords. For instance, in an if expression, it is pretty easy to know where the if keyword is, but it is pretty hard (impossible without access to the raw source code) to know where the then or else keywords are located without making assumptions, such as the file having been elm-formatted.

My proposal is to store this information in the AST node as an additional element:

type Expression =
  -- ...
  | IfBlock (Node Expression) (Node Expression) (Node Expression)

would become

type Expression =
  -- ...
  | IfBlock IfKeywords (Node Expression) (Node Expression) (Node Expression)

type alias IfKeywords =
  { ifRange : Range
  , thenRange : Range
  , elseRange : Range
  }

With this approach, these keywords can easily be ignored (IfBlock _ condition thenBranch elseBranch) and they can also be extracted through pattern matching (IfBlock { thenRange } condition thenBranch elseBranch).
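A minimal sketch of both usages, assuming simplified stand-ins for the elm-syntax types (Location, Range and Node are redefined locally so the snippet compiles on its own, and UnitExpr plus the two function names are hypothetical, added only for illustration):

```elm
module IfKeywordsSketch exposing (..)

-- Simplified stand-ins for the elm-syntax types (assumption: same shape).


type alias Location =
    { row : Int, column : Int }


type alias Range =
    { start : Location, end : Location }


type Node a
    = Node Range a


type alias IfKeywords =
    { ifRange : Range
    , thenRange : Range
    , elseRange : Range
    }


type Expression
    = IfBlock IfKeywords (Node Expression) (Node Expression) (Node Expression)
    | UnitExpr -- placeholder for every other variant



-- Extract a keyword position through record destructuring.


elseKeywordStart : Expression -> Maybe Location
elseKeywordStart expression =
    case expression of
        IfBlock { elseRange } _ _ _ ->
            Just elseRange.start

        UnitExpr ->
            Nothing



-- Ignore the keywords entirely with a wildcard.


isIfBlock : Expression -> Bool
isIfBlock expression =
    case expression of
        IfBlock _ _ _ _ ->
            True

        UnitExpr ->
            False
```

Code that does not care about keyword positions only pays the cost of one extra wildcard per pattern.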

I don't think we should store the whitespace, as it can be computed from the rest.

Non-exhaustive list of things I would like to have here too:

Module definition
- module keyword (can be computed though)
- exposing keyword
- .. symbol without the parens?
- , symbols
Import
- import keyword (can be computed though)
- as keyword
- exposing keyword
Type alias
- type keyword (can be computed though)
- alias keyword (they don't have to be together)
- = symbol
Custom type
- type keyword (can be computed though)
- = and | symbols
Declarations (also let declarations)
- : in
- = symbol
Expressions
- Lists
  - , symbols
- Tuples
  - , symbols. If we have Tuple2 and Tuple3 variants, we can store them as commaRange (for Tuple2) and firstCommaRange and secondCommaRange (for Tuple3).
- Records
  - , symbols
  - : symbol (( Node Pattern, {- range for : -} Range, Node Expression ) ?)
- Update record
  - Same as for records
  - | symbol
- Record access
  - . range and range for the field name separately?
- Case expression
  - case keyword (can be computed though)
  - of keyword
  - -> symbols
- Let expression
  - let keyword (can be computed though)
  - in keyword
- Negation
  - - symbol (can be computed though)
- Lambda expression
  - \ symbol (can be computed though)
  - -> symbol
Type annotation
- The leading : symbol
- -> symbol in function type
- , : and | symbols, just like for records and tuples
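As a rough illustration of how a couple of the entries above could look, here is a sketch for imports and tuples. The field names commaRange, firstCommaRange and secondCommaRange come from the list above. ImportKeywords, the use of Maybe for the optional keywords, UnitExpr and commaRanges are assumptions made for this example, and the Range and Node types are local stand-ins for the elm-syntax ones.

```elm
module TokenRangesSketch exposing (..)

-- Simplified stand-ins for the elm-syntax types (assumption: same shape).


type alias Location =
    { row : Int, column : Int }


type alias Range =
    { start : Location, end : Location }


type Node a
    = Node Range a



-- "as" and "exposing" are optional in an import, hence Maybe (assumption).


type alias ImportKeywords =
    { importRange : Range
    , asRange : Maybe Range
    , exposingRange : Maybe Range
    }


type Expression
    = Tuple2 { commaRange : Range } (Node Expression) (Node Expression)
    | Tuple3 { firstCommaRange : Range, secondCommaRange : Range } (Node Expression) (Node Expression) (Node Expression)
    | UnitExpr -- placeholder for every other variant



-- Collect the comma positions of a tuple, in source order.


commaRanges : Expression -> List Range
commaRanges expression =
    case expression of
        Tuple2 { commaRange } _ _ ->
            [ commaRange ]

        Tuple3 { firstCommaRange, secondCommaRange } _ _ _ ->
            [ firstCommaRange, secondCommaRange ]

        UnitExpr ->
            []
```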

I'm sure we can find more.

I don't know how to model the , in lists yet without making them really cumbersome to use. Maybe as a commaList : List Range, and people will have to use List.map2 or something? This is not super practical since the list of items and the list of commas are off by one, unless we show good examples or have helper functions for this.
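One possible shape for such a helper, as a sketch only: it pads the comma list with a trailing Nothing so that List.map2 lines up despite the off-by-one. The name withFollowingComma is hypothetical, and Range is a local stand-in for the elm-syntax type.

```elm
module CommaListSketch exposing (..)

-- Simplified stand-ins for the elm-syntax types (assumption: same shape).


type alias Location =
    { row : Int, column : Int }


type alias Range =
    { start : Location, end : Location }



-- Pair every item with the comma that follows it, if any.
-- With n items and n - 1 commas, only the last item gets Nothing.


withFollowingComma : List a -> List Range -> List ( a, Maybe Range )
withFollowingComma items commas =
    List.map2 Tuple.pair
        items
        (List.map Just commas ++ [ Nothing ])
```

For three items and two commas c1 and c2, this yields the first item paired with Just c1, the second with Just c2, and the third with Nothing.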

Actually, if we have non-empty lists for these, we #192

A few questions on this:

Should there be a type alias IfKeywords, or should we inline everything IfBlock { ifRange : Range, ... } (Node Expression) (Node Expression) (Node Expression) ? I think the former feels nicer, especially to read.
Should we name it IfKeywords, IfRanges, IfKeywordRanges? For this specific element "keywords" makes sense but for other things, it will mostly be about symbols (,) rather than keywords.

We should do a good job of identifying what is an element, and what isn't. For instance, I recently realized that for exposing (..), we store the position for (..), but you can write exposing ( .. ) if you'd like (the 2 dots need to be next to each other). So in this case, you might want to store

- the position of the entire ( .. )
- the position of the 2 dots
- and potentially the position of the 2 parens, though in this case, this could be computed from the entire range.
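A sketch of what that could look like, assuming the end of a Range is exclusive (one column past the last character). ExposingAllRanges and parenRanges are hypothetical names, and Range is a local stand-in for the elm-syntax type.

```elm
module ExposingAllSketch exposing (..)

-- Simplified stand-ins for the elm-syntax types (assumption: same shape).


type alias Location =
    { row : Int, column : Int }


type alias Range =
    { start : Location, end : Location }


type alias ExposingAllRanges =
    { range : Range -- the entire ( .. )
    , dotsRange : Range -- just the ..
    }



-- The parens are the first and last character of the entire range,
-- so they can be derived instead of stored.
-- Assumption: range.end points one column past the closing paren.


parenRanges : Range -> { open : Range, close : Range }
parenRanges range =
    { open =
        { start = range.start
        , end = { row = range.start.row, column = range.start.column + 1 }
        }
    , close =
        { start = { row = range.end.row, column = range.end.column - 1 }
        , end = range.end
        }
    }
```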

I don't know whether we should store the position of elements whose position can easily be computed from the element's range. For example, the if keyword will always be the first 2 characters of the if expression's range ({ start = ifRange.start, end = { row = ifRange.start.row, column = ifRange.start.column + 2 } }). For performance reasons mostly (having a smaller data structure), I think we could skip storing that one, and potentially have a function to do the computation. Similarly for List and Record, we don't store the position for [ and {. I agree it will be a bit inconsistent though.

If we want to go for the least amount of memory, we could even store the Location (just the row and column) instead of the Range, since the size is statically known (only do this for those that can be computed like this).
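A sketch of how both ideas could fit together: a generic helper rebuilds a keyword's Range from its start Location and its statically known length, which covers the computed if keyword as well as keywords stored as a bare Location. All names here (keywordRange, ifKeywordRange, IfKeywordLocations, thenKeywordRange) are hypothetical, and Location and Range are local stand-ins for the elm-syntax types.

```elm
module KeywordLocationSketch exposing (..)

-- Simplified stand-ins for the elm-syntax types (assumption: same shape).


type alias Location =
    { row : Int, column : Int }


type alias Range =
    { start : Location, end : Location }



-- Rebuild a Range from a start Location and a known token length.
-- Only valid for tokens that never span multiple lines, such as keywords.


keywordRange : Int -> Location -> Range
keywordRange length start =
    { start = start
    , end = { row = start.row, column = start.column + length }
    }



-- The if keyword is the first 2 characters of the whole if expression.


ifKeywordRange : Range -> Range
ifKeywordRange ifExpressionRange =
    keywordRange 2 ifExpressionRange.start



-- Storing only the start of then and else, instead of full Ranges.


type alias IfKeywordLocations =
    { thenStart : Location
    , elseStart : Location
    }


thenKeywordRange : IfKeywordLocations -> Range
thenKeywordRange { thenStart } =
    keywordRange 4 thenStart
```

The trade-off is one Location per keyword instead of two, at the cost of calling a helper whenever the full Range is needed.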

@MartinSStewart You mentioned somewhere that you would like the AST to contain more information in order to help with fixes or migrations (or to be able to use the AST for fixes). Would this suffice for you, or do you feel like you'd still be missing some information? My assumption is that if you have the position of every keyword and token, you can then compare ranges to count the whitespace. The only thing you wouldn't be able to tell is the whitespace at the end of a line, but I'm not sure how important that is, since it will usually be removed by elm-format anyway.
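A sketch of the range comparison, assuming 1-based columns, exclusive range ends, and nothing but whitespace between the two tokens (a comment in between would be miscounted). Gap and gapBetween are hypothetical names, and Range is a local stand-in for the elm-syntax type.

```elm
module WhitespaceSketch exposing (..)

-- Simplified stand-ins for the elm-syntax types (assumption: same shape).


type alias Location =
    { row : Int, column : Int }


type alias Range =
    { start : Location, end : Location }


type Gap
    = SameLine Int -- number of columns between the two tokens
    | LineBreaks Int Int -- number of line breaks, then indentation of the next token



-- Describe the whitespace between two consecutive tokens.
-- Trailing whitespace on the previous token's line cannot be recovered.


gapBetween : Range -> Range -> Gap
gapBetween previous next =
    if previous.end.row == next.start.row then
        SameLine (next.start.column - previous.end.column)

    else
        LineBreaks
            (next.start.row - previous.end.row)
            (next.start.column - 1)
```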

I don't know if having this information would make writing/codegening ASTs easier or harder in practice.

MartinSStewart commented 10 months ago

So my understanding is that you want element positions in order to make text edits easier for elm-review. But what if elm-review worked with the AST directly? That is to say, if you want to provide a fix, you just update the AST data structure.

Currently this isn't practical since the AST doesn't store line breaks or comments. Without that information the code would change considerably when elm-review applies a fix. But if V8 of elm-syntax did include that information, and if elm-review operated on the level of AST changes rather than text replacement, would it make sense to track element positions?

lue-bird commented 10 months ago

The ranges are also used for the displayed error range. E.g. it's a lot nicer to mark only the ++ in [ 1 * a ] ++ bs instead of the whole expression, because then an error range on 1 * a won't sit inside, and be hidden by, the outer one.

I love your suggestion anyway, though, even if it doesn't remove the need for these token ranges.

jfmengels commented 10 months ago

As @lue-bird said, it's mostly to get more information out of the AST, to put squiggly lines in the ideal position (like on the ++), but also to have better string-based fixes.

I've created a separate issue to discuss AST replacements: https://github.com/stil4m/elm-syntax/issues/206

stil4m / elm-syntax

Store positions of elements #205