Open vemoo opened 4 years ago
I am very hesitant about supporting empty tokens, as they open all kinds of nasty edge cases.
In particular, with empty tokens, TokenAtOffset
type does not really make sense, as there may be arbitrary many empty tokens at a given position, and, if we fully support empty tokens, we should return a Vec
or an Iterator
here.
Initially (way back in fall
) I've used empty tokens and nodes extensively (for example, syntax errors like "missing ;
" were modeled as empty nodes), but that caused a lot of problems down the line, exectly because the API became non-intuitive due to all of edge cases.
I'd love to go further and even forbid empty internal nodes, but that unfortunately hits a really horrible edge-case: an empty file. I am even wondering if the API would be better if we had an explicit EMPTY_FILE token for this case....
But returning TokenAtOffset::Single
in the presence of empty tokens also doesn't make sense.
For the example of tokens ["A", "+", "", "*", "C"]
token_at_offset(2)
before this change returns TokenAtOffset::Single("+")
but why not TokenAtOffset::Between("+", "*")
or TokenAtOffset::Single("*")
?
I realise there could be arbitrary many empty tokens, but TokenAtOffset::Between("+", "")
could be considered left biased in a way. With empty tokens the return type will always be somthing like TokenAtOffset::Between(<non_empty>, <empty>)
.
Other alternative would be to disallow them when building them.
Other alternative would be to disallow them when building them.
Yeah, we currently "soft-disalow" them, in that code assumes that tokens are non-empty. We don't have any active assetions though, because nothing actively breaks with empty tokens. If one has to have empty tokens, one can write custom algorithms which do something resonable
When I first learned indexes, I learned them like the cursor of a text editor:
|a|b|c|d|e|f|g|
^ ^ ^ ^ ^ ^ ^ ^
0 1 2 3 4 5 6 7
Per this understanding of the offset (cursor position), then TokenAtOffset::Between
is definitely correct for when the cursor is next to a zero-width token.
For example, given the tokens ["a", "", "b"]
and offset 1, the cursor could either be at a|␣b
or a␣|b
. So for this, TokenAtOffset::Between("a", "")
or ::Between("", "a")
would be correct, you could argue ::Between("a", "b")
(but I'd argue against), and ::Single("a")
is just incorrect (the position is not within the span of the a
token, but after it).
This is usefull if you are creating empty tokens to represent virtual tokens that were added to make the ast make sense.
Before, for the tokens
["A", "+", ""]
callingtoken_at_offset(2)
would returnTokenAtOffset::Single("+")``. Now it returns
TokenAtOffset::Between("+", "")`.I've added a simple test, and I've also run the
rust-analyzer
tests with the modifiedrowan
crate.Something to take into account with this implementation is that if
token_at_offset
is called in a non root node withoffset
at the start or end of the node,TokenAtOffset::Between
will be returned with a token from outside this node.