rust-analyzer / rowan

Apache License 2.0
697 stars 57 forks source link

take empty tokens into account #51

Open vemoo opened 4 years ago

vemoo commented 4 years ago

This is usefull if you are creating empty tokens to represent virtual tokens that were added to make the ast make sense.

Before, for the tokens ["A", "+", ""] calling token_at_offset(2) would return TokenAtOffset::Single("+")``. Now it returnsTokenAtOffset::Between("+", "")`.

I've added a simple test, and I've also run the rust-analyzer tests with the modified rowan crate.

Something to take into account with this implementation is that if token_at_offset is called in a non root node with offset at the start or end of the node, TokenAtOffset::Between will be returned with a token from outside this node.

matklad commented 4 years ago

I am very hesitant about supporting empty tokens, as they open all kinds of nasty edge cases.

In particular, with empty tokens, TokenAtOffset type does not really make sense, as there may be arbitrary many empty tokens at a given position, and, if we fully support empty tokens, we should return a Vec or an Iterator here.

Initially (way back in fall) I've used empty tokens and nodes extensively (for example, syntax errors like "missing ;" were modeled as empty nodes), but that caused a lot of problems down the line, exectly because the API became non-intuitive due to all of edge cases.

I'd love to go further and even forbid empty internal nodes, but that unfortunately hits a really horrible edge-case: an empty file. I am even wondering if the API would be better if we had an explicit EMPTY_FILE token for this case....

vemoo commented 4 years ago

But returning TokenAtOffset::Single in the presence of empty tokens also doesn't make sense. For the example of tokens ["A", "+", "", "*", "C"] token_at_offset(2) before this change returns TokenAtOffset::Single("+") but why not TokenAtOffset::Between("+", "*") or TokenAtOffset::Single("*")?

I realise there could be arbitrary many empty tokens, but TokenAtOffset::Between("+", "") could be considered left biased in a way. With empty tokens the return type will always be somthing like TokenAtOffset::Between(<non_empty>, <empty>).

Other alternative would be to disallow them when building them.

matklad commented 4 years ago

Other alternative would be to disallow them when building them.

Yeah, we currently "soft-disalow" them, in that code assumes that tokens are non-empty. We don't have any active assetions though, because nothing actively breaks with empty tokens. If one has to have empty tokens, one can write custom algorithms which do something resonable

CAD97 commented 4 years ago

When I first learned indexes, I learned them like the cursor of a text editor:

|a|b|c|d|e|f|g|
^ ^ ^ ^ ^ ^ ^ ^
0 1 2 3 4 5 6 7

Per this understanding of the offset (cursor position), then TokenAtOffset::Between is definitely correct for when the cursor is next to a zero-width token.

For example, given the tokens ["a", "", "b"] and offset 1, the cursor could either be at a|␣b or a␣|b. So for this, TokenAtOffset::Between("a", "") or ::Between("", "a") would be correct, you could argue ::Between("a", "b") (but I'd argue against), and ::Single("a") is just incorrect (the position is not within the span of the a token, but after it).