`Scanner` iterates over Unicode scalars, but it previously split strings by `Character` index. Since multiple Unicode scalars can combine into a single character, the scalar offset and the character offset can diverge, which could lead to incorrect tokenization and therefore unknown tags.
This patch changes `Scanner` to use the indexes of the respective `UnicodeScalarView` instead.
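
To illustrate the mismatch, here is a minimal standalone sketch. It is not the actual `Scanner` code: the sample string and all variable names are made up for the example, and it only assumes standard library `String` APIs. An offset counted in Unicode scalars is applied once to the `Character` view (the old behaviour, roughly) and once to the `UnicodeScalarView`:

```swift
// "café" written with a combining acute accent: "e" + U+0301 is
// two Unicode scalars but a single Character.
let text = "cafe\u{301} {{ name }}"
let scalars = text.unicodeScalars

// Locate the first "{" while walking Unicode scalars, as a scanner would.
guard let braceIndex = scalars.firstIndex(of: "{") else {
    fatalError("sample text is expected to contain a brace")
}
let scalarOffset = scalars.distance(from: scalars.startIndex, to: braceIndex)

// Applying the scalar offset to the Character view overshoots by one,
// because the combining accent does not count as its own Character.
let characterIndex = text.index(text.startIndex, offsetBy: scalarOffset)
let splitByCharacter = String(text[characterIndex...])

// Applying the same offset to the unicode scalar view lands on the brace.
let scalarIndex = scalars.index(scalars.startIndex, offsetBy: scalarOffset)
let splitByScalar = String(text[scalarIndex...])

print(splitByCharacter) // "{ name }}"
print(splitByScalar)    // "{{ name }}"
```

With the character-based split, the opening `{{` loses its first brace, which is the kind of off-by-one that can leave a tag unrecognized.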
All tests pass and performance is basically unchanged.
Bigger picture: I wonder whether `Unicode.Scalar` should be used at all in `Lexer`/`Scanner`, as splitting is conceptually always done by character.
Probably fixes #276 (it does for my case).