kaby76 / one-parser


Lexer does not always set token start/stop/text attributes correctly, which makes source text reconstruction hard #2

Open kaby76 opened 3 months ago

kaby76 commented 3 months ago

PatternTest contains the following input characters:

import System
import System.Collections.Generic

namespace ONE.Test.DeclarationPattern:

    public class Test:

        public:

But the tokenizer produces this:

[@0,0:5='import',<57>,1:0]
[@1,6:6=' ',<10>,channel=1,1:6]
[@2,7:12='System',<195>,1:7]
[@3,13:13='\n',<282>,2:0]
[@4,14:19='import',<57>,2:0]
[@5,20:20=' ',<10>,channel=1,2:6]
[@6,21:26='System',<195>,2:7]
[@7,27:27='.',<216>,2:13]
[@8,28:38='Collections',<195>,2:14]
[@9,39:39='.',<216>,2:25]
[@10,40:46='Generic',<195>,2:26]
[@11,49:49='\n',<282>,5:0]
[@12,50:58='namespace',<69>,5:0]
[@13,59:59=' ',<10>,channel=1,5:9]
[@14,60:62='ONE',<195>,5:10]
[@15,63:63='.',<216>,5:13]
[@16,64:67='Test',<195>,5:14]
[@17,68:68='.',<216>,5:18]
[@18,69:86='DeclarationPattern',<195>,5:19]
[@19,87:87=':',<218>,5:37]
[@20,93:93=' ',<282>,7:4]
[@21,90:93='    ',<1>,7:4]
[@22,94:99='public',<82>,7:4]
[@23,100:100=' ',<10>,channel=1,7:10]
[@24,101:105='class',<29>,7:11]
[@25,106:106=' ',<10>,channel=1,7:16]
[@26,107:110='Test',<195>,7:17]
[@27,111:111=':',<218>,7:21]
[@28,121:121=' ',<282>,9:8]
[@29,114:121='        ',<1>,9:8]
[@30,122:127='public',<82>,9:8]
[@31,128:128=':',<218>,9:14]
[@32,142:142=' ',<282>,11:12]
[@33,131:142='            ',<1>,11:12]

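The problem is that it's difficult to reconstruct the text from these tokens. For example, token 10 stops at index 46 and token 11 starts at 49, so indices 47 and 48 belong to no token. Likewise, indices 88 and 89 are not covered, and tokens 20 and 21 both claim index 93 and appear out of start order. Token 20 also has type <282>, the same type as the '\n' tokens, yet its text is ' '.

It would be nice if the token stream reflected the input text, but it doesn't.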
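
Uncovered and doubly covered character indices in a dump like the one above can be found mechanically. The following is a minimal Python sketch, not code from the repository: the function name `coverage_problems` and the tuple layout are made up for illustration, and the start/stop values are copied by hand from tokens 10 through 22 of the dump.

```python
def coverage_problems(tokens, lo, hi):
    """Return (gaps, overlaps) for character indices lo..hi inclusive.

    tokens: iterable of (token_index, start, stop) with inclusive bounds,
    mirroring the '@n,start:stop' fields of the dump.
    """
    owners = {i: [] for i in range(lo, hi + 1)}
    for tok_index, start, stop in tokens:
        for i in range(start, stop + 1):
            if i in owners:
                owners[i].append(tok_index)
    gaps = [i for i, toks in owners.items() if not toks]
    overlaps = {i: toks for i, toks in owners.items() if len(toks) > 1}
    return gaps, overlaps


# (token index, start, stop) for tokens 10..22, copied from the dump.
dump = [
    (10, 40, 46), (11, 49, 49), (12, 50, 58), (13, 59, 59),
    (14, 60, 62), (15, 63, 63), (16, 64, 67), (17, 68, 68),
    (18, 69, 86), (19, 87, 87), (20, 93, 93), (21, 90, 93),
    (22, 94, 99),
]

gaps, overlaps = coverage_problems(dump, 40, 99)
print(gaps)      # [47, 48, 88, 89]
print(overlaps)  # {93: [20, 21]}
```

Any index reported in `gaps` is text that a plain concatenation of token text would lose, and any index in `overlaps` would be emitted twice.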

kaby76 commented 3 months ago

Since this problem may exist in other grammars, the workaround I chose is to go through the input character by character, select all tokens that may span each character, and then verify that each token's text equals the input substring between the token's start and stop indices.
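
The character-by-character check described above could look roughly like the sketch below. It is an illustration under assumptions rather than the actual implementation: `Tok`, `verify_per_char`, the sample `source` string, and the sample tokens are all hypothetical, and token bounds are taken to be inclusive as in the dump.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    start: int  # inclusive start character index
    stop: int   # inclusive stop character index
    text: str


def verify_per_char(source, tokens):
    """For each character index, keep the tokens that span it AND whose
    text equals the source substring given by their start/stop indices."""
    verified = []
    for i in range(len(source)):
        spanning = [t for t in tokens if t.start <= i <= t.stop]
        verified.append(
            [t for t in spanning if source[t.start:t.stop + 1] == t.text]
        )
    return verified


# Hypothetical input: 'a', ':', two newlines, two spaces, 'b'.
source = "a:\n\n  b"
tokens = [
    Tok(0, 0, "a"),
    Tok(1, 1, ":"),
    Tok(5, 5, " "),   # overlaps the next token at index 5
    Tok(4, 5, "  "),
    Tok(6, 6, "x"),   # deliberately wrong text: the source has 'b' here
]

verified = verify_per_char(source, tokens)
# Characters that no trustworthy token accounts for.
unexplained = [i for i, toks in enumerate(verified) if not toks]
print(unexplained)
```

Here indices 2 and 3 are unexplained because no token spans them, and index 6 is unexplained because the only token spanning it fails the text comparison.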

I can then create a "heat map" of the calls to "LA()" in the lexer for each character in the input. E.g.,

heatmap.html.txt
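
As a rough illustration of how such a heat map might be collected, the sketch below counts `LA()` calls per input character. It does not use the ANTLR runtime: `CountingCharStream` and `toy_lex` are hypothetical stand-ins, and the stream assumes the convention that `LA(1)` returns the current character and only supports offsets of 1 or more.

```python
class CountingCharStream:
    """Hypothetical stand-in for a lexer input stream that counts LA() calls."""

    EOF = -1

    def __init__(self, text):
        self.text = text
        self.index = 0
        self.heat = [0] * len(text)  # heat[i] = number of LA() hits on char i

    def LA(self, offset):
        # Assumed convention: LA(1) is the current character, LA(2) the next.
        if offset < 1:
            raise ValueError("this sketch only supports offset >= 1")
        pos = self.index + offset - 1
        if pos >= len(self.text):
            return self.EOF
        self.heat[pos] += 1
        return ord(self.text[pos])

    def consume(self):
        self.index += 1


def toy_lex(stream):
    """Toy lexer: runs of letters become one token, anything else is a
    single-character token. Returns inclusive (start, stop) pairs."""
    tokens = []
    while stream.LA(1) != stream.EOF:
        start = stream.index
        if chr(stream.LA(1)).isalpha():
            while stream.LA(1) != stream.EOF and chr(stream.LA(1)).isalpha():
                stream.consume()
        else:
            stream.consume()
        tokens.append((start, stream.index - 1))
    return tokens


stream = CountingCharStream("ab c")
tokens = toy_lex(stream)
for ch, count in zip(stream.text, stream.heat):
    print(repr(ch), count)
```

With a real lexer, the same idea would presumably be applied by wrapping or subclassing the runtime's character stream so that every lookahead goes through the counting method.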