seandenigris / Resources-Live

GNU General Public License v3.0

HTML SmaCC Parser #25

Open seandenigris opened 3 years ago

seandenigris commented 3 years ago

The initial import from ANTLR 4 is already done. I used John Brant's script to convert both the lexer and the parser grammar from https://raw.githubusercontent.com/antlr/grammars-v4/master/html. I pasted the results into the source view of https://github.com/seandenigris/Resources-Live/blob/master/src/ResourcesLive/RlHTMLParser.class.st, which also generated https://github.com/seandenigris/Resources-Live/blob/master/src/ResourcesLive/RlHTMLScanner.class.st, but the parser does not work yet. To fix it (per John Brant on the GToolkit Discord help channel, 10/12/2020):

Looking at your grammar, I think the next step would be to fix the TODO parts of the grammar that the conversion tool couldn't handle. There appear to be two main issues that the conversion didn't handle. The first is that SmaCC doesn't have non-greedy matching for the scanner (`.*?`). The other is the pushMode/popMode code.

For the non-greedy matching, the regex needs to be modified. Some of them are easy to modify, like SCRIPT_OPEN, which can be changed to `\<script [^>]* >`: since it ends with a single `>`, we can take any character except `>`. Items like SCRIPTLET, which end with either `?>` or `%>`, need a more complex regex, similar to the one for a C-style comment `/* */` (e.g., `\/\* [^*]* \*+ ([^\/*] [^*]* \*+)* \/` handles C comments).

For the push/popMode stuff, you'll need to add a production before the token is used in the grammar. For example, in the script production, you would write `PushScript ....`. Then you'll need to create a `PushScript : [self scope: #SCRIPT];`. Similarly for popMode, you would create a production like `Pop` to add before that token. For now, you could define it as `Pop : [self scope: #default];`. If a stack is really needed, then the push and pop rules will need to be modified a little.
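
To sanity-check the non-greedy rewrite before touching the scanner, here is a minimal sketch in Python's `re` syntax (not SmaCC scanner syntax). It assumes the ANTLR rules are roughly `'<script' .*? '>'` for SCRIPT_OPEN and `'<?' .*? '?>' | '<%' .*? '%>'` for SCRIPTLET. The helper names `until_pair` and `first_match` and the sample strings are made up for illustration. It compares each lazy pattern with a greedy-only equivalent built the same way as the C-comment regex.

```python
import re

# Non-greedy forms, as assumed from the ANTLR grammar.
# DOTALL because ANTLR's '.' also matches newlines.
SCRIPT_OPEN_LAZY = re.compile(r"<script.*?>", re.DOTALL)
SCRIPTLET_LAZY = re.compile(r"<\?.*?\?>|<%.*?%>", re.DOTALL)

# Greedy-only rewrite for a one-character terminator:
# take anything except the terminator, then the terminator.
SCRIPT_OPEN_GREEDY = re.compile(r"<script[^>]*>")


def until_pair(first: str, last: str) -> str:
    """Greedy-only pattern matching up to the first `first + last` pair.

    Same shape as the C-comment regex [^*]* \\*+ ([^/*] [^*]* \\*+)* /
    with '*' replaced by `first` and '/' replaced by `last`.
    """
    f, l = re.escape(first), re.escape(last)
    return rf"[^{f}]*{f}+(?:[^{f}{l}][^{f}]*{f}+)*{l}"


# Greedy-only rewrite for the two-character terminators ?> and %>.
SCRIPTLET_GREEDY = re.compile(
    r"<\?" + until_pair("?", ">") + "|<%" + until_pair("%", ">")
)


def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    samples = [
        (SCRIPT_OPEN_LAZY, SCRIPT_OPEN_GREEDY,
         '<script type="text/javascript"> var a = 1 > 0;'),
        (SCRIPTLET_LAZY, SCRIPTLET_GREEDY,
         '<?php echo "a > b ? c : d"; ?> trailing ?>'),
        (SCRIPTLET_LAZY, SCRIPTLET_GREEDY,
         "<% x = 5 % 2 %> rest %>"),
    ]
    for lazy, greedy, text in samples:
        # Both forms should stop at the first terminator.
        print(first_match(lazy, text) == first_match(greedy, text))
```

If the two forms agree on representative inputs, the greedy-only pattern can then be transcribed into SmaCC's regex notation by hand. The escaping rules differ between the two syntaxes, so the Python patterns should not be pasted in verbatim.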