rslemos / cobolg

COBOL grammar
GNU General Public License v3.0

RPG lexer/parser #1

Open lppedd opened 3 years ago

lppedd commented 3 years ago

Hey! This is purely an informative issue. I'm considering writing a lexer and a parser for fixed and free format RPG code. How long did it take to get to a decent result? I see the commits span two years, but I don't think it would take that long, right?

rslemos commented 3 years ago

Sorry for the late answer.

It was a pet project (but aspiring to be complete and high quality). I worked on it in short bursts. Usually I left it dormant for some time until I came up with a better idea to fix some issue. My project is dead by now, because we've got rid of all of our COBOL source code.

My nightmares were in two specific points:

  1. You should bear in mind that my lexer+parser solution had to cope with incomplete and possibly invalid input, because back then I needed a parser that could index my whole code base. If you know C, it is the same pain as handling those #define and #include "statements": a compiler just needs to "act" upon them and carry on, but a parser used for indexing has to treat them as a kind of formal statement (think of an IDE with a "ctrl+click" feature on a "#defined" macro, for instance). THAT was the biggest issue.

  2. The fixed format IS ALSO an issue. Throughout the project I changed my strategy for handling it many times. I failed. My last insight was that I needed a very clear separation of lexer and parser, something that ANTLR4 allows (and somewhat encourages), but not to the extent I thought I needed (token numbering is a kind of dependency that is hard to tackle). Back then I filed an issue with them (https://github.com/antlr/antlr4/issues/1779), but it was largely ignored, and even today it remains open without a comment.
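
To make the fixed-format point concrete, here is a minimal sketch in Python (not how cobolg does it). It assumes COBOL's fixed reference format: columns 1-6 hold a sequence number, column 7 an indicator, columns 8-72 the program text, and columns 73-80 an identification area. The names `FixedLine` and `split_fixed_line` are made up for this illustration, and RPG's column layout would need different boundaries.

```python
from typing import NamedTuple


class FixedLine(NamedTuple):
    """One source line split by column area (hypothetical helper type)."""
    sequence: str        # columns 1-6
    indicator: str       # column 7 ('*' marks a comment, '-' a continuation)
    text: str            # columns 8-72 (areas A and B)
    identification: str  # columns 73-80, plus anything beyond column 80


def split_fixed_line(line: str) -> FixedLine:
    """Split a line by column position without dropping any character.

    Short lines simply yield empty strings for the missing areas, so
    joining the four parts always gives back the original line.
    """
    return FixedLine(line[0:6], line[6:7], line[7:72], line[72:])


parts = split_fixed_line("000100* A COMMENT LINE")
print(parts.indicator)  # the comment indicator from column 7
```

Splitting by column is the easy part; the hard part described above is deciding where this step lives relative to the lexer and parser.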

Combine those 2 points and you've got the mother of all nightmares: your lexer has to tokenize every single character in the input, never ignoring/hiding characters, and without introducing "phantom" markers (which at some point I tried). My last idea was to code the key parts of my lexer by hand (I mean, implement those ANTLR4 interfaces in Java) and delegate the easy parts to a lexer generated by ANTLR4.
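
The "tokenize every single character" requirement can be sketched as follows. This is a toy Python example built on the standard `re` module, not the ANTLR4-based approach described above. The token kinds and the `tokenize` function are assumptions made up for the illustration. Whitespace becomes a token instead of being skipped, and anything unrecognized becomes a one-character `UNKNOWN` token instead of an error.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    offset: int


# The last alternative matches any single character, so every position in
# the input is covered by some token and the lexer never has to give up.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<WORD>[A-Za-z][A-Za-z0-9-]*)
  | (?P<NUMBER>[0-9]+)
  | (?P<STRING>"[^"\n]*"|'[^'\n]*')
  | (?P<DOT>\.)
  | (?P<UNKNOWN>.)
""", re.VERBOSE | re.DOTALL)


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens that cover the whole input, with no gaps and no skips."""
    for match in TOKEN_RE.finditer(source):
        yield Token(match.lastgroup, match.group(), match.start())


source = 'MOVE "abc TO X.\n'  # invalid input: the string is never closed
tokens = list(tokenize(source))
# Concatenating the token texts reproduces the input exactly.
assert "".join(t.text for t in tokens) == source
```

Because nothing is hidden or dropped, an indexer can map any token back to its exact position in the file, even when the surrounding code does not parse.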

If your compiler doesn't have to handle invalid input (I mean, it could just spit out an error and stop), it should be fairly easy to write one in a few weeks (even with fixed format).

I wish you success in your enterprise.