
Injections make multi-language code generation support harder. #74

Open vsajip opened 3 years ago

vsajip commented 3 years ago

Injections surely are a nice way to organize code in one's node classes. However, they do present some problems if one wants to generate code in other languages from a grammar. It's one thing to translate lexer and parser fields and methods to e.g. Python - they will not change very often. However, as soon as you start introducing methods in injections that drive the parsing process itself, you have to duplicate that logic in every language, from just a knowledge of Java and the grammar. Case in point: if a SCAN condition uses node.isAssignableTo(), then a parser generated in Python would need that call translated when generating the code for the SCAN, and it would also need an implementation of isAssignableTo() in the relevant node classes implemented in Python. Avoiding that non-trivial amount of work would mean one of the following (see the sketch after this list):

  1. JavaCC would have to translate essentially arbitrary Java classes (injected methods and fields) from the grammar into the code generation language. This is not a trivial exercise, even if the current Java AST were more translation-friendly than it is!
  2. Users would have to duplicate code for these injections in different languages. This strikes me as user-hostile - I am pretty sure users would view such a requirement as lame!
  3. Limits can be placed on logic-based lookahead (vs. token lookahead) if multi-language support is wanted, to avoid this sort of issue. At the very least, this might need to be done for grammars used internally in JavaCC.
  4. Provide some additional base functionality in Node/BaseNode offering a less general, but more easily translatable, framework for lookaheads etc.
  5. Some other solution that I haven't thought of!
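
To make the duplication in option 2 concrete, here is a minimal sketch (hypothetical, hand-written Python - not anything a generator produces) of what it would entail:

    class BaseNode:  # stand-in for the Python runtime's base node class
        pass

    class Expression(BaseNode):
        def is_assignable_to(self):
            # This would have to replicate, by hand, whatever logic the
            # grammar's INJECT block put into the Java Expression class -
            # and be kept in sync with it manually as the grammar evolves.
            raise NotImplementedError
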
revusky commented 3 years ago

Well, maybe it is 5. Have you thought about the implications of the fact that we have a preprocessor? https://javacc.com/2021/02/01/javacc-21-has-a-preprocessor/

Consider:

    #if python
        SCAN {some python code} =>
    #elif java
        SCAN {some java code} =>
    #endif
        Foo Bar Baz

I agree that translating arbitrary Java code to Python (or anything else) is pretty non-trivial. In fact, it might just be a bridge too far. And so, the conversation we were having about the Java AST not being translation-friendly might not be so important. Though, that said, I'm glad you brought it up, because it caused me to revisit that part of the Java grammar, and there are some errors there! There are all these weird things in Java that nobody ever uses, but that are in the spec - like the fact that you can put type arguments in a MethodReference: Foo::<type1,type2>bar

But I can run the parser over all 14000+ source files in JDK 16 src.zip and there is no problem, because not a single file has type arguments in a method reference.

Another thing I corrected in the last few days, less than a week ago, is that you can put annotations in front of a type parameter, but not in front of a type argument. Something like: class MyClass <@SomeAnnotation SomeType> ... is okay, but you can't write: class MyClass extends List<@SomeAnnotation SomeType>...

But I had it reversed! Until a few days ago, my Java grammar allowed annotations on type arguments but not on type parameters! But it's the other way round! And even parsing through thousands upon thousands of Java source files in the wild, I never hit that issue, so it went uncorrected. Why? Because nobody puts annotations in those spots anyway!

I suppose it's safe to say that if nobody uses a language feature in all the thousands of JDK source files, it's probably not a very useful feature. But... if the spec says you can put an annotation there, then.... This stuff does drive one crazy, but it is kind of fun in a weird, masochistic way somehow....

I've added a lot of features to JavaCC, but all of them were motivated by real use cases. The thing is that JavaCC is written in JavaCC (largely, anyway) so I frequently add new features to support internal development. I added the preprocessor because I was thinking about starting work on a C# parser, but also I figured it would eventually be needed for this kind of situation.

vsajip commented 3 years ago

Consider:

    #if python
        SCAN {some python code} =>
    #elif java
        SCAN {some java code} =>
    #endif
        Foo Bar Baz

Well, I consider that akin to option 2 in the above list. If the Java code relies on some injected method, the Python code has to do something equivalent; this is essentially no different from having injected methods added by hand for different languages. Case in point: if the Java code is foo.isMethodCall(), then the Python code needs to have the equivalent of that Java method somewhere, somehow.

And so, the conversation we were having about the Java AST not being translation-friendly might not be so important.

Well, I'm not sure about that. You might still need to translate Java calls to parser/lexer methods (which would have a translated equivalent provided). In fact, if you developed a Java grammar containing only LOOKAHEAD(N)-type lookaheads (no custom logic-based lookahead) and actions that contain calls only to BaseNode/Token/non-injected parser methods, then you could have a JavaCC that e.g. generated Python code from a grammar. My testing with the JSON grammar so far seems to confirm that this is feasible, at least at the level of a smoke test.

Like you can put type arguments in a MethodReference. Like: Foo::<type1,type2>bar

That's a generic method specialization, isn't it? I believe C++ and C# have them too. They're not commonly used, I'll grant you - I think you might have to run the parser on the Java TCK test sources to catch all these corner cases.

But I can run the parser over all 14000+ source files in JDK 16 src.zip and there is no problem. Because not a single file has type parameters in a method reference.

That's not conclusive. The Python standard library doesn't use all of the features of the Python language, for example. However, the aggregate of the sources used to test the compiler/parser in the test suite should. Likewise for Java and other languages, too, I imagine.

I've added a lot of various features to JavaCC but all of them were motivated by real use cases. The thing is that JavaCC is written in JavaCC (largely anyway) so I frequently add new features to support internal development. I added the preprocessor because I was thinking about starting work on a C# parser, but also I figured it would eventually be needed for this kind of situation.

No doubt, and I'm not quibbling about the usefulness of features like INJECT. Just pointing out the problems they pose for cross-language code generation. It might make the JavaCC internals less convenient if you had to dispense with logic-based lookaheads calling into injected methods, but it seems like it might be needed.

By the way, hacking at it produces some interesting results, though without a better AST, the hack is pretty fragile. Here are some Java expressions taken from the internal Java grammar, and their generated Python equivalents:

1. # "lhs.isAssignableTo()" -> lhs.is_assignable_to()
2. # "lhs.isMethodCall()||lhs.isConstructorInvocation()||lhs.isAllocation()" -> lhs.is_method_call() or lhs.is_constructor_invocation() or lhs.is_allocation()
3. # "isParserTolerant()||permissibleModifiers.contains(getToken(1).getType())" -> self.is_tolerant or (self.get_token(1).type) in (permissible_modifiers)
4. # "currentLookaheadToken==null&&!((Expression)peekNode()).isAssignableTo()" -> self.current_lookahead_token is None and not self.peek_node().is_assignable_to()
5. # "currentLookaheadToken!=null||((Expression)peekNode()).isAssignableTo()" -> self.current_lookahead_token is not None or self.peek_node().is_assignable_to()
6. # "getToken(1).getType()!=TokenType._DEFAULT" -> self.get_token(1).type != TokenType._DEFAULT

Some of these would work, e.g. Nos. 3 & 6, as they use base parser functionality only. Nos. 1, 2, 4 & 5 generate correct code, but would fail at runtime unless the injected methods such as isAssignableTo(), isAllocation() etc. were translated over too - and they're essentially arbitrary Java, so it's a bit hard to do that!
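
For the record, the hack behind these translations amounts to the kind of regex-based rewriting sketched below. This is an illustrative reconstruction, not the actual code, and it handles only the shapes of expression shown above - in particular it doesn't snake-case bare field references or add self. prefixes:

    import re

    def camel_to_snake(name):
        # isAssignableTo -> is_assignable_to
        return re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name).lower()

    def translate_expression(expr):
        # Drop casts: ((Expression)peekNode()) -> peekNode()
        expr = re.sub(r'\(\(\w+\)\s*(\w+\(\))\)', r'\1', expr)
        # Null comparisons become identity tests
        expr = re.sub(r'\s*==\s*null', ' is None', expr)
        expr = re.sub(r'\s*!=\s*null', ' is not None', expr)
        # Boolean operators; rewrite bare ! but leave != alone
        expr = expr.replace('&&', ' and ').replace('||', ' or ')
        expr = re.sub(r'!(?!=)', 'not ', expr)
        # Snake-case method calls: isAssignableTo( -> is_assignable_to(
        return re.sub(r'\b([a-z]\w*)\(',
                      lambda m: camel_to_snake(m.group(1)) + '(', expr)

    print(translate_expression(
        'currentLookaheadToken==null&&!((Expression)peekNode()).isAssignableTo()'))
    # currentLookaheadToken is None and not peek_node().is_assignable_to()
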

vsajip commented 3 years ago

By the way, just as a data point, it looks like ANTLR supports multiple languages without e.g. parsing the actions, perhaps just tokenizing to keep track of curlies.

revusky commented 3 years ago

By the way, just as a data point, it looks like ANTLR supports multiple languages without e.g. parsing the actions, perhaps just tokenizing to keep track of curlies.

Well, frankly, to me, that looks totally lame. It means that if you have a typo in a code action (unless that typo is a mismatched delimiter, in which case it does catch it...) then it generates the whole shebang and the typo is then caught by the compiler, except that the compiler is giving you an error location based on the generated code, not the real source of the error.

Aside from obliging you to trace back to the source of the error (i.e. where it needs to be fixed), you are notified of it at a much later stage of the build process. Depending on the size of your project, you might detect the typo in your code action only after 10, 15, 20 seconds. With the way JavaCC21 does it, it hits a parse error right when it tries to parse the grammar file, which is typically the first thing the build does, so you would know about the error (and have the correct location) pretty much instantaneously.

The other point is that, in terms of generating Java code, this full parsing of code actions is already working. So, for Python to be presented as an "equal citizen" in the overall system, we have to also be able to parse Python code. (And I'm working on that!)

So I have always taken it as a given that we need to be able to parse blocks of code in whatever target languages we're supporting. Note, however, that other people have not reasoned like this. Not just the ANTLR people. Look here: https://github.com/javacc/javacc/blob/master/src/main/javacc/JavaCC.jj#L215-L242

This is the legacy project's approach to parsing non-Java code, i.e. C++ or C#. You can see where they use those two methods in that file, on lines 942, 2487, and 2609. So, yeah, it's true that both legacy JavaCC and ANTLR think that just matching tokens is a viable way of "parsing" code blocks, but I think that may just be a case of not-so-great minds thinking alike!
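
For what it's worth, the delimiter-matching approach both projects take amounts to something like the following sketch (illustrative only; note that it naively ignores braces inside string literals and comments, which a real implementation would at least have to special-case):

    def skip_code_action(text, start):
        # 'start' must index the '{' that opens the code action; returns
        # the index just past the matching '}'.
        assert text[start] == '{'
        depth = 0
        for i in range(start, len(text)):
            if text[i] == '{':
                depth += 1
            elif text[i] == '}':
                depth -= 1
                if depth == 0:
                    return i + 1
        raise ValueError('unbalanced braces in code action')
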

revusky commented 3 years ago

2. Users would have to duplicate code for these injections in different languages. This strikes me as user-hostile - I am pretty sure users would view such a requirement as lame!

Well, but let's be realistic for a moment. How many end-users of the tool have the requirement to generate a parser in multiple programming languages?

As you know, I'm working on a Python grammar, and I assume that, ultimately, it will be able to generate a parser (for Python) in Java and also in Python itself. But that is for our internal use, and in doing that, sure, we face problems that few end users face, because they typically only want to generate a parser in one language.

But, you know, this is another reason to do this, I mean that we can demonstrate the capability of having a maintainable grammar file that can be toggled to generate either a Java parser or a Python parser. I mean, that would be a very strong demonstration of good design, separation of concerns and so forth. But even so, it's hard to imagine that it won't be significantly more complex than just having a grammar file that generates a parser in one language.

So, basically, I see the problems, and I am up for attacking them. I think it will be challenging, but still manageable.

vsajip commented 3 years ago

Well, frankly, to me, that looks totally lame.

Yes, it is in a sense, but it's also pragmatic: the core goal - an easy way of generating parsers in multiple target languages - is achieved without the tool needing robust parsers for all those languages. Even JavaCC wouldn't catch semantic errors (e.g. a misnamed package in an injected import, or a misnamed field or method which looks syntactically OK), would it?

With the way JavaCC21 does it, it hits a parse error right when it tries to parse the grammar file, which is typically the first thing the build does, so you would know about the error (and have the correct location) pretty much instantaneously.

That is indeed useful, but it'll only catch syntax errors in snippets, right? Not semantic ones?

The other point is that, in terms of generating Java code, this full parsing of code actions is already working. So, for Python to be presented as an "equal citizen" in the overall system, we have to also be able to parse Python code. (And I'm working on that!)

Sure. But remember that's just one of the requirements - the question of how to provide injected functionality, and the ability to specify the language of a code snippet (and multiple snippets for a given site), still needs to be considered, however it is ultimately addressed.

both legacy JavaCC and ANTLR think that just matching tokens is a viable way of "parsing" code blocks, but I think that may just be a case of not-so-great minds thinking alike!

To be charitable, it may well be that they took a pragmatic view - "the best is the enemy of the good", and all that.

Well, but let's be realistic for a moment. How many end-users of the tool have the requirement to generate a parser in multiple programming languages?

Not many, but the time when one is considering supporting multiple languages - that's the time to consider whether this functionality could be provided, and at what cost in terms of complexity/technical debt/time to develop and all those usual trade-offs.

But, you know, this is another reason to do this, I mean that we can demonstrate the capability of having a maintainable grammar file that can be toggled to generate either a Java parser or a Python parser. I mean, that would be a very strong demonstration of good design, separation of concerns and so forth.

Yes, indeed, it validates the ideas behind the tool a lot!

But even so, it's hard to imagine that it won't be significantly more complex than just having a grammar file that generates a parser in one language.

Slightly more complex, perhaps, but not necessarily deal-breakingly so. It's worth trying it, IMO.

So, basically, I see the problems, and I am up for attacking them. I think it will be challenging, but still manageable.

OK, great! I will provide what support I can!

By the way, here's a Java production method from the current internal Java grammar, with the corresponding one I generated from my playing about:

    // Java.javacc:161:1
    final public void EmptyDeclaration() throws ParseException {
        if (trace_enabled) LOGGER.info("Entering production defined on line 161 of Java.javacc");
        if (cancelled) throw new CancellationException();
        String prevProduction= currentlyParsedProduction;
        this.currentlyParsedProduction= "EmptyDeclaration";
        if (pendingRecovery) {
            if (debugFaultTolerant) LOGGER.info("Re-synching to expansion at: Java.javacc:161:21");
            recover$EmptyDeclaration();
        }
        EmptyDeclaration EmptyDeclaration4= null;
        if (buildTree) {
            EmptyDeclaration4= new EmptyDeclaration();
            EmptyDeclaration4.setInputSource(getInputSource());
            openNodeScope(EmptyDeclaration4);
        }
        ParseException parseException254= null;
        int callStackSize255= parsingStack.size();
        try {
            if (false) throw new ParseException("Never happens!");
            // Code for RegexpStringLiteral specified at:
            // Java.javacc:161:21
            EnumSet<TokenType> followSet260= null;
            if (outerFollowSet!=null) {
                followSet260= follow_set$Java_javacc$161$21.clone();
                followSet260.addAll(outerFollowSet);
            }
            consumeToken(SEMICOLON, false, followSet260);
        }
        catch(ParseException e) {
            parseException254= e;
            if (isParserTolerant()) this.pendingRecovery= true;
            throw e;
        }
        finally {
            restoreCallStack(callStackSize255);
            if (EmptyDeclaration4!=null) {
                if (parseException254== null) {
                    closeNodeScope(EmptyDeclaration4, true);
                }
                else  {
                    if (trace_enabled) LOGGER.warning("ParseException: "+parseException254.getMessage());
                    closeNodeScope(EmptyDeclaration4, true);
                    EmptyDeclaration4.setDirty(true);
                }
            }
            this.currentlyParsedProduction= prevProduction;
        }
    }
    EmptyDeclaration_FIRST_SET = frozenset({
        TokenType.SEMICOLON
    })

    # Java.javacc:161:1
    def parse_EmptyDeclaration(self):
        if self.trace_enabled:
            logger.info('Entering production defined on line 161 of Java.javacc')
        if self.cancelled:
            raise CancellationException('operation cancelled')
        prev_production = self.currently_parsed_production
        self.currently_parsed_production = 'EmptyDeclaration'
        if self.pending_recovery:
            if self.debug_fault_tolerant:
                logger.info('Re-synching to expansion at: Java.javacc:161:21')
            self.recoverΣEmptyDeclaration()

        EmptyDeclaration4 = None
        if self.build_tree:
            EmptyDeclaration4 = EmptyDeclaration(self.input_source)
            self.open_node_scope(EmptyDeclaration4)
        pass

        parseException254 = None
        callStackSize255 = len(self.parsing_stack)
        try:
            pass  # in case there's nothing else in the try clause!
            # Code for RegexpStringLiteral specified at:
            # Java.javacc:161:21
            pass

            follow_set260 = None
            if self.outer_follow_set is not None:
                follow_set260 = set(follow_setΣJava_javaccΣ161Σ21)
                follow_set260.update(self.outer_follow_set)
            self.consume_token(TokenType.SEMICOLON, False, follow_set260)

        except ParseException as e:
            parseException254 = e
            if self.is_tolerant: self.pending_recovery = True
            raise
        finally:
            self.restore_call_stack(callStackSize255)
            if EmptyDeclaration4 is not None:
                if parseException254 is None:
                    self.close_node_scope(EmptyDeclaration4, True)
                else:
                    if self.trace_enabled:
                        logger.warning('ParseException: %s', parseException254)
                    self.close_node_scope(EmptyDeclaration4, True)
                    EmptyDeclaration4.dirty = True

            self.currently_parsed_production = prev_production

I hope you'll agree that it's not too shabby! The various spurious no-op pass statements are needed because the code generated at one point has no way of knowing whether anything else will be inserted into the same block, so it sticks in a pass to stop the block from being empty!
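
For context, the underlying constraint is that, unlike Java, Python rejects an empty suite at parse time, so any generated block that might end up with no statements needs at least a pass. A quick illustration:

    # Java compiles an empty block:   try { } finally { cleanup(); }
    # Python won't even parse one:
    #
    #     try:
    #     finally:        # IndentationError: expected an indented block
    #         cleanup()
    #
    # hence the generator pads every possibly-empty suite:
    try:
        pass  # in case nothing else is emitted into this block
    finally:
        print('cleaning up')
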

I also generated token lexical actions (generated OK here, but could be flaky on other inputs). First the Java version, then my Python version:

    private Token tokenLexicalActions(Token matchedToken, TokenType matchedType) {
        switch(matchedType) {
            case FORMAL_COMMENT_START:
                backup(1);
                break;
            default:
                break;
        }
        return matchedToken;
    }

    def token_lexical_actions(self, matched_token, matched_type):
        if matched_type == TokenType.FORMAL_COMMENT_START:
            self.backup(1)

        return matched_token
revusky commented 3 years ago

With the way JavaCC21 does it, it hits a parse error right when it tries to parse the grammar file, which is typically the first thing the build does, so you would know about the error (and have the correct location) pretty much instantaneously.

That is indeed useful, but it'll only catch syntax errors in snippets, right? Not semantic ones?

Well, yeah, that's true. A very classic error in Java coding, like neglecting to end a statement with a semicolon, that will be caught at parse-time, but misspelling a variable, like writing colour instead of color or vice versa will only be caught by the compiler. I guess that all that can be said about that is that it's better than nothing!

But the point is also that if we're going to have the capability to parse Python code anyway - for injections, for example - then we've already assumed that cost, and we're basically getting the ability to parse the code actions for free. Aside from that, I have to think that a robust Python parser is both (a) separately quite useful for other projects out there and (b) a significant integrated functional test of the overall system.

The main "advantage", as far as I can see, of not having the capability to parse python code is just that it's less work to do it that way!

both legacy JavaCC and ANTLR think that just matching tokens is a viable way of "parsing" code blocks, but I think that may just be a case of not-so-great minds thinking alike!

To be charitable, it may well be that they took a pragmatic view - "the best is the enemy of the good", and all that.

Well, whatever, but I guess what I would say about that is that there really is (at least for me, but really, for anybody with a... love of the craft...) a real satisfaction to be attained by really solving a problem correctly in a fundamental way -- as opposed to just something more like a kludge...

And the other thing is that, when you do really put in the machinery to solve a problem properly -- in this case specifically, parsing python code, but it could be something different in another case... -- this frequently leads to other things becoming "low-hanging fruit". Suddenly you see that you now have the tools to solve some other thing elegantly that you didn't before. Sometimes you even end up implementing appealing new features that hadn't even occurred to you before because you now see that it would be easy to do that. Something similar happens with code cleanup generally. Like, suppose you finally manage to refactor 1000 very messy lines of code into a couple of hundred lines that expresses the same thing more elegantly. Well, that's already directly beneficial, but frequently when you have a piece rewritten like that, you see very clearly that you could now take the thing in this direction or that direction, adding some nice feature... And these are things that, with the code in its previous messy state, wouldn't even occur to you.

All that said, I'm not some kind of total idealist. I understand that there are situations where one has very specific time/budget constraints and one cuts corners and so on. But in this case, we're mostly doing this for the pure creative satisfaction of doing it (I say "we" rather than "I" because I hope that's your position too!) so I think we should just really go for the full proper solution to things.

By the way, here's a Java production method from the current internal Java grammar, with the corresponding one I generated from my playing about:

As for this, well, I'm totally psyched! Based on what I'm seeing, I don't see why we couldn't have a usable PythonCC working in a couple of months even. Well, who knows. Depends on how much time you can devote to it. (It would take me vastly more time alone because, for one thing, I'd have to learn Python properly!) I will learn Python properly, of course, as a side effect of all this work, but having you on hand, I can do that faster as well!

And then the other thing is that, once we have two languages we target, it's surely less work to go from 2 to 3 than to go from 1 to 2. That's the same as with language learning. To learn French and Spanish, say, is nowhere near double the work that learning just one is. (I don't know if anybody has quantified it, but I suspect that learning both would be maybe 1.3 or 1.4x more work, not sure... but certainly not double.) It occurs to me that this depends on how closely related the languages are. If the example is Chinese and Arabic, then.... But in terms of computer languages, they are quite related, for the most part, so I would say the French+Spanish simile is more appropriate than Chinese+Arabic. There is a huge amount of overlap between the basic constructs that the various languages provide, so... And once we've figured out how to support 2 languages properly, going to 3 is just more incremental work mostly, because the really difficult issues have been resolved. Also, if I need to write a parser for whatever new language... well, I'm just getting better at it each time!

vsajip commented 3 years ago

The main "advantage", as far as I can see, of not having the capability to parse python code is just that it's less work to do it that way!

But I'm not arguing against the ability to fully parse Python code internally, far from it! Having proper Python grammar support would be great. I think you might have misunderstood me - I only mentioned curly-counting sort-of-in-passing because that's how ANTLR can support multiple languages with relatively little work. In the case where you need full-blown grammar support for a language before you can generate code in it, the upside is doing it properly (and harvesting any low-hanging fruit that falls out), and the downside is slower development and more effort. I'm certainly up for it - it's not like there are deadlines that have to be met!

revusky commented 3 years ago

I only mentioned curly-counting sort-of-in-passing because that's how ANTLR can support multiple languages with relatively little work

Well, to be absolutely clear about this, I certainly can conceive of the delimiter-counting hack as a stop-gap measure. I think I'll have full Python parsing capability fairly soon, but maybe I'm being over-optimistic. I could certainly conceive of a situation where you move forward on your end just using this as a stop-gap, but also basically knowing that the full parsing capability will be in place at some point in the future. I suppose that this is how people can work in parallel on some fairly large system in which the different pieces fall into place when they do and...

What I find hard to conceive of is these people's idea (apparently) that this delimiter counting is a viable long-term solution. As for that whole eatUpToRBrace method in the legacy JavaCC code, I was curious enough to trace back to when it was written. I think it goes back to 2013. And in the 8 years since then, there has been no ongoing effort to have a parser for either C# or C++.

To me, that just seems unserious. I guess that means that I'm a very serious guy, notwithstanding the fact that I do crack more jokes than most people in this milieu (and sometimes jokes that are in rather dubious taste...) but in terms of attacking these problems, I'm actually pretty serious! It seems to me that, in particular, when your project is a parser generator, you should not be afraid to write a parser here and there!

vsajip commented 3 years ago

Well, to be absolutely clear about this, I certainly can conceive of the delimiter-counting hack as a stop-gap measure.

Sure. It may allow progress on other fronts before a full parser is available, is all. It remains to be seen how exactly it will work for other languages than Java, since the Java grammar is currently integrated into JavaCC's. I presume there would have to be additional language-specific lexical/parsing states for each supported language so that switching in and out of actions would be seamless.
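
If that presumption is right, the shape of it might be something like this sketch - all names hypothetical, nothing like the actual JavaCC21 internals:

    # Hypothetical: a lexer that pushes a language-specific state when a
    # code action opens, so the action's tokens are matched by that
    # language's lexical rules, and pops it when the action closes.
    class GrammarLexer:
        def __init__(self):
            self.states = ['GRAMMAR']  # normal grammar-level tokenizing

        def enter_action(self, language):
            self.states.append(language + '_ACTION')  # e.g. 'PYTHON_ACTION'

        def leave_action(self):
            self.states.pop()

        @property
        def current_state(self):
            return self.states[-1]

    lexer = GrammarLexer()
    lexer.enter_action('PYTHON')
    print(lexer.current_state)  # PYTHON_ACTION
    lexer.leave_action()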