Allow semantics to be defined in Python

StoneyJackson commented 1 year ago

From @fosler

"Re-write plcc.py and the files in the Std directory so that it builds an interpreter with Python3 as the implementation language. We could then call it plccp, with the Java version called plccj."

High-level Design

We extend PLCC to allow an optional 4th section is grammar files. When 4 sections are present there purposes are as follows:

[lexical specification]
%
[syntactic specification in BNF]
%
[additional syntactic rules in Java]
%
[semantic specification in Python]

When four sections are detected, PLCC operates in a new hybrid Java-Python mode. In this mode, PLCC generates the following Java code:

A class hierarchy to represent parse nodes, which includes the recursive descent logic for parsing programs in the defined language, and receives Java snippets from the 3rd section of the grammar.
Other Java files defined in the 3rd section of the grammar.
A Scan class to interact with the scanner.
A Parse class to interact with the parser.
[new] A ParseJSONAast to serialize parse trees in Java-objects into JSON.

PLCC also generates the following Python code:

A class hierarchy to represent the parse nodes, which also receives Python snippets from the 4th section of the grammar.
Code to de-serialize ASTs encoded in JSON into python objects.
A Rep class to evaluate programs written in the defined language.

Architecture Diagram

arch

@startuml
file "grammar.py.plcc" as spec
control plcc.py
file source
package Java {
  control Scan
  control Parse
  component "Class hierarchy for parse tree" as TreeJ {
    component parser
    component syntax_checking
  }
  file Tokens
}
package Python {
  control rep.py
  component "Class hierarchy for parse tree" as TreeP {
    component semantics
  }
}
file "tree.json" as TreeJSON

source -> Scan
Scan .> Tokens
Tokens -> Parse
Parse .> TreeJSON
TreeJSON -> rep.py
TreeP --> rep.py

spec -> plcc.py
plcc.py ..> Java
plcc.py ..> Python

Java -[hidden]> TreeJSON

TreeJ --> Parse

@enduml

Pro-Con of Architecture

Pros

Java and Python support are maintained together and remain in-synch.
scan and parse are shared between the Java and Python (because they are really Java)
- Behavior is identical in either world.
- Both worlds benefit from bug fixes and improvements to these components.
- Save work reinventing these components.

Cons

Changes to scan and parse become more challenging. Mitigations:
- Automated tests for both worlds
- Decompose plcc into more independent components with language-agnostic, well-defined interfaces; e.g., scan < source | parse | rep and plcc.py itself. These can be slowly introduced over time, as/if needed.
(others?)

Working Plan

The current idea is to evolve the existing plcc system to support Python semantics.

1. Generate JSON AST [DONE; thank you @Rarity-Belle and @WilliamBowery!!]

The idea is to have Parse.java accept a new option that will cause it to print an AST in a JSON format. This is a new feature that will not break existing code. Why JSON? Python comes with a library that can parse JSON into a native data structure (nested dicts and lists and primitive types).

2. Add new semantics section to grammar files

Add an optional 4th section to grammar files. The 4th section will be populated by Python semantics. When a grammar files is written with four section, its structure is as follows:

lexical specification
%
syntactic specification
%
syntactic checks/helpers written in Java
%
semantics written in Python

Existing 3-section grammars should function normally.

When a 4th section is detected (even when empty), it will generate the Java code needed to generate JSON ASTs. Note that the 4rd section could also be empty, and it would behave the same.

3. Generate a parallel Python class hierarchy

plcc.py currently generates a Java class hierarchy of parse node types based on the BNF rules in the syntactic specification of a given grammar file. It also generates a recursive decent parsing algorithm that is implemented across this hierarchy in static methods. plcc.py also injects the semantics code from the semantics specification of a grammar file into these classes. When the parser is ran on a program, the parser instantiates objects from these classes and connects them to form a parse tree.

Now, when the 4th section is present, we need plcc.py to also generate a parallel class hierarchy in Python. It does not need code to parse the original program. But it does need to support construction of a parse tree of object given an AST in JSON.

4. Load JSON AST in Python

Write a new python function load_json_ast(json) that takes a JSON AST as a string and returns an object tree such that the nodes are appropriate instances of the parallel class hierarchy and correctly connected. This function is generated when there is a 4th section present.

5. Inject semantics into Python

In this step, we inject semantics from the 4th section of the grammar file into the Python class hierarchy.

6. Write a Python REP loop

In this last step, we write a Python version of the REP loop that uses everything that came before to run a program in a language that was defined using Python semantics.

(based on an email between @StoneyJackson @jashelio @fosler)

When we are done with the Python stuff, the Python stuff should contain a REP loop with the following loop body:

Read a program as a string.
Calls ParseJsonAst to parse that string program into a JSON AST.
Deserializes that JSON AST into Python objects.
Calls "$run()" on the root object.
Prints the result.

EDIT: [Jan 22, 2024] Reorder the Working Plan, eliminating the need for a new --python option. EDIT: [Feb 15, 2024] Add Rep loop algorithm.

StoneyJackson commented 1 year ago

@fosler @jashelio (this is on a GitHub issue; and replying to it will add your response to the issue)

Question 1: Should plccp be a separate new system based on the existing plccj? Or should plccp and plccj share components (e.g., scanner)?

Part of me likes the idea of shared reusable components, but that would require a redesign of the existing system. For example, if we want a reusable scanner, it would probably become a separate command-line application that takes a stream of characters on standard input and produces stream of tokens on standard out. If we want to be reuse purists, we could even make a separate parser tool that takes that stream of tokens and produces a parse tree on standard out in a format like JSON or YAML (basically a language-agnostic internal representation of the parse tree). Last, each "semantics analyzer" (Java and Python) reads this parse tree and constructs the corresponding object structure in then kicks of the semantic analysis. Although this appeals to the software designer in me, it would be a lot of work.

Building a separate system (plccp) based on the current (plcc/plccj) is probably the fastest way to get what we want. Basically, find all the Java and replace it with equivalent Python. I'm sure that's an oversimplification and there will be devils in the details. But I think this would lead to two systems that are reasonably maintainable because they are isolated from each other.

So I think I'm leaning to separate systems, and I think that's what @fosler was describing. What do others think?

Question 2: assuming "separate" is the answer to the previous, then should we create a new repository for plccp?

I'm thinking yes. This would allow separate version numbers for the two; allowing development to move at separate paces.

It is possible to many two projects with different version numbers in the same repository. But it's more complicated, and not something I've personally done.

StoneyJackson commented 1 year ago

Orient ourselves
1. Roughly understand src/plcc.py
2. Try running plcc again, generate a lexer, look at code generated.
Lexer/scanner/scan
1. Unit tests for existing system (pinning/characterization tests)
2. Start with existing src/plcc.py, modify to generate python scanner.
3. Update src/scan to run the scan.py
Parser/bnf/parse
Semantics/rep

To Do

[x] SJ: configure gitpod.io
[x] RV, MM: Create gitpod.io accounts
[x] RV, MM: review src/plcc.py (the lex part)
[x] RV, MM: review src/Std/Scan.java
[ ] RV, MM: maybe try existing plcc on gitpod.io

StoneyJackson commented 1 year ago

The goal is to let students write semantics in Python rather than in Java. The code fragments in the semantics section do not interact with the scanner or parser directly. They only interact with parse tree nodes and their contents. So theoretically, we might be able to reuse PLCC's scanner and parser, and the only parts we MUST write those parts that generate classes based on the BNF rules and the semantics section.

So here is my crazy idea...

When plccpmk (note the p in the middle) runs, it creates a new grammar file that does not have the semantics section and runs plccmk on that file. This generates all the Java stuff including Scan.java and Parse.java. Students can interact with scan and parse as normal, and they (and plccmk) behave as they did before.

After running plccmk, plccpmk then generates Python classes from the grammar rules, and provides a special rep.py. This rep.py uses Parse.java with it's trace option to parse input strings into a parse tree. rep.py parses the text-based tree produced by Parse.java and constructs the parse tree as python objects, then evaluates the tree.

The advantages of this approach are:

Minimizes the amount of work needed to be done to achieve our goal.
The behavior of the scanner and parser are identical to the original plcc because they are the original.

The disadvantages of this approach are:

It adds complexity.
plccp would depend on plcc and especially the output of Parse.java in trace mode. So making changes to plcc and/or the output of Parse.java make break plccp.
Java is still needed, which would not be the case if we made a pure Python implementation.

StoneyJackson commented 12 months ago

DECIDED: We should augment plcc rather than build a separate system.

StoneyJackson commented 8 months ago

Consider a reduced scope implementation of Python hierarchy. For example, only handle ::= with unique LHS and only tokens on the RHS.

StoneyJackson commented 7 months ago

In the new hybrid Java-Python mode, with a 4-section grammar... I'm not clear on how and when code in the 3rd section is executed.

In a 4-section grammar, the 3rd section contains Java code that express syntactic rules that cannot be expressed, or cannot elegantly be expressed, in PLCC's BNF. But how and when do these fragments get executed?

In a 3-section grammar, one overrides $run() (or toString()) in the class of the start symbol and call whatever they want. $run() is called on the start symbol by Rep. So if you want to run additional syntactic checks before evaluating the parse tree for its semantics, you instrument $run() to call your syntactic checks before calling your semantics evaluating code.

But in a 4-section grammar, the parser does NOT have a hook for running additional syntactic checks. Do we need to add a new one (e.g., $check()), or do we use the same one $run()? In either case, I assume we need to add a call to it from Parse. Adding a new hook might make it easier to remain backwards compatible. There would just be a new method that does nothing by default, and is always called by Parse. If we repurpose $run() for checks, then language designers don't have to learn/remember to a new symbol when moving to 4-section grammars ("always use $run()"); but code would have to detect when to call $run() from Parse and when it shouldn't.

Thoughts?

/cc @jashelio @fosler

jashelio commented 7 months ago

I have struggled to wrap my mind around what the “Spring 2024 project” version does. But my understanding is that (1) PLCC still generates Java classes, and (2) when the resulting Java code (containing scanner and parser) is run against code in the defined language, Java objects representing the AST are still created. But then, (3) you instruct that tree to build a replica of itself in JSON. (4) Python code then uses the JSON spec to rebuild the ADT in the Python environment, which was partially generated by PLCC and partially by the semantic code in the fourth section of the grammar file. Finally (5) the Python code is run and the user’s program written in the defined language “executes”, whatever that means.So if you are writing “syntax checks” in Java, that tells me that these checks are in the constructors of the original Java code classes built in step (1). Here is an example from our modified V6 language in ourPLCC/languages. (I had to screenshot it because my iPad wouldn’t let me copy and paste text from the GitHub window.)I personally am totally happy with this approach. Therefore, I am willing to leave things alone for now until we have a full Python-based pipeline, where the problem would solve itself.However, if folks don’t like the “:init” way of doing these checks, then my second choice would be the $check() solution.Jim

[edit: removed copy of previous messages]

StoneyJackson commented 7 months ago

@jashelio Your screenshot didn't come through. You should be able to drag and drop screenshots into a github comment. Although this might not be easy on ipod.

Your suggestion of using constructors sounds familiar now. Thanks for reminding me. So we could write helpers as normal and then hook them in using Class:init. I think I understand this approach. And now we have captured it in this issue. So maybe I'll remember next time :).

StoneyJackson commented 7 months ago

Another issue for our meeting on Friday...

How will "includes" work with 4 sections? If someone puts an include in the 3rd section (Java), that's different than if they put it in the 4th section (Python). I/we, need to think about this.

StoneyJackson commented 7 months ago

In our meeting, we talked about include. Here is a quick summary:

In the current system, there are two includes: the classic include and the #include.
The classic include is only allowed in the semantics section.
include is allowed in any section.
The difference between the classic include and the #include is that the #include immediately reads in the include file at the location of the #include; whereas the classic include queues the named file and processes this queue after processing the semantics section as an extension of the semantics section. The trouble is, the most obvious way of implementing 4-section grammars leads to classic includes being processed as an extension of the 4th section (which is Python, not Java).
Until we figure out a nice solution for classic includes, we suggest that the 4-section effort continue by assuming that no classic includes exist, and only #includes are used.

StoneyJackson commented 7 months ago

Moving @jashelio comment from https://github.com/ourPLCC/plcc/pull/96 to https://github.com/ourPLCC/plcc/issues/52.

Perhaps I can be accused of being to much of an abstractionist, but here goes.

I feel that the following idea

If there are three sections, then there is one section with all pre-execution checks and semantics in Java. If there are four sections, then there is a section in Java with pre-execution checks and another section in Python with semantics.

is short-sighted. I personally would prefer that we think in the following language-independent way.

If there are three sections, then there is one section with all pre-execution checks and semantics. If there are four sections, then there is a section with pre-execution checks and another section with semantics.

And that we actually have options on the command line to say what languages are being used. Eventually we could change the section markers in the grammar file to say

% Java or % Python

to get rid of those pesky command line options.

Just a thought, Jim

StoneyJackson commented 7 months ago

PRs are intentionally short-sighted. It helps ensure their success by avoiding scope creep and the moving target problem. I want them to be small and focused. Hopefully each PR is a step in the right direction. Whether or not they are in the right direction, we learn from them, update our design and plans here, and then we try to take another small step.

This is the way.

The Mandalorian

ourPLCC / plcc