klauer / blark

Beckhoff TwinCAT ST (IEC 61131-3) code parsing in Python using Lark (Earley)
https://klauer.github.io/blark/
GNU General Public License v2.0

Suggestions on Implementing Custom XML Parser with `blark`'s Grammar #20

Closed engineerjoe440 closed 2 years ago

engineerjoe440 commented 2 years ago

Hi there!

First of all, WOW! This project is awesome!

I just happened to stumble across this project recently. I'm tinkering with a linter for SEL (Schweitzer Engineering Laboratories) RTAC projects. The SEL RTAC uses CODESYS, so there are some intricacies in the XML formatting that I need to handle. I had been working on my own Lark grammar until I found this work, and now it seems sensible to leverage the great work you've already produced and contribute where possible.

I've built logic that's capable of breaking down the RTAC XML into its constituent declaration and implementation sections (separately), and I'm wondering if you might be able to suggest a good way to interact with these components. For example, I have a section of implementation code as follows:

// Default return value to TRUE
SerializeJson := TRUE;

// Set to Root of Structure
Root();

SerializeJson := SerializeJson AND _serializedContent.Recycle();

// Set up Initial States for Indices
lastLevel := Current.LEVEL;
outerType := Current.JSON_TYPE;

I'm parsing it with a little Python, as follows:


import lark


def lark_61131_implementation(**kwargs):
    """Generate Lark Parser for 61131 Implementation Sections."""
    return lark.Lark.open(
        'iec.lark',
        rel_to=__file__,
        parser='earley',
        maybe_placeholders=True,
        propagate_positions=True,
        **kwargs
    )

...

parser = lark_61131_implementation()
print(parser.parse(interface).pretty()) # `interface` is a string of the aforementioned 61131

Ultimately, that gives me a nice Lark error:

Traceback (most recent call last):
  File "/home/joestan/Documents/selint/selint/core/lexer/__init__.py", line 155, in <module>
    print(parser.parse(interface).pretty())
  File "/home/joestan/.local/lib/python3.8/site-packages/lark/lark.py", line 625, in parse
    return self.parser.parse(text, start=start, on_error=on_error)
  File "/home/joestan/.local/lib/python3.8/site-packages/lark/parser_frontends.py", line 96, in parse
    return self.parser.parse(stream, chosen_start, **kw)
  File "/home/joestan/.local/lib/python3.8/site-packages/lark/parsers/earley.py", line 266, in parse
    to_scan = self._parse(lexer, columns, to_scan, start_symbol)
  File "/home/joestan/.local/lib/python3.8/site-packages/lark/parsers/xearley.py", line 146, in _parse
    to_scan = scan(i, to_scan)
  File "/home/joestan/.local/lib/python3.8/site-packages/lark/parsers/xearley.py", line 119, in scan
    raise UnexpectedCharacters(stream, i, text_line, text_column, {item.expect.name for item in to_scan},
lark.exceptions.UnexpectedCharacters: No terminal matches 'S' in the current parser context, at line 3 col 1

SerializeJson := TRUE;
^
Expected one of: 
        * PROGRAM
        * TYPE
        * PROPERTY
        * FUNCTION
        * FUNCTION_BLOCK
        * ACTION
        * METHOD
        * VAR_GLOBAL

Before I go diving back into the grammar, I thought it might behoove me to ask exactly how these strings are intended to be parsed according to the grammar file you've provided. Should I be combining declaration and implementation into a single block? Are there other concepts you think I might be overlooking?

Thank you so much for your time!

engineerjoe440 commented 2 years ago

I guess I should add: the error from Lark did lead me to believe that the full declaration needs to precede the implementation, but I wanted to get your (more knowledgeable) input.

Thank you!

klauer commented 2 years ago

Hi there @engineerjoe440,

Thanks for the nice comments! I've had fun with the project - maybe it can be of use to you, too.

Intended usage

Here's how blark is intended to be used at the moment. It assumes you can take your XML and translate it into a plain, non-XML structured text block like the following in the test suite: https://github.com/klauer/blark/blob/a4f241ac39cf19f216cd17b65445061916d0e869/blark/tests/test_transformer.py#L916-L924

That is, you can do the following:

In [1]: import blark
In [2]: blark.parse_source_code(
   ...:     """
   ...:                 PROGRAM ProgramName
   ...:                 VAR_INPUT
   ...:                     iValue : INT;
   ...:                 END_VAR
   ...:                 VAR_ACCESS
   ...:                     AccessName : SymbolicVariable : TypeName READ_WRITE;
   ...:                 END_VAR
   ...:                 iValue := iValue + 1;
   ...:             END_PROGRAM
   ...: """
   ...: )
Out[2]: SourceCode(items=[Program(name=Token('IDENTIFIER', 'ProgramName'), declarations=[InputDeclarations(attrs=None, items=[VariableOneInitDeclaration(variables=[DeclaredVariable(variable=SimpleVariable(name=Token('IDENTIFIER', 'iValue'), dereferenced=False), location=None)], init=TypeInitialization(indirection=None, spec=SimpleSpecification(type=Token('DOTTED_IDENTIFIER', 'INT')), value=None))]), AccessDeclarations(items=[AccessDeclaration(name=Token('IDENTIFIER', 'AccessName'), variable=SimpleVariable(name=Token('IDENTIFIER', 'SymbolicVariable'), dereferenced=False), type=DataType(indirection=None, type_name=Token('DOTTED_IDENTIFIER', 'TypeName')), direction=Token('READ_WRITE', 'READ_WRITE'))])], body=StatementList(statements=[AssignmentStatement(variables=[SimpleVariable(name=Token('IDENTIFIER', 'iValue'), dereferenced=False)], expression=BinaryOperation(left=SimpleVariable(name=Token('IDENTIFIER', 'iValue'), dereferenced=False), op=Token('ADD_OPERATOR', '+'), right=Integer(value=Token('INTEGER', '1'), type_name=None)))]))], filename=PosixPath('unknown'), raw_source='\n                PROGRAM ProgramName\n                VAR_INPUT\n                    iValue : INT;\n                END_VAR\n                VAR_ACCESS\n                    AccessName : SymbolicVariable : TypeName READ_WRITE;\n                END_VAR\n                iValue := iValue + 1;\n            END_PROGRAM\n')
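
The XML-to-plain-text step assumed above is not shown in that session. A minimal sketch of it, using only the standard library, might look like the following. The element names `Declaration` and `Implementation`, the sample document, and the helper `xml_to_structured_text` are all hypothetical placeholders; real TwinCAT or RTAC exports use their own vendor-specific layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, for illustration only; real exports use
# vendor-specific element names and nesting.
SAMPLE_XML = """\
<POU Name="ProgramName">
  <Declaration><![CDATA[PROGRAM ProgramName
VAR_INPUT
    iValue : INT;
END_VAR]]></Declaration>
  <Implementation><![CDATA[iValue := iValue + 1;]]></Implementation>
</POU>
"""


def xml_to_structured_text(xml_text: str) -> str:
    """Join the declaration and implementation text into one ST block."""
    root = ET.fromstring(xml_text)
    declaration = (root.findtext("Declaration") or "").strip()
    implementation = (root.findtext("Implementation") or "").strip()
    # Assumption: the declaration opens a PROGRAM, so close it with END_PROGRAM.
    return "\n".join([declaration, implementation, "END_PROGRAM"]) + "\n"


source = xml_to_structured_text(SAMPLE_XML)
print(source)
```

The resulting string could then be handed to `blark.parse_source_code`, as in the session above.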

With your code example

For your specific code example, you would have to format it in, say, a FUNCTION_BLOCK / END_FUNCTION_BLOCK to be compliant with the top-level grammar provided by blark:

In [1]: import blark

In [2]: blark.parse_source_code(
   ...:     """
   ...: FUNCTION_BLOCK test
   ...: // Default return value to TRUE
   ...: SerializeJson := TRUE;
   ...:
   ...: // Set to Root of Structure
   ...: Root();
   ...:
   ...: SerializeJson := SerializeJson AND _serializedContent.Recycle();
   ...:
   ...: // Set up Initial States for Indices
   ...: lastLevel := Current.LEVEL;
   ...: outerType := Current.JSON_TYPE;
   ...: END_FUNCTION_BLOCK
   ...: """
   ...: )
Out[2]: SourceCode(items=[FunctionBlock(name=Token('IDENTIFIER', 'test'), abstract=False, extends=None, implements=None, declarations=[], body=StatementList(statements=[AssignmentStatement(variables=[SimpleVariable(name=Token('IDENTIFIER', 'SerializeJson'), dereferenced=False)], expression=SimpleVariable(name=Token('IDENTIFIER', 'TRUE'), dereferenced=False)), MethodStatement(method=SimpleVariable(name=Token('IDENTIFIER', 'Root'), dereferenced=False)), AssignmentStatement(variables=[SimpleVariable(name=Token('IDENTIFIER', 'SerializeJson'), dereferenced=False)], expression=BinaryOperation(left=SimpleVariable(name=Token('IDENTIFIER', 'SerializeJson'), dereferenced=False), op=Token('LOGICAL_AND', 'AND'), right=FunctionCall(name=MultiElementVariable(name=SimpleVariable(name=Token('IDENTIFIER', '_serializedContent'), dereferenced=False), dereferenced=False, elements=[FieldSelector(field=SimpleVariable(name=Token('IDENTIFIER', 'Recycle'), dereferenced=False), dereferenced=False)]), parameters=[None]))), AssignmentStatement(variables=[SimpleVariable(name=Token('IDENTIFIER', 'lastLevel'), dereferenced=False)], expression=MultiElementVariable(name=SimpleVariable(name=Token('IDENTIFIER', 'Current'), dereferenced=False), dereferenced=False, elements=[FieldSelector(field=SimpleVariable(name=Token('IDENTIFIER', 'LEVEL'), dereferenced=False), dereferenced=False)])), AssignmentStatement(variables=[SimpleVariable(name=Token('IDENTIFIER', 'outerType'), dereferenced=False)], expression=MultiElementVariable(name=SimpleVariable(name=Token('IDENTIFIER', 'Current'), dereferenced=False), dereferenced=False, elements=[FieldSelector(field=SimpleVariable(name=Token('IDENTIFIER', 'JSON_TYPE'), dereferenced=False), dereferenced=False)]))]))], filename=PosixPath('unknown'), raw_source='\nFUNCTION_BLOCK test\n// Default return value to TRUE\nSerializeJson := TRUE;\n\n// Set to Root of Structure\nRoot();\n\nSerializeJson := SerializeJson AND _serializedContent.Recycle();\n\n// Set up Initial States for Indices\nlastLevel := Current.LEVEL;\nouterType := Current.JSON_TYPE;\nEND_FUNCTION_BLOCK\n')

In [3]: func = _2.items[0]   # _2 is the output above; pick out the FunctionBlock instance

In [4]: print(func)
FUNCTION_BLOCK test
    // Default return value to TRUE
    SerializeJson := TRUE;
    // Set to Root of Structure
    Root();
    SerializeJson := SerializeJson AND _serializedContent.Recycle(None);
    // Set up Initial States for Indices
    lastLevel := Current.LEVEL;
    outerType := Current.JSON_TYPE;
END_FUNCTION_BLOCK

In [5]: summary = blark.summary.CodeSummary.from_source(_2)
# The summary object then has some easier information to poke around with

If you have variable declarations, you would just add them after the FUNCTION_BLOCK line, ensuring each block starts with VAR (or VAR_INPUT, etc.) and ends with END_VAR.
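
A minimal sketch of that layout, assuming the declaration and implementation sections arrive as two separate strings. `build_function_block` is a hypothetical helper, not part of blark, and the variable declaration below is invented for illustration:

```python
def build_function_block(name: str, declarations: str, implementation: str) -> str:
    """Wrap separate declaration and implementation text in one FUNCTION_BLOCK."""
    parts = [f"FUNCTION_BLOCK {name}"]
    if declarations.strip():
        # Declarations (VAR ... END_VAR) go right after the FUNCTION_BLOCK line.
        parts.append(declarations.strip())
    parts.append(implementation.strip())
    parts.append("END_FUNCTION_BLOCK")
    return "\n".join(parts) + "\n"


# The variable name and type below are guesses made for illustration only.
declarations = """
VAR
    lastLevel : INT;
END_VAR
"""
implementation = """
// Default return value to TRUE
SerializeJson := TRUE;
lastLevel := Current.LEVEL;
"""

source = build_function_block("test", declarations, implementation)
print(source)
```

The combined string has the same shape as the input passed to `blark.parse_source_code` in the session above, with the VAR block placed between the header and the body.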

Why not XML instead?

A bit of an aside: there's a historical reason why I prefer plain code over the vendor-specific XML or even PLCopen XML:

OK, but I just want to use your grammar...

If the Python dataclasses aren't useful to you, or you want to roll your own, by all means have fun with it! Do keep in mind it's a lot of work (3k+ lines of relatively succinct code) to handle the entire language as-defined here. I'd be curious to hear what you'd want to see out of it to make it more useful, though.

You can tell lark to start parsing from a specific grammar rule instead of the top-level one. So if you know that you have just a raw function block body (as above), you could use:

In [1]: import blark

In [2]: parser = blark.parse.new_parser(start="function_block_body")

In [3]: parser.parse(
    ...:     """// Default return value to TRUE
    ...: SerializeJson := TRUE;
    ...:
    ...: // Set to Root of Structure
    ...: Root();
    ...:
    ...: SerializeJson := SerializeJson AND _serializedContent.Recycle();
    ...:
    ...: // Set up Initial States for Indices
    ...: lastLevel := Current.LEVEL;
    ...: outerType := Current.JSON_TYPE;
    ...: """
    ...: )
Out[3]: Tree(...)

Anyway, I'm happy to chat about this sort of stuff as it's a fun side project for me. Let me know if you have thoughts on stuff to add/change/fix/etc.

engineerjoe440 commented 2 years ago

@klauer, thank you so much for this WONDERFUL explanation! This is terrific, and gives me so much information! I'm going to spend some time and play with this. I might have additional questions, but at this point, I feel comfortable poking around until I can ask something that's direct.

I really enjoy hearing your thought process, and I think it's very well put! What's so exciting is that it means that your working system will likely be functional for various vendor implementations, like the SEL one I'm trying to apply it to; thank you!