hansemro / python-bsdl-parser

A Grako-based parser for IEEE 1149.1 Boundary-Scan Description Language (BSDL) files

Exploring better ways to handle string concatenation in string-based rules #3

Open hansemro opened 1 year ago

hansemro commented 1 year ago

Issue:

Many string-based rules in bsdl.ebnf (rules whose content is enclosed in quotes) cannot handle string concatenation (&) between tokens.

This is a consequence of many BSDL attributes being strings...

Example:

We can cause parsing failures by inserting string concatenation at points where the string rules in bsdl.ebnf do not expect it.

INSTRUCTION_OPCODE

The Grako rules for INSTRUCTION_OPCODE are as follows:

instruction_opcode_stmt = "attribute" "INSTRUCTION_OPCODE" "of" component_name
    colon "entity" "is" @:opcode_table_string semicolon ;
opcode_table_string = "&".{quote (comma).{[@+:opcode_description]} quote} ;
opcode_description = instruction_name:instruction_name left_paren opcode_list:opcode_list right_paren ;
opcode_list = (comma).{[@+:opcode]} ;
opcode = pattern ;

Looking at opcode_description, we see that it does not expect string concatenation to occur between instruction_name and the left parenthesis. So the following would not work in the above rules:

attribute INSTRUCTION_OPCODE of A : entity is
    "IDCODE" &
    "(101001)";

Additionally, opcode_list does not expect string concatenation between opcode patterns. So the following would not work either:

attribute INSTRUCTION_OPCODE of A : entity is
    "IDCODE (101001," &
    "010110)";
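To make explicit why both inputs ought to be accepted, here is a minimal Python sketch. It is independent of the Grako grammar, and join_segments is a hypothetical helper, not something that exists in this repository. It only shows that each example collapses to one well-formed opcode description once the & operators are resolved; it is not a proposed fix.

```python
import re

# Hypothetical helper (not part of python-bsdl-parser): collapse BSDL string
# concatenation so that `"a" & "b"` reads as the single string `ab`.
SEGMENT = re.compile(r'"([^"]*)"')


def join_segments(text: str) -> str:
    """Concatenate the contents of every quoted segment found in `text`."""
    return "".join(SEGMENT.findall(text))


# The two attribute values from the examples above.
first = '"IDCODE" &\n    "(101001)"'
second = '"IDCODE (101001," &\n    "010110)"'

print(join_segments(first))   # IDCODE(101001)
print(join_segments(second))  # IDCODE (101001,010110)
```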

Proposed Solution:

For any string-based rules with more than one token, expect string concatenation to happen between tokens.

Treat any expression element that appears between quote tokens as a string-based rule. In the INSTRUCTION_OPCODE example, opcode_description and opcode_list should both be considered string-based rules, since opcode_table_string encloses them in quotes.

Example solution for INSTRUCTION_OPCODE:

First, let's recognize string concatenation as the pattern " & " (a closing quote, an ampersand, and an opening quote) and call it end_and_start:

end_and_start = quote '&' quote ;

With this new rule, we can match repeated concatenation of empty strings between tokens ("token" & "" & "" & "" & "token") with token {end_and_start} token. (As you will see below, there are many uses for this rule.)
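As a rough illustration of token {end_and_start} token, the idea can be sketched with a plain Python regex. This is an analogue only, not how Grako evaluates the rule: the explicit whitespace handling is an assumption made here, and IDCODE and ( merely stand in for the two tokens.

```python
import re

# Regex analogue of `end_and_start = quote '&' quote`, allowing whitespace
# between the three tokens (an assumption of this sketch; in the grammar,
# whitespace is skipped by the parser itself).
END_AND_START = r'"\s*&\s*"'

# token {end_and_start} token, with IDCODE and ( standing in as the tokens.
PATTERN = re.compile(rf'IDCODE\s*(?:{END_AND_START}\s*)*\(')

samples = [
    'IDCODE(',                 # no concatenation at all
    'IDCODE" & "(',            # a single concatenation
    'IDCODE" & "" & "" & "(',  # repeated empty-string concatenation
    'IDCODE & (',              # ampersand without quotes: must be rejected
]
matches = [bool(PATTERN.fullmatch(s)) for s in samples]
for sample, ok in zip(samples, matches):
    print(repr(sample), ok)
```

The first three samples match and the last one does not, because an ampersand that is not wrapped by a closing and an opening quote is not string concatenation.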

Then, for each rule, handle how and where string concatenation may occur as shown below:

instruction_opcode_stmt = "attribute" "INSTRUCTION_OPCODE" "of" component_name
    colon "entity" "is" @:opcode_table_string semicolon ;
opcode_table_string = "&".{
    quote
    {end_and_start}
    ({end_and_start} comma {end_and_start}).{
        [@+:opcode_description]
    }
    {end_and_start}
    quote} ;
opcode_description = instruction_name:instruction_name
    {end_and_start}
    left_paren
    {end_and_start}
    opcode_list:opcode_list
    {end_and_start}
    right_paren ;
opcode_list = ({end_and_start} comma {end_and_start}).{[@+:opcode]} ;
opcode = pattern ;

As a bonus, by handling the possible presence of string concatenation, we no longer need gather-optional expressions and can drop the square brackets. This helps improve TatSu compatibility (#2).

...
opcode_table_string = "&".{
    quote
    {end_and_start}
    ({end_and_start} comma {end_and_start}).{
        @+:opcode_description
    }
    {end_and_start}
    quote} ;
...
opcode_list = ({end_and_start} comma {end_and_start}).{@+:opcode} ;
...
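For readers who want to experiment without Grako or TatSu installed, the rewritten rules can be mirrored by a small hand-written parser. The sketch below is an assumption-laden illustration, not the generated parser: OpcodeTableParser, tokenize, and skip_joins are hypothetical names, opcode patterns are simplified to alphanumeric words, at least one entry is required in each list, and only the quoted attribute value is parsed (not the surrounding attribute statement).

```python
import re

WORD = r'[A-Za-z0-9_]+'
TOKEN = re.compile(rf'\s*({WORD}|[",()&])')


def tokenize(text):
    """Split the quoted attribute value into words and punctuation tokens."""
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class OpcodeTableParser:
    """Hypothetical mirror of opcode_table_string with end_and_start handling."""

    def __init__(self, text):
        self.toks, self.i = tokenize(text), 0

    def peek(self, offset=0):
        j = self.i + offset
        return self.toks[j] if j < len(self.toks) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise ValueError(f"expected {tok!r}, got {self.peek()!r}")
        self.i += 1

    def skip_joins(self):
        # {end_and_start}: zero or more `" & "` token sequences.
        while (self.peek(), self.peek(1), self.peek(2)) == ('"', '&', '"'):
            self.i += 3

    def word(self):
        tok = self.peek()
        if tok is None or not re.fullmatch(WORD, tok):
            raise ValueError(f"expected a name or pattern, got {tok!r}")
        self.i += 1
        return tok

    def separated(self, item):
        # ({end_and_start} comma {end_and_start}).{item}, one or more items.
        items = [item()]
        self.skip_joins()
        while self.peek() == ',':
            self.i += 1
            self.skip_joins()
            items.append(item())
            self.skip_joins()
        return items

    def opcode_description(self):
        name = self.word()
        self.skip_joins()
        self.expect('(')
        self.skip_joins()
        opcodes = self.separated(self.word)
        self.expect(')')
        return {"instruction_name": name, "opcode_list": opcodes}

    def parse(self):
        self.expect('"')
        self.skip_joins()
        table = self.separated(self.opcode_description)
        self.expect('"')
        if self.peek() is not None:
            raise ValueError(f"trailing input: {self.peek()!r}")
        return table


print(OpcodeTableParser('"IDCODE" &\n    "(101001)"').parse())
print(OpcodeTableParser('"IDCODE (101001," &\n    "010110)"').parse())
```

Both inputs that the original rules reject are accepted here, because every position between tokens tolerates zero or more end_and_start sequences.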

Relevant Branches:

Tasks:

hansemro commented 1 year ago

string rule:

Currently, the string literal rule allows optional quotation marks, which is invalid since quotation marks must be present for each string component. We can fix this by moving the quotes outside the square brackets as shown below:

string = "&".{quote [@:?/[A-Za-z0-9\&'\(\)\[\]\*\,\-\+\.\:\;\<\=\>\_\/\t ]+/?] quote} ;

This no longer allows invalid string constructions like the following: