Support for easily defining subrules with optional parameters

vxgmichel commented 3 years ago

Use case

Here's a simplified version of the problem I ran into while attempting to define some ARM instructions in customasm. Let's say I have 3 opcodes A, B and C that I can tweak using a modifier corresponding to a condition, say X, Y and Z. Those two elements form an instruction when concatenated, e.g AY or CZ. When the condition is omitted, X is assumed (meaning A, B and C are valid instructions). Then a full instruction is crafted by combining the instruction defined above with an 8 bit integer (e.g AY 51).

Here's a working implementation in customasm:

#subruledef opcode
{
        A => 0x1
        B => 0x2
        C => 0x3
}

#subruledef condition
{
        X => 0xa
        Y => 0xb
        Z => 0xc
}

#subruledef instruction
{
        A  => 0x1a ; Default to condition X
        B  => 0x2a ; Default to condition X
        C  => 0x3a ; Default to condition X
        AX => 0x1a
        AY => 0x1b
        AZ => 0x1c
        BX => 0x2a
        BY => 0x2b
        BZ => 0x2c
        CX => 0x3a
        CY => 0x3b
        CZ => 0x3c
}

#ruledef
{
        {ins: instruction} {val: u8} => ins @ val
}

A 51  ; OK: 1a 33
B 51  ; OK: 2a 33
C 51  ; OK: 3a 33
AX 51 ; OK: 1a 33
AY 51 ; OK: 1b 33
AZ 51 ; OK: 1c 33
BX 51 ; OK: 2a 33
BY 51 ; OK: 2b 33
BZ 51 ; OK: 2c 33
CX 51 ; OK: 3a 33
CY 51 ; OK: 3b 33
CZ 51 ; OK: 3c 33

The problem here is that all instructions have been crafted manually, which can be tedious and error prone. Also, it can quickly grow much bigger with more realistic instruction sets.

Attempt 1 - Use an empty pattern

In order to reduce the amount of redundancy, I first tried to factorize the logic using the opcode and condition subrules. I naively used an empty pattern to bind a missing condition to X, although it is not supported:

#subruledef opcode
{
        A => 0x1
        B => 0x2
        C => 0x3
}

#subruledef condition
{
        X => 0xa
        Y => 0xb
        Z => 0xc
        _ => 0xa ; Default to X
}

#subruledef instruction
{
        {opc: opcode}{cnd: condition} => opc @ cnd
}

#ruledef
{
        {ins: instruction} {val: u8} => ins @ val
}

A 51  ; error: no match for instruction found
B 51  ; error: no match for instruction found
C 51  ; error: no match for instruction found
AX 51 ; error: no match for instruction found
AY 51 ; error: no match for instruction found
AZ 51 ; error: no match for instruction found
BX 51 ; error: no match for instruction found
BY 51 ; error: no match for instruction found
BZ 51 ; error: no match for instruction found
CX 51 ; error: no match for instruction found
CY 51 ; error: no match for instruction found
CZ 51 ; error: no match for instruction found

This doesn't work but I think it gives a good idea of what a straight forward definition for this use case could look like.

Attempt 2 - Distinct instruction subrules

In an attempt to work around the lack of empty pattern support, I simply split the instructions into two distinct rules:

#subruledef opcode
{
        A => 0x1
        B => 0x2
        C => 0x3
}

#subruledef condition
{
        X => 0xa
        Y => 0xb
        Z => 0xc
}

#subruledef instruction
{
        {opc: opcode}                 => opc @ 0xa ; Default to X
        {opc: opcode}{cnd: condition} => opc @ cnd
}

#ruledef
{
        {ins: instruction} {val: u8} => ins @ val
}

A 51  ; OK: 1a 33
B 51  ; OK: 2a 33
C 51  ; OK: 3a 33
AX 51 ; error: no match for instruction found
AY 51 ; error: no match for instruction found
AZ 51 ; error: no match for instruction found
BX 51 ; error: no match for instruction found
BY 51 ; error: no match for instruction found
BZ 51 ; error: no match for instruction found
CX 51 ; error: no match for instruction found
CY 51 ; error: no match for instruction found
CZ 51 ; error: no match for instruction found

This would still be a nice way of factorizing the logic, except it doesn't work either (at least for the conditional instructions).

Attempt 3 - Distinct instruction subrules with `-` separators

My third attempt has been to introduce extra separators as I noticed empirically that then can make a difference as how the instructions are parsed. In particular, I simply added a - character between the opcode and the condition:

#subruledef opcode
{
        A => 0x1
        B => 0x2
        C => 0x3
}

#subruledef condition
{
        X => 0xa
        Y => 0xb
        Z => 0xc
}

#subruledef instruction
{
        {opc: opcode}                  => opc @ 0xa ; Default to X
        {opc: opcode}-{cnd: condition} => opc @ cnd
}

#ruledef
{
        {ins: instruction} {val: u8} => ins @ val
}

A 51  ; OK: 1a 33
B 51  ; OK: 2a 33
C 51  ; OK: 3a 33
A-X 51 ; error: ambiguous nested ruleset
A-Y 51 ; error: ambiguous nested ruleset
A-Z 51 ; error: ambiguous nested ruleset
B-X 51 ; error: ambiguous nested ruleset
B-Y 51 ; error: ambiguous nested ruleset
B-Z 51 ; error: ambiguous nested ruleset
C-X 51 ; error: ambiguous nested ruleset
C-Y 51 ; error: ambiguous nested ruleset
C-Z 51 ; error: ambiguous nested ruleset

Unfortunately it doesn't work any better than the previous attempt but it produces a different error, about the ruleset being ambiguous.

Attempt 4 - Distinct instruction subrules with `-` separators and `,` terminators

In order to fix this ambiguity, I then add extra terminators to help the parser differentiate the two rules:

#subruledef opcode
{
        A => 0x1
        B => 0x2
        C => 0x3
}

#subruledef condition
{
        X => 0xa
        Y => 0xb
        Z => 0xc
}

#subruledef instruction
{
        {opc: opcode},                  => opc @ 0xa ; Default to X
        {opc: opcode}-{cnd: condition}, => opc @ cnd
}

#ruledef
{
        {ins: instruction} {val: u8} => ins @ val
}

A, 51   ; OK: 1a 33
B, 51   ; OK: 2a 33
C, 51   ; OK: 3a 33
A-X, 51 ; OK: 1a 33
A-Y, 51 ; OK: 1b 33
A-Z, 51 ; OK: 1c 33
B-X, 51 ; OK: 2a 33
B-Y, 51 ; OK: 2b 33
B-Z, 51 ; OK: 2c 33
C-X, 51 ; OK: 3a 33
C-Y, 51 ; OK: 3b 33
C-Z, 51 ; OK: 3c 33

This version actually succeeds at producing the correct binary, although it's a bit verbose.

Question

My overall question is then: would it be possible to bridge the gap between the naive but broken first attempt and the working but verbose fourth attempt?

There might also be other solutions that I overlooked :)

pol-rivero commented 3 years ago

It's possible to remove the , terminator by deleting the instruction subruledef:

#subruledef opcode {
        A => 0x1
        B => 0x2
        C => 0x3
}

#subruledef condition {
        X => 0xa
        Y => 0xb
        Z => 0xc
}

#ruledef {
        {opc: opcode} {val: u8}   => opc @ 0xa @ val
        {opc: opcode}-{cnd: condition} {val: u8}  => opc @ cnd @ val
}

A 51   ; OK: 1a 33
B 51   ; OK: 2a 33
C 51   ; OK: 3a 33
A-X 51 ; OK: 1a 33
A-Y 51 ; OK: 1b 33
A-Z 51 ; OK: 1c 33
B-X 51 ; OK: 2a 33
B-Y 51 ; OK: 2b 33
B-Z 51 ; OK: 2c 33
C-X 51 ; OK: 3a 33
C-Y 51 ; OK: 3b 33
C-Z 51 ; OK: 3c 33

However, I don't think there is any way to remove the - separator.

vxgmichel commented 3 years ago

@p-rivero Good catch! Unfortunately it doesn't help much in my case as applying your suggestion would still double the amount rules needed. That's because the operand format also splits into different patterns that end up tweaking some bits on the left side of the instruction. For more information:

the arm ruleset I implemented
the corresponding code I wrote

vxgmichel commented 3 years ago

I just ran into a simpler example while writing the ruleset for the full THUMB instruction set.

Consider the PUSH instruction, defined as:

#ruledef big_endian
{
    PUSH {rlist: u8}     => 0b1101010 @ 0b0 @ rlist
    PUSH {rlist: u8}, LR => 0b1101010 @ 0b1 @ rlist
}

PUSH 0b00001111         ; OK
PUSH 0b00001111, LR     ; OK

It works just fine. Now let's convert those instructions to little endian using the following trick:

#subruledef big_endian
{
    PUSH {rlist: u8}     => 0b1101010 @ 0b0 @ rlist
    PUSH {rlist: u8}, LR => 0b1101010 @ 0b1 @ rlist
}

#ruledef little_endian
{
    {val: big_endian}      => val[7:0] @ val[15:8]
}

PUSH 0b00001111        ; OK
PUSH 0b00001111, LR    ; error: `ambiguous nested ruleset` and `no match for instruction found`

It now fails with both the "ambiguous" and "no match" errors. Note that this trick usually works for the other instructions, it only fails in conjunction with this pattern of one instruction being an extended version of another one. I had to fix it by adding a discriminating character in front of the failing instructions:

#subruledef big_endian
{
    PUSH {rlist: u8}     => 0b1101010 @ 0b0 @ rlist
    !PUSH {rlist: u8}, LR => 0b1101010 @ 0b1 @ rlist
}

#ruledef little_endian
{
    {val: big_endian}      => val[7:0] @ val[15:8]
}

PUSH 0b00001111      ; OK
!PUSH 0b00001111, LR ; OK

Because of the little endian use-case, this is also related to issue #24.

hlorenzi commented 3 years ago

Alright, my latest commit should fix all of the above issues! 🎉 All of the attempts described above should work (both the ABC/XYZ and the PUSH attempts). The parser should now see rules and subrules as equal in status, so it shouldn't matter how you split up, nest, or organize your instructions.

Additionally, I've made an empty pattern available for subrules! The syntax looks like this:

#subruledef condition
{
  X => 0xa
  Y => 0xb
  Z => 0xc
  {} => 0xa ; Default to X
}

I'll be releasing this fix as a new version soon.

hlorenzi / customasm