I tried to write out the grammar of the assembly syntax in EBNF. This might help with fixing bugs in the assembler, while also providing a formal definition of the syntax.

I am probably missing some things and also have some notes:

instruction per line

.data MOV A, B MOV C, D DW 100 DUP(0x30)
;; or

Is this perfectly valid syntax? Would enforcing one instruction per line be a bad thing?
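To illustrate why one statement per line might simplify parsing, here is a minimal Python sketch (not the assembler's actual code). The function name `split_statements` is hypothetical, and it assumes that `;` starts a comment running to the end of the line.

```python
def split_statements(source: str) -> list[str]:
    """Return one statement per non-empty line, with ';' comments removed."""
    statements = []
    for raw_line in source.splitlines():
        # Assumption: ';' starts a comment that runs to the end of the line.
        code = raw_line.split(";", 1)[0].strip()
        if code:
            statements.append(code)
    return statements


example = """
.data
MOV A, B   ; a comment
MOV C, D
DW 100 DUP(0x30)
"""
print(split_statements(example))
```

With this rule the first token of every statement is a section, label, mnemonic or directive, so no lookahead is needed to find statement boundaries. Note that this naive split would also cut a `;` that appears inside a string literal.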

character literals

#164

binary literal integers

#165

octal literal integers

#166
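As a rough sketch of how binary and octal literals could sit next to the existing decimal and hexadecimal ones, the Python below parses all four forms. Only the `0x` prefix comes from the grammar; the `0b` and `0o` prefixes and the name `parse_integer_literal` are assumptions made for illustration.

```python
# Allowed digit characters per base.
DIGITS = {
    2: "01",
    8: "01234567",
    10: "0123456789",
    16: "0123456789abcdefABCDEF",
}
# '0x' is taken from the grammar; '0b' and '0o' are assumed prefixes.
PREFIXES = {"0x": 16, "0b": 2, "0o": 8}


def parse_integer_literal(text: str) -> int:
    """Convert a literal such as '100', '0x30', '0b1010' or '0o17' to an int."""
    base = PREFIXES.get(text[:2], 10)
    digits = text[2:] if base != 10 else text
    # Reject empty digit strings and characters outside the base's digit set.
    if not digits or any(ch not in DIGITS[base] for ch in digits):
        raise ValueError(f"invalid integer literal: {text!r}")
    return int(digits, base)
```

A prefix-based scheme keeps plain decimal literals unchanged, so existing programs would not be affected.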

escape sequences

something like C escape sequences?
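If C-style escapes were adopted, decoding the body of a string literal could look roughly like the Python sketch below. The grammar currently only defines `\"`; the other entries in the table and the name `unescape` are assumptions, not existing assembler behaviour.

```python
# Assumed escape table; only the escaped double quote appears in the grammar.
ESCAPES = {
    "n": "\n",
    "t": "\t",
    "0": "\0",
    "\\": "\\",
    '"': '"',
    "'": "'",
}


def unescape(body: str) -> str:
    """Decode backslash escapes in the text between the quotes of a string literal."""
    out = []
    chars = iter(body)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        # Consume the character that follows the backslash.
        code = next(chars, None)
        if code not in ESCAPES:
            raise ValueError(f"unknown escape sequence after backslash: {code!r}")
        out.append(ESCAPES[code])
    return "".join(out)
```

Rejecting unknown escapes, rather than passing them through, leaves room to add more sequences later without changing the meaning of existing programs.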

conclusion

It would be nice to model the assembler's parser on a documented syntax.

Grammar

binary = '0' | '1' ;

octal = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' ;

non_zero_digit = '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
digit = '0' | non_zero_digit ;

hexadecimal = digit | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' ;

lowercase = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' ;
uppercase = 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' ;

letter = lowercase | uppercase ;

symbol = '`' | '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-' | '+' | '=' | '{' | '}' | '[' | ']' | '|' | '\' | ':' | ';' | '<' | '>' | '?' | '/' ;
escaped_double_quote = '\"' ;

identifier_start = letter | '_' ;
identifier_character = identifier_start | digit ;
identifier = identifier_start , { identifier_character } ;

label = identifier , ':' ;

(* what about binary/octal literals? *)
decimal_integer_literal = digit , { digit } ;
hexadecimal_integer_literal = '0x' , hexadecimal , { hexadecimal } ;
integer_literal = decimal_integer_literal | hexadecimal_integer_literal ;

(* character literals? escape sequences? *)
string_literal = '"' , { letter | digit | symbol | '_' | "'" | escaped_double_quote } , '"' ;

literal = integer_literal | string_literal ;

equ_directive = identifier , 'EQU' , integer_literal ;

(* string and/or sequence (1,2,3,...) support? *)
dup_operand = integer_literal , 'DUP(' , integer_literal , ')' ;
dw_directive = 'DW' , ( literal | dup_operand ) , { ',' , ( literal | dup_operand ) } ;

text_section = '.' , 'text' ;
data_section = '.' , 'data' ;
section = text_section | data_section ;

memory_reference = '[' , ( integer_literal | identifier ) , ']' ;

mnemonic = (* list of all instructions *) ;

operand = identifier | integer_literal | memory_reference ;
(* operands are comma-separated, as in MOV A, B *)
instruction = mnemonic , [ operand , { ',' , operand } ] ;

(* instructions, sections and labels do not need to be on separate lines? parsing might be easier if they do *)
line = [ label | section ] , { instruction | dw_directive } ;
program = { line } ;
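As a sanity check of the lexical rules above, they can be translated into a small regex-based tokenizer. This is only a Python sketch: the token names and the `tokenize` function are hypothetical, whitespace is assumed to separate tokens, and the string pattern is looser than `string_literal` (it accepts any character other than an unescaped double quote).

```python
import re

# Each pattern mirrors one lexical rule of the grammar above.
TOKEN_SPEC = [
    ("HEX", r"0x[0-9a-fA-F]+"),             # hexadecimal_integer_literal
    ("DEC", r"[0-9]+"),                      # decimal_integer_literal
    ("STRING", r'"(?:\\"|[^"])*"'),          # string_literal (looser than the rule)
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),    # identifier, mnemonics, directives
    ("PUNCT", r"[\[\]():,.]"),               # brackets, parentheses, ':', ',', '.'
    ("SKIP", r"\s+"),                        # assumed: whitespace separates tokens
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text: str) -> list[tuple[str, str]]:
    """Split source text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("start: MOV A, [0x10]"))
```

The `HEX` pattern is listed before `DEC` so that `0x10` is not split into `0` followed by an identifier. A parser for the `line` and `program` rules could then work on this token list instead of raw characters.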

simon987 / Much-Assembly-Required

EBNF Grammar #167