mingodad opened this issue 3 years ago (status: Open)
I noticed that you are using a non-conventional syntax to describe regular expressions. Is there any particular reason for it?
Here is build/rules/uclang_parser.rules converted to EBNF with the script shown further below (which uses Lua pattern matching), so that we can visualize it on https://www.bottlecaps.de/rr/ui:
start ::= end_check
end_check ::= program _END_
end_check ::= _END_
program ::= import_list
// $ --- import list --
import_list ::= import import_list
import_list ::= namespace_def_list
import ::= import id semicolon
// $ --- namespace definitions list --
namespace_def_list ::= namespace_def namespace_def_list
namespace_def_list ::= namespace_def
// $ --- namespace definition --
namespace_def ::= namespace_def_name lc_br namespace_end
namespace_def_name ::= namespace namespace_id_list
// $ --- namespace using --
namespace_def ::= namespace_using_name lc_br namespace_end
namespace_using_name ::= using namespace namespace_id_list
namespace_end ::= namespace_def_list rc_br
namespace_end ::= rc_br
namespace_id_list ::= namespace_id namespace_id_list
namespace_id_list ::= id
namespace_id ::= id right_arrow
namespace_def ::= top_class
// $ --- modifiers definition --
def_modifier ::= public
def_modifier ::= private
def_modifier ::= static
def_modifier ::= abstract
def_modifier ::= final
def_modifier ::= parallel
// $ --- top class definition --
top_class ::= def_modifier top_class
top_class ::= class_def
// $ --- class definition --
class_def ::= class_name class_extends class_parts
class_name ::= class id
class_extends ::= extends namespace_id_list lc_br
class_extends ::= lc_br
class_parts ::= class_part_modifiers class_parts
class_parts ::= rc_br
class_part_modifiers ::= def_modifier class_part_modifiers
class_part_modifiers ::= class_part
class_part ::= class_element
class_part ::= class_def
class_part ::= method_def
class_element ::= id semicolon
class_element ::= class_element_expression semicolon
class_element_expression ::= class_element_name equal exp
class_element_name ::= id
// $ --- method definition --
method_def ::= method_name method_parameters method_body
method_name ::= id lr_br
method_parameters ::= method_parameter_list rr_br
method_parameters ::= rr_br
method_parameter_list ::= method_parameter_list comma method_parameter
method_parameter_list ::= method_parameter
method_parameter ::= id
method_parameter ::= at id
method_body ::= semicolon
method_body ::= method_body_begin rc_br
method_body ::= method_body_begin command_list rc_br
method_body_begin ::= lc_br
// $ --- command list --
command_list ::= command_list command
command_list ::= command
// $ --- command block --
command ::= command_block
command_block ::= lc_br rc_br
command_block ::= command_block_begin command_list rc_br
command_block_begin ::= lc_br
// $ --- try catch statement --
command ::= try_catch_block
try_catch_block ::= try_begin command_block catch_begin command_block
try_begin ::= try
catch_begin ::= catch lr_br id rr_br
// $ --- if, if-else statement --
command ::= if condition if_else
if_else ::= command
if_else ::= command else command
// $ --- while statement --
command ::= while_begin condition command
while_begin ::= while
// $ --- do-while statement --
command ::= do_while_begin command while condition semicolon
do_while_begin ::= do
// $ --- for statement --
command ::= for_begin lr_br for_expression rr_br command
for_expression ::= for_identifier left_arrow_in exp
for_identifier ::= id
for_begin ::= for
// $ --- switch statement --
command ::= switch lr_br switch_expression rr_br switch_item_list rc_br
switch_expression ::= exp
switch_item_list ::= switch_item_list switch_item
switch_item_list ::= lc_br
switch_item ::= switch_item_begin command
switch_item_begin ::= case case_exp_list colon
switch_item_begin ::= default colon
case_exp_list ::= case_exp_list comma exp
case_exp_list ::= exp
// $ --- break statement --
command ::= break semicolon
// $ --- continue statement --
command ::= continue semicolon
// $ --- return statement --
command ::= return expression semicolon
// $ --- command exp --
command ::= expression semicolon
// $ --- init exp --
A ::= array_elements re_br
// $ --- array --
array_elements ::= array_elements_begin
array_elements ::= array_elements_begin array_element_list
array_elements_begin ::= le_br
array_element_list ::= array_element_list comma exp
array_element_list ::= exp
// $ --- condition --
condition ::= lr_br exp rr_br
// $ --- expression --
expression ::= exp
// $ --- exp --
exp ::= H
// $ --- exp operators --
H ::= H equal H
H ::= H plus_equal H
H ::= H minus_equal H
H ::= H asterisk_equal H
H ::= H slash_equal H
H ::= H percent_equal H
H ::= H double_ls_br_equal H
H ::= H double_rs_br_equal H
H ::= H ampersand_equal H
H ::= H pipe_equal H
H ::= H circumflex_equal H
H ::= G
G ::= G double_ampersand F
G ::= G double_pipe F
G ::= F
F ::= F double_equal E
F ::= F exclamation_equal E
F ::= F rs_br E
F ::= F ls_br E
F ::= F rs_br_equal E
F ::= F ls_br_equal E
F ::= F left_arrow_in E
F ::= E
E ::= E ampersand D
E ::= E pipe D
E ::= E circumflex D
E ::= D
D ::= D double_rs_br C
D ::= D double_ls_br C
D ::= C
C ::= C plus B
C ::= C minus B
C ::= B
B ::= B asterisk A
B ::= B slash A
B ::= B percent A
B ::= A
A ::= A double_plus
A ::= A double_minus
A ::= double_plus A
A ::= double_minus A
A ::= plus A
A ::= minus A
A ::= exclamation A
A ::= tilde A
A ::= A le_br item_range re_br
item_range ::= H
// $ --- slice rules --
item_range ::= slice_range
slice_range ::= exp_colon_exp_colon
slice_range ::= exp_colon_exp_colon H
exp_colon_exp_colon ::= exp_colon exp_colon
exp_colon ::= H colon
exp_colon ::= colon
// $ --- exp bracket --
A ::= lr_br H rr_br
// $ --- identifier --
A ::= id
// $ --- this access --
A ::= this
// $ --- new object creation --
A ::= object_class_name parameters
A ::= object_class_name le_br exp re_br
// $ --- object class name --
object_class_name ::= new class_access
// $ --- free existing object --
A ::= free exp
// $ --- type identification --
A ::= type A
// $ --- convert to string --
A ::= dollar A
// $ --- object reference copy --
A ::= A equal_at H
// $ --- conditional expression --
H ::= H question exp colon exp
// $ --- logical operators --
H ::= H and exp
H ::= H or exp
// $ --- class access --
A ::= namespace_id class_access
class_access ::= namespace_id_list
// $ --- object member --
A ::= object_member
// $ --- method call --
A ::= id parameters
A ::= object_member parameters
parameters ::= parameters_begin rr_br
parameters ::= parameters_begin parameter_list rr_br
parameters_begin ::= lr_br
parameter_list ::= parameter_list comma exp
parameter_list ::= exp
// $ --- object member --
object_member ::= A dot id
// $ --- lambda function --
A ::= lambda_begin lambda_parameters command
lambda_begin ::= colon lr_br
lambda_parameters ::= method_parameter_list rr_br
lambda_parameters ::= rr_br
// $ --- constant values --
A ::= single_char_const
A ::= octal_char_const
A ::= hex_char_const
A ::= backslash_char_const
A ::= bin_int_const
A ::= oct_int_const
A ::= dec_int_const
A ::= hex_int_const
A ::= float_const
A ::= string_const_list
string_const_list ::= string_const_list string_const
string_const_list ::= string_const
string_const ::= string_const
The conversion script (using https://github.com/mingodad/squilu):
auto fname = "uclang_parser.rules";
auto txt = readfile(fname);
txt = txt.gsub(".-\nrules:", "", 1);
txt = txt.gsub("\n[ \t]*$ --", "\n// $ --");
txt = txt.gsub("%->>\n {.-\n }", "");
txt = txt.replace("->> {}", "");
txt = txt.replace("> -> ", " ::= ");
txt = txt.replace("\n <", "\n");
txt = txt.gsub("<([^>%s]+)>", "%1");
print(txt);
Notice how https://www.bottlecaps.de/rr/ui does some nice simplifications/optimizations:
A ::= array_elements re_br
| A ( double_plus | double_minus | le_br item_range re_br | equal_at H )
| ( double_plus | double_minus | plus | minus | exclamation | tilde | type | dollar ) A
| lr_br H rr_br
| ( id | object_member ) parameters?
| this
| object_class_name ( parameters | le_br exp re_br )
| free exp
| namespace_id class_access
| lambda_begin lambda_parameters command
| single_char_const
| octal_char_const
| hex_char_const
| backslash_char_const
| bin_int_const
| oct_int_const
| dec_int_const
| hex_int_const
| float_const
| string_const_list
Also, there is no complete example using the generated parser (C++, JS, Rust, PHP, AWK) that does more than dump the reduce actions. For example, how would one do the same as this example from the README, but with the generated C++, JS, Rust, PHP, or AWK code?
init_code: {s = {\};}
terminals:
oct_int_const {'0'.<07>*}
dec_int_const {<19>.d*}
hex_int_const {'0'.[xX].(<09>+<af>+<AF>).(<09>+<af>+<AF>)*}
lr_br {'('}
rr_br {')'}
plus {'+'}
minus {'-'}
asterisk {'*'}
slash {'/'}
percent {'%'}
_SKIP_ {w.w*}
_END_ {'\0'}
nonterminals:
<start> <exp> <C> <B> <A>
rules:
<start> -> <exp> _END_ ->> {}
<exp> -> <C> ->> {print("result: "..s[#s]);}
<C> -> <C> plus <B> ->> {s[#s-1] = s[#s-1] + table.remove(s);}
<C> -> <C> minus <B> ->> {s[#s-1] = s[#s-1] - table.remove(s);}
<C> -> <B> ->> {}
<B> -> <B> asterisk <A> ->> {s[#s-1] = s[#s-1] * table.remove(s);}
<B> -> <B> slash <A> ->> {s[#s-1] = s[#s-1] / table.remove(s);}
<B> -> <B> percent <A> ->> {s[#s-1] = s[#s-1] % table.remove(s);}
<B> -> <A> ->> {}
<A> -> lr_br <C> rr_br ->> {}
<A> -> oct_int_const ->> {table.insert(s,tonumber(rule_body(0),8));}
<A> -> dec_int_const ->> {table.insert(s,tonumber(rule_body(0),10));}
<A> -> hex_int_const ->> {table.insert(s,tonumber(rule_body(0)));}
Your idea of using the cont preprocessor is interesting, but looking at the generated files, they contain a full copy of almost everything, and with inline used for not-so-small functions (although the compiler can decide not to inline them), doesn't it generate more code than needed?
I like the idea of a parser and grammar interpreter for rapid prototyping, and also the possibility of using Lua for the actions while prototyping. I'm collecting grammars at https://github.com/mingodad/plgh, and I found the tool https://www.bottlecaps.de/rr/ui fantastic for helping visualize grammars. Is there any plan to add an option to output the grammar in an EBNF format compatible with that visualizer?
Thanks for your great work!
Thank you for your interest.
The preprocessor cont generates all code for the described containers. It distinguishes two kinds of generated methods: inlines and methods. Inline methods are generated into the header files of source code components (compilation units), and are thus propagated to every code component that includes the header file.
The preprocessor cont executed with the argument --geninc mimics the work of the C++ preprocessor and generates one large C++ file for the compiler. This is why the beginning of every generated file looks almost the same: those are the headers of the other compilation units.
Which functions are generated as inline for each container type (abstract data type, e.g. list, tree, ...) was determined by me, so it is just a pure experience-based estimate. Surely the compiler does its own work, but at least it has these inline functions available in the headers and can opt not to inline them (as you said). What the compiler cannot do is inline functions whose implementations are in other compilation units.
It looks like you have already solved outputting the grammar in EBNF format.
> I noticed that you are using a non-conventional syntax to describe regular expressions. Is there any particular reason for it?
There is no good reason for it. The codebase is almost 12 years old, and back then I had no good insight into standard regex syntax; this syntax emerged from knowledge obtained in a CS regular-languages class.
So I admit the regex syntax is not standard, but on the other hand, it works for me.
> Also, there is no complete example using the generated parser (C++, JS, Rust, PHP, AWK) that does more than dump the reduce actions. For example, how would one do the same as this example from the README, but with the generated C++, JS, Rust, PHP, or AWK code?
After generating the source code of the parser, the work of yapgen is done. It is the user's responsibility to integrate the generated code into the project, and to replace the dump of reduction rules with real code.
Semantic rules written in Lua are intended to simplify the grammar design process, and to enable some basic scripting support.
The generated code consists of two basic components:

- recognize_terminal: recognizes the next terminal symbol (the input for the parser grammar) in the input stream of bytes.
- parser_parse_source_string: processes the input stream of terminal symbols with the parser, driven by the shift and reduce actions from the generated LALR table.

The names of the generated functions are not important, because in the target project they will be changed, or will actually become methods of some class. The generated code does not impose any restrictions on the target project's structure.
Identifiers used in the generated code are prefixed with _prefix_ or _PREFIX_ to avoid name clashes in projects with multiple parsers in one namespace. These prefixes should be replaced by a parser identification.
The function recognize_terminal contains two macros (C/C++ code):

- _PREFIX_GET_NEXT_CHAR: retrieve the next byte from the input stream to parse.
- _PREFIX_CLOSE_CHAR: confirm processing of the actual byte and move to the next one.

These macros can be adjusted to process various types of input streams.
The function parser_parse_source_string should react properly to rule reductions. It will probably call some callbacks of the compiler or interpreter. For an example, see the uclang JSON module:
> Here is build/rules/uclang_parser.rules converted with this script (that uses Lua pattern matching) to EBNF that we can use on https://www.bottlecaps.de/rr/ui:
>
> The conversion script (using https://github.com/mingodad/squilu):
>
> Notice how https://www.bottlecaps.de/rr/ui does some nice simplifications/optimizations:
Nice and simple conversion script, and the visualization also looks great.
Some of the grammar rules may seem complex, but mostly that is because of the need to generate reduction callbacks in the interpreter, or to work around SLR(1) grammar restrictions.
> Your idea of using the cont preprocessor is interesting, but looking at the generated files, they contain a full copy of almost everything, and with inline used for not-so-small functions (although the compiler can decide not to inline them), doesn't it generate more code than needed?
>
> I like the idea of a parser and grammar interpreter for rapid prototyping, and also the possibility of using Lua for the actions while prototyping. I'm collecting grammars at https://github.com/mingodad/plgh, and I found the tool https://www.bottlecaps.de/rr/ui fantastic for helping visualize grammars. Is there any plan to add an option to output the grammar in an EBNF format compatible with that visualizer?
>
> Thanks for your great work!