mingodad opened this issue 3 years ago (status: Open)
I noticed that you are using a non-conventional syntax to describe regular expressions. Is there any particular reason for it?
Here is build/rules/uclang_parser.rules converted to EBNF with the script shown further below (which uses Lua pattern matching), so that we can visualize it on https://www.bottlecaps.de/rr/ui:
start ::= end_check
end_check ::= program _END_
end_check ::= _END_
program ::= import_list
// $ --- import list --
import_list ::= import import_list
import_list ::= namespace_def_list
import ::= import id semicolon
// $ --- namespace definitions list --
namespace_def_list ::= namespace_def namespace_def_list
namespace_def_list ::= namespace_def
// $ --- namespace definition --
namespace_def ::= namespace_def_name lc_br namespace_end
namespace_def_name ::= namespace namespace_id_list
// $ --- namespace using --
namespace_def ::= namespace_using_name lc_br namespace_end
namespace_using_name ::= using namespace namespace_id_list
namespace_end ::= namespace_def_list rc_br
namespace_end ::= rc_br
namespace_id_list ::= namespace_id namespace_id_list
namespace_id_list ::= id
namespace_id ::= id right_arrow
namespace_def ::= top_class
// $ --- modifiers definition --
def_modifier ::= public
def_modifier ::= private
def_modifier ::= static
def_modifier ::= abstract
def_modifier ::= final
def_modifier ::= parallel
// $ --- top class definition --
top_class ::= def_modifier top_class
top_class ::= class_def
// $ --- class definition --
class_def ::= class_name class_extends class_parts
class_name ::= class id
class_extends ::= extends namespace_id_list lc_br
class_extends ::= lc_br
class_parts ::= class_part_modifiers class_parts
class_parts ::= rc_br
class_part_modifiers ::= def_modifier class_part_modifiers
class_part_modifiers ::= class_part
class_part ::= class_element
class_part ::= class_def
class_part ::= method_def
class_element ::= id semicolon
class_element ::= class_element_expression semicolon
class_element_expression ::= class_element_name equal exp
class_element_name ::= id
// $ --- method definition --
method_def ::= method_name method_parameters method_body
method_name ::= id lr_br
method_parameters ::= method_parameter_list rr_br
method_parameters ::= rr_br
method_parameter_list ::= method_parameter_list comma method_parameter
method_parameter_list ::= method_parameter
method_parameter ::= id
method_parameter ::= at id
method_body ::= semicolon
method_body ::= method_body_begin rc_br
method_body ::= method_body_begin command_list rc_br
method_body_begin ::= lc_br
// $ --- command list --
command_list ::= command_list command
command_list ::= command
// $ --- command block --
command ::= command_block
command_block ::= lc_br rc_br
command_block ::= command_block_begin command_list rc_br
command_block_begin ::= lc_br
// $ --- try catch statement --
command ::= try_catch_block
try_catch_block ::= try_begin command_block catch_begin command_block
try_begin ::= try
catch_begin ::= catch lr_br id rr_br
// $ --- if, if-else statement --
command ::= if condition if_else
if_else ::= command
if_else ::= command else command
// $ --- while statement --
command ::= while_begin condition command
while_begin ::= while
// $ --- do-while statement --
command ::= do_while_begin command while condition semicolon
do_while_begin ::= do
// $ --- for statement --
command ::= for_begin lr_br for_expression rr_br command
for_expression ::= for_identifier left_arrow_in exp
for_identifier ::= id
for_begin ::= for
// $ --- switch statement --
command ::= switch lr_br switch_expression rr_br switch_item_list rc_br
switch_expression ::= exp
switch_item_list ::= switch_item_list switch_item
switch_item_list ::= lc_br
switch_item ::= switch_item_begin command
switch_item_begin ::= case case_exp_list colon
switch_item_begin ::= default colon
case_exp_list ::= case_exp_list comma exp
case_exp_list ::= exp
// $ --- break statement --
command ::= break semicolon
// $ --- continue statement --
command ::= continue semicolon
// $ --- return statement --
command ::= return expression semicolon
// $ --- command exp --
command ::= expression semicolon
// $ --- init exp --
A ::= array_elements re_br
// $ --- array --
array_elements ::= array_elements_begin
array_elements ::= array_elements_begin array_element_list
array_elements_begin ::= le_br
array_element_list ::= array_element_list comma exp
array_element_list ::= exp
// $ --- condition --
condition ::= lr_br exp rr_br
// $ --- expression --
expression ::= exp
// $ --- exp --
exp ::= H
// $ --- exp operators --
H ::= H equal H
H ::= H plus_equal H
H ::= H minus_equal H
H ::= H asterisk_equal H
H ::= H slash_equal H
H ::= H percent_equal H
H ::= H double_ls_br_equal H
H ::= H double_rs_br_equal H
H ::= H ampersand_equal H
H ::= H pipe_equal H
H ::= H circumflex_equal H
H ::= G
G ::= G double_ampersand F
G ::= G double_pipe F
G ::= F
F ::= F double_equal E
F ::= F exclamation_equal E
F ::= F rs_br E
F ::= F ls_br E
F ::= F rs_br_equal E
F ::= F ls_br_equal E
F ::= F left_arrow_in E
F ::= E
E ::= E ampersand D
E ::= E pipe D
E ::= E circumflex D
E ::= D
D ::= D double_rs_br C
D ::= D double_ls_br C
D ::= C
C ::= C plus B
C ::= C minus B
C ::= B
B ::= B asterisk A
B ::= B slash A
B ::= B percent A
B ::= A
A ::= A double_plus
A ::= A double_minus
A ::= double_plus A
A ::= double_minus A
A ::= plus A
A ::= minus A
A ::= exclamation A
A ::= tilde A
A ::= A le_br item_range re_br
item_range ::= H
// $ --- slice rules --
item_range ::= slice_range
slice_range ::= exp_colon_exp_colon
slice_range ::= exp_colon_exp_colon H
exp_colon_exp_colon ::= exp_colon exp_colon
exp_colon ::= H colon
exp_colon ::= colon
// $ --- exp bracket --
A ::= lr_br H rr_br
// $ --- identifier --
A ::= id
// $ --- this access --
A ::= this
// $ --- new object creation --
A ::= object_class_name parameters
A ::= object_class_name le_br exp re_br
// $ --- object class name --
object_class_name ::= new class_access
// $ --- free existing object --
A ::= free exp
// $ --- type identification --
A ::= type A
// $ --- convert to string --
A ::= dollar A
// $ --- object reference copy --
A ::= A equal_at H
// $ --- conditional expression --
H ::= H question exp colon exp
// $ --- logical operators --
H ::= H and exp
H ::= H or exp
// $ --- class access --
A ::= namespace_id class_access
class_access ::= namespace_id_list
// $ --- object member --
A ::= object_member
// $ --- method call --
A ::= id parameters
A ::= object_member parameters
parameters ::= parameters_begin rr_br
parameters ::= parameters_begin parameter_list rr_br
parameters_begin ::= lr_br
parameter_list ::= parameter_list comma exp
parameter_list ::= exp
// $ --- object member --
object_member ::= A dot id
// $ --- lambda function --
A ::= lambda_begin lambda_parameters command
lambda_begin ::= colon lr_br
lambda_parameters ::= method_parameter_list rr_br
lambda_parameters ::= rr_br
// $ --- constant values --
A ::= single_char_const
A ::= octal_char_const
A ::= hex_char_const
A ::= backslash_char_const
A ::= bin_int_const
A ::= oct_int_const
A ::= dec_int_const
A ::= hex_int_const
A ::= float_const
A ::= string_const_list
string_const_list ::= string_const_list string_const
string_const_list ::= string_const
string_const ::= string_const
The conversion script (using https://github.com/mingodad/squilu):
auto fname = "uclang_parser.rules";
auto txt = readfile(fname);
txt = txt.gsub(".-\nrules:", "", 1);
txt = txt.gsub("\n[ \t]*$ --", "\n// $ --");
txt = txt.gsub("%->>\n {.-\n }", "");
txt = txt.replace("->> {}", "");
txt = txt.replace("> -> ", " ::= ");
txt = txt.replace("\n <", "\n");
txt = txt.gsub("<([^>%s]+)>", "%1");
print(txt);
Notice how https://www.bottlecaps.de/rr/ui does some nice simplifications/optimizations:
A ::= array_elements re_br
| A ( double_plus | double_minus | le_br item_range re_br | equal_at H )
| ( double_plus | double_minus | plus | minus | exclamation | tilde | type | dollar ) A
| lr_br H rr_br
| ( id | object_member ) parameters?
| this
| object_class_name ( parameters | le_br exp re_br )
| free exp
| namespace_id class_access
| lambda_begin lambda_parameters command
| single_char_const
| octal_char_const
| hex_char_const
| backslash_char_const
| bin_int_const
| oct_int_const
| dec_int_const
| hex_int_const
| float_const
| string_const_list
Also, there is no complete example using the generated parser (C++, JS, Rust, PHP, AWK) that does more than dump the reduce actions. For example, how would one do the same as this example from the README, but with the generated C++, JS, Rust, PHP, or AWK code?
init_code: {s = {\};}
terminals:
oct_int_const {'0'.<07>*}
dec_int_const {<19>.d*}
hex_int_const {'0'.[xX].(<09>+<af>+<AF>).(<09>+<af>+<AF>)*}
lr_br {'('}
rr_br {')'}
plus {'+'}
minus {'-'}
asterisk {'*'}
slash {'/'}
percent {'%'}
_SKIP_ {w.w*}
_END_ {'\0'}
nonterminals:
<start> <exp> <C> <B> <A>
rules:
<start> -> <exp> _END_ ->> {}
<exp> -> <C> ->> {print("result: "..s[#s]);}
<C> -> <C> plus <B> ->> {s[#s-1] = s[#s-1] + table.remove(s);}
<C> -> <C> minus <B> ->> {s[#s-1] = s[#s-1] - table.remove(s);}
<C> -> <B> ->> {}
<B> -> <B> asterisk <A> ->> {s[#s-1] = s[#s-1] * table.remove(s);}
<B> -> <B> slash <A> ->> {s[#s-1] = s[#s-1] / table.remove(s);}
<B> -> <B> percent <A> ->> {s[#s-1] = s[#s-1] % table.remove(s);}
<B> -> <A> ->> {}
<A> -> lr_br <C> rr_br ->> {}
<A> -> oct_int_const ->> {table.insert(s,tonumber(rule_body(0),8));}
<A> -> dec_int_const ->> {table.insert(s,tonumber(rule_body(0),10));}
<A> -> hex_int_const ->> {table.insert(s,tonumber(rule_body(0)));}
Your idea of using the cont preprocessor is interesting, but looking at the generated files, they contain a full copy of almost everything, and with inline used for not-so-small functions (although the compiler can decide not to inline them), doesn't it generate more code than needed?
I like the idea of a parser and grammar interpreter for rapid prototyping, and also the possibility of using Lua for the actions while prototyping. I'm collecting grammars at https://github.com/mingodad/plgh, and I found the tool https://www.bottlecaps.de/rr/ui fantastic for helping visualize grammars. Is there any plan to add an option to output the grammar in an EBNF format compatible with that visualizer?
Thanks for your great work!
Thank you for your interest.
The preprocessor cont generates all code for the described containers. It distinguishes two kinds of generated methods: inlines and methods. Inline methods are generated into the header files of source code components (compilation units), and are thus propagated to every code component that includes the header file.
The preprocessor cont executed with the argument --geninc mimics the work of the C++ preprocessor and generates one large C++ file for the compiler. This is why the beginning of every generated file looks almost the same: those are the headers of the other compilation units.
Which functions are generated as inline for each container type (abstract data type, e.g. list, tree, ...) was determined by me, so it is just a pure experience-based estimate. Surely the compiler does its own work, but at least it has these inline functions available in the headers and can opt not to inline them (as you said). What the compiler cannot do is inline functions whose implementations are in other compilation units.
It looks like you have already solved outputting the grammar in EBNF format.
> I noticed that you are using a non-conventional syntax to describe regular expressions. Is there any particular reason for it?
There is no good reason for it. The codebase is almost 12 years old, and back then I had no good insight into standard regex syntax; this syntax emerged from knowledge obtained in a CS regular-languages class.
So I admit the regex syntax is not standard, but on the other hand, it works for me.
> Also, there is no complete example using the generated parser (C++, JS, Rust, PHP, AWK) that does more than dump the reduce actions. For example, how would one do the same as this example from the README, but with the generated C++, JS, Rust, PHP, or AWK code?
After generating the source code of the parser, the work of yapgen is done. It is the user's responsibility to integrate the generated code into the project, and to replace the dump of reduction rules with real code.
Semantic rules written in Lua are intended to simplify the grammar design process, and to enable some basic scripting support.
The generated code consists of two basic components:

- recognize_terminal: recognizes the next terminal symbol (the input for the parser grammar) in the input stream of bytes.
- parser_parse_source_string: processes the input stream of terminal symbols with the parser, driven by the shift and reduce actions from the generated LALR table.

The names of the generated functions are not important, because in the target project they will be changed, or will actually become methods of some class. The generated code does not impose any restrictions on the target project's structure.
Identifiers used in the generated code are prefixed with _prefix_ or _PREFIX_ to avoid name clashes in projects with multiple parsers in one namespace. These prefixes should be replaced by a parser identification.
The function recognize_terminal contains two macros (C/C++ code):

- _PREFIX_GET_NEXT_CHAR: retrieve the next byte from the input stream to parse.
- _PREFIX_CLOSE_CHAR: confirm processing of the actual byte and move to the next one.

These macros can be adjusted to process various types of input streams.
The function parser_parse_source_string should react properly to rule reductions. It will probably call some callbacks of the compiler or interpreter. For an example, see the uclang JSON module:
> Here is build/rules/uclang_parser.rules converted with this script (that uses Lua pattern matching) to EBNF that we can use on https://www.bottlecaps.de/rr/ui:
>
> The conversion script (using https://github.com/mingodad/squilu):
>
> Notice how https://www.bottlecaps.de/rr/ui does some nice simplifications/optimizations:
Nice and simple conversion script, and the visualization also looks great.
Some of the grammar rules may seem complex, but mostly that is because of the need to generate reduction callbacks in the interpreter, or to work around SLR(1) grammar restrictions.
> Your idea of using the cont preprocessor is interesting, but looking at the generated files, they contain a full copy of almost everything, and with inline used for not-so-small functions (although the compiler can decide not to inline them), doesn't it generate more code than needed?
>
> I like the idea of a parser and grammar interpreter for rapid prototyping, and also the possibility of using Lua for the actions while prototyping. I'm collecting grammars at https://github.com/mingodad/plgh, and I found the tool https://www.bottlecaps.de/rr/ui fantastic for helping visualize grammars. Is there any plan to add an option to output the grammar in an EBNF format compatible with that visualizer?
>
> Thanks for your great work!