phorward / libphorward

C/C++ library for dynamic data structures, regular expressions, lexical analysis & more...
MIT License
24 stars 7 forks source link

Problems with json parser #10

Closed FreeBASIC-programmer closed 6 years ago

FreeBASIC-programmer commented 6 years ago

Problems with json parser

I have written a simple json parser to test the pbnf parser. And it works just fine. Sometimes. If I run the parser numerous times (running from command-line) then it will either work (and produce a parse tree) or it will crash. The message I get when it crashes (the message is FreeBASIC specific):

Aborting due to runtime error 12 ("segmentation violation" signal) in json.bas::PARSE_JSON()

If I run the program within the context of an editor (Programmer's Notepad) I get different results.

I can get the same message as above or I get multiple lines that read

base/list.c, 515: Function called with wrong or incomplete parameters, fix your call!

The message ends with the line

base/list.c, 516: !CORE!

Aborting due to runtime error 12 ("segmentation violation" signal) in json.bas::PARSE_JSON()

And sometimes the parser works (the ast gets created) and I get the message

base/list.c, 43: Function called with wrong or incomplete parameters, fix your call!
base/list.c, 43: Function called with wrong or incomplete parameters, fix your call!

If I get the above message I do not get a segmentation violation.

The parser sometimes works and sometimes crashes. And the messages I am getting seem to be related to the list implementation. I'd say the problem is pointer related?

The problem could also be with the version of mingw I am using (using mingw 3.4.5 with options -gdwarf-2 and -g3). Or with the grammar I have written. I am using FB 1.0.5 (compilation flags: -g -v -gen gas -w pedantic -exx -maxerr inf -R). This is the BASIC code I wrote to implement the json parser

  #include "file.bi"
  #include "crt/stdio.bi"

  #inclib "phorward4"

  enum runtime_errors
    no_error = 0
    illegal_function_call= 1
    file_not_found_signal
    file_io_error
    out_of_memory
    illegal_resume
    out_of_bounds_array_access
    null_pointer_access
    no_privileges
    interrupted_signal
    illegal_instruction_signal
    floating_point_error_signal
    segmentation_violation_signal
    termination_request_signal
    abnormal_termination_signal
    quit_request_signal
    return_without_gosub
    end_of_file 
  end enum

  extern "c"
    type ppgram as any
    type plex as any
    type pppar as any
    type ppast as any
    declare function pp_gram_from_pbnf( byval g as ppgram ptr, byval src as zstring ptr) as byte
    declare function pp_gram_create() as ppgram ptr
    declare function pp_par_create( byval g as ppgram ptr) as pppar ptr
    declare function pp_par_parse( byval root as ppast ptr, byval par as pppar ptr, byval start as zstring ptr) as byte
    declare function pp_par_autolex(byval par as pppar ptr) as integer
    declare sub pp_ast_dump( byval stream as FILE ptr, byval ast as ppast ptr)
    declare sub pp_ast_dump_short( byval stream as FILE ptr, byval ast as ppast ptr)
    declare function pp_ast_free(byval node as ppast ptr) as ppast ptr
    declare function pp_gram_free( byval g as ppgram ptr) as ppgram ptr
    declare function pp_par_free( byval p as pppar ptr) as pppar ptr
  end extern

  function parse_json(byval json_text as zstring ptr) as integer
    dim lexer as string = _
      $"%skip /[\s]+/;"_
      $"NUMBER : /((\+|\-)?[\d]+)((\.[\d]+([Ee](\+|\-)?[0-9]+)?)|([Ee](\+|\-)[\d]+))?/ = nnumber;"_
      $"NULL : /null/=nil;"_
      $"TRUE : /true/=true ;"_
      $"FALSE : /false/=false ;"_
      $"SCONST : /""((\x5c(\x5c|a|""|b|f|n|r|t|u[a-fA-F0-9][a-fA-F0-9][a-fA-F0-9][a-fA-F0-9]|v))|[^\x5c""])*""/=sconst;"_      
      $"CCONST : /'((\x5c(\x5c|a|'|b|f|n|r|t|v))|[^\x5c'])*'/=cconst;"_      
      $"ID    : /[a-zA-Z_][A-Za-z0-9_]*/ =id;"_
      $"COLON : "":"";"_
      $"COMMA : "","";"_
      $"LC    : ""{"";"_
      $"RC    : ""}"";"_  
      $"LB    : ""["";"_
      $"RB    : ""]"";"

    dim parser as string = _
     $"json$ : value;"_
     $"value : object =nobject| array =narray| ID | NUMBER | NULL | TRUE | FALSE | CCONST | SCONST;"_
     $"object : LC RC | LC pair_list RC ;"_
     $"pair_list : pair | pair_list COMMA pair;"_
     $"key : SCONST | CCONST | ID ;"_
     $"pair : key COLON value = pair;"_
     $"array : LB RB | LB value_list RB ;"_
     $"value_list : value | value_list COMMA value;"

    dim grm as ppgram ptr
    grm = pp_gram_create()
    dim s as string 
    s = lexer & parser
    var result = pp_gram_from_pbnf( grm,strptr(s))
    if (result = 0) then 
      print "Creating grammar failed"
      error illegal_function_call
    end if
    dim root as ppast ptr
    dim par as pppar ptr
    par = pp_par_create(grm)
    if (par = 0) then
      print "Creating parser failed"
      error illegal_function_call
    end if
    var result2 = pp_par_autolex(par)
    if result2 = 0 then
      error illegal_instruction_signal
    end if
    var v1 = timer()
    var result3 = pp_par_parse( @root, par, json_text)
    var v2 = timer()    
    if (result3 = 0) then
      error illegal_instruction_signal
    end if
    print using "Timing : ####.#####";v2 - v2
    dim fresult as FILE ptr
    fresult = fopen("ast_tree","w+b")
    if (fresult = 0) then
      print "Could not write ast"
      error file_io_error
    end if
    pp_ast_dump_short(fresult,root)
    fclose(fresult)
    pp_ast_free(root)
    pp_par_free(par)
    pp_gram_free(grm)
  return 0
end function

sub get_json(byref filename as string, byval flag as ubyte)
  if flag then    
    if (fileexists(filename) = 0) then
      print "Could not find file ";filename
      error file_not_found_signal
    end if
    dim fh as integer = freefile()
    if (fh = 0) then
      print "Could not create file descriptor"
      error file_io_error
    end if
    open filename for binary access read as #fh
    var err_ = err()
    if (err_ <> 0) then
      print "Could not open file ";filename
      error file_io_error
    end if  
    dim json as ubyte ptr = callocate(lof(fh)+1,sizeof(ubyte))
    if (json = 0) then
      error out_of_memory
    end if
    get #fh,,*json,lof(fh)
    err_ = err()
    if (err_ <> 0) then
      error err_
    end if
    json[lof(fh)] = asc(!"\0")
    close #fh
    parse_json(cast(zstring ptr,json))
    deallocate(json)
  else
    parse_json(strptr(filename))
  end if

end sub

get_json("generated.json",1)

Writing the parser was fun. If anything can be done to keep the compiled parser from crashing then that would be nice. Version of libphorward used: 0.22.2 (using json data generated by https://www.json-generator.com)

phorward commented 6 years ago

Hi, thanks for opening the issue. The reason why it cores is that the core is wanted here for debug-purposes (I've forgotten to remove it). But I have to open a test case to find where this problem's origin comes from. It seems it has to be done with an empty list that is tried to be extended somewhere.

I like your JSON parser, may I integrate it into the project source as an example? I recently added your function to generate DOT-formatted dumps of lexers into the library's main branch, see 9b49c20ea05f67b247995ef78db826f7dfa079a8 for details :+1:

phorward commented 6 years ago

Hello @FreeBASIC-programmer, I've extracted the grammar from your BASIC program and it runs very well with the pparse command-line utilty. Good work! :-) I really would like to add this grammar as example grammar to the project, and hope its OK for you.

However, I am unable to reproduce the errors you run into, even if I call just pp_par_parse() or the entire parser construction and execution process in an endless loop.

The only problem I've found was a pointer access problem that raised when running my example program (below) with valgrind. This is now fixed with 3c171be118ac243068cd799a1715e1fbe1fbd803, but I'm unsure if this fixes your problem. You might test it again with this commit and your BASIC program.

Can you tell me which function call in your BASIC program cause the above errors?

This is my test program in C which now runs without any errors with the above commit:

#include <phorward.h>

static int  stack[100];                             /* Stack for calculations */
static int* tos = &stack[0];                        /* Top-of-stack pointer */

void calc( ppasteval type, ppast* node )            /* AST evaluation */
{
    if( type != PPAST_EVAL_BOTTOMUP )
        return;

    if( !strcmp( node->emit, "Int" ) )
        *(tos++) = atoi( node->start );
    else if( !strcmp( node->emit, "add" ) )
        *((tos--) - 2) = *(tos - 2) + *(tos - 1);
    else if( !strcmp( node->emit, "mul" ) )
        *((tos--) - 2) = *(tos - 2) * *(tos - 1);
}

int main()
{
    ppgram* grm;
    pppar*  par;
    ppast*  ast;

    while( TRUE )
    {
        grm = pp_gram_create();
        pp_gram_from_pbnf( grm,
             "Int  := /[0-9]+/ ;"
             "fact : Int | '(' expr ')' ;"
             "term : term '*' fact = mul | fact ;"
             "expr : expr '+' term = add | term ;" );

        par = pp_par_create( grm );
        pp_par_autolex( par );

        if( !pp_par_parse( &ast, par, "1+2*(3+4)+8" ) )
            return 1;

        pp_ast_dump_short( stdout, ast );

        pp_ast_free( ast );
        pp_par_free( par );
        pp_gram_free( grm );

        getchar();
    }

    return 0;
}
FreeBASIC-programmer commented 6 years ago

You wrote

Hello @FreeBASIC-programmer, I've extracted the grammar from your BASIC program and it runs very well with the pparse command-line utilty. Good work! :-) I really would like to add this grammar as example grammar to the project, and hope its OK for you.

You may use whatever I post in any way you see fit. My JSON grammar differs from the one at https://www.json.org (identifiers are not allowed according to the JSON standard).

phorward commented 6 years ago

You may use whatever I post in any way you see fit. My JSON grammar differs from the one at https://www.json.org (identifiers are not allowed according to the JSON standard).

Well okay, then Thank you very much for yet anoter contribution to the project. I noticed the change with the identifiers, but I like it!

Have you been able to run your program now multiple times without crashing?

FreeBASIC-programmer commented 6 years ago

No, I have not. I thought I posted a bug report with my previous comment but apparently something went wrong there. I traced back the problem to ast_to_gram (file: pbnf.c). At line 258 there is a branch which is taken when a node is a termdef. It is possible to follow a path of execution to line 313

sym = pp_sym_create(gram,name,...)

without the content of name getting set properly. name is a static variable declared at the start of ast_to_gram like so

char            name        [ NAMELEN * 2 + 1 ];

After that declaration the content of name consists of bytes with a random value. If name does not get set properly (which is possible) then name will contain random values when pp_sym_create gets called. That in itself is a problem.

The reason the crashes occur infrequently (seemingly at random) has to do with the random content of name. The exact content of name matters (a lot). I am going to describe the entire execution path here starting from the first if in ast_to_gram

if( NODE_IS( child, "flag_ignore" ) )
  {
    flag_ignore = TRUE;
       child = child->next;
 }

Next routine that gets called is pp_sym_create. pp_sym_create line 44 (file:sym.c)

     if( name && pp_sym_get_by_name( g, name ) )

name is not null and pp_sym_get_by_name gets called. sym.c line 161

    return (ppsym*)plist_access( plist_get_by_key( g->symbols, name ) );

line 767 in plist_get_by_key looks like this (file:list.c)

idx = plist_hash_index( list, key );

plist_hash_index (file: list.c) line 75 and onward look like this

    for( len = (long)pstrlen( key ); len > 0; len-- )
       hashval += (long)key[ len - 1 ];

    return (int)( hashval % list->hashsize );

And that's where the problem starts. The line hashval += (long)key[len-1] accesses the variable name. That variable has a content that consists of random bytes. Or rather: random chars. And a char on windows is signed. Which means that the value of key[len-1] can be < 0. Which ultimately means the hashval can be < 0. And that means plist_hash_index can return a negative value.

If you look back at line 771 in plist_get_by_key then you see that this negative hash value gets assigned to idx. And plist_get_by_key continues like so (line 767 and further)

    idx = plist_hash_index( list, key );
    VARS( "idx", "%d", idx );

    for( e = list->hash[ idx ]; e; e = e->hashnext )
    {
        VARS( "e", "%p", e );
        VARS( "e->key", "%p", e->key );

        if( plist_hash_compare( list, e->key, key ) == 0 )
        {
            MSG( "Key matches" );
            RETURN( e );
        }
    }

After the call to plist_hash_index the value of idx < 0. Then at the start of the loop e gets assigned the value list->hash[idx]. idx < 0. list->hash[idx] does not yield a valid pointer as idx is negative (idx should have been >= 0). Any subsequent use of e will lead to a crash. Which happens a couple of lines further when e is used in

if (plist_hash_compare(list,e->key,key) == 0)

e has some random value (not a valid pointer) so e->key leads to a crash.

And that is where the problem is. Improper initialization of a variable (name) in combination with the fact that a char is signed on windows ultimately leads to an attempt to access memory that does not contain a valid address.

There is code that seems to be put in place to prevent the problems described from happening is at line 291 in ast_to_gram (it tries to set the value of name if it was not set before)

if (! *name)
  sprintf( name, "%.*s", (int)child->len, child->start );

But name is not 0 (it contains random values). So if (! *name) fails. And when that if statement fails name does not get initialized. Back to the path of execution I mentioned at the start. It starts at the first if (function ast_to_gram, line 262 onward)

if( NODE_IS( child, "flag_ignore" ) )
  {
    flag_ignore = TRUE;
       child = child->next;
 }
else if( NODE_IS( child, "Terminal" ) )
{
  sprintf( name, "%.*s", (int)child->len, child->start );
  child = child->next;
}
  else
     *name = '\0';

If the grammar starts with %skip then the first if expression yields true.

if( NODE_IS( child, "flag_ignore" ) )
  {
    flag_ignore = TRUE;
       child = child->next;
 }

name does not get assigned a value. I think the only possible path that can lead to the problem situation starts with the above if - expression. The random crash either occurs right away (when the first node of the ast gets processed) or it does not occur at all. And the fact that it does not occur at all if it does not occur right away is because name always gets assigned a value unless the expression (NODE_IS(child,"flag_ignore)) yields true.

There is something in the use of sprintf (in ast_to_gram) that makes me wonder how the program keeps sprintf from writing data to name beyond the end of name. The size of name is fixed but what if child->len is bigger then the amount of memory allocated for name? I have not checked to see whether there are checks in libphorward that makes sure writing beyond the end of name is not possible.

Long story short: the json parser does not work. Which is due to a bug in ast_to_gram.

phorward commented 6 years ago

Hello @FreeBASIC-programmer! Wow, thank you so much for your very detailed bug tracing and description. You're right, the buffer name remains uninitialized under some circumstances and this causes the problems coming up. I've hopefully fixed it now with adaaa85d47ab605d00a6c3b35971c2dc3a0cc4a6 to name is initialized to a zero-terminated string before any other case is encountered.

I have to admit that most parts of the pbnf parser has been written in a very quick&dirty style, so memory leaking may occur and boundaries are not checked right now. I know about that and will fix this somewhat later, when the library will be made more stable and is feature complete with these parts.

Can you please check out if the bug is now resolved by using above commit?

FreeBASIC-programmer commented 6 years ago

I will check your fix asap. And I must say that I am very pleased that I found a bug. Finding a bug always puts a smile on my face :smiley:

Now for something totally unrelated (but maybe of interest to you).

I have been looking for simple implementations of glr parsers (to get an idea of what it takes to write a minimal implementation of a glr parser). And the simplest I could find is a parser generator called tom (written in C). You can find it at https://www.cs.cmu.edu/Groups/AI/areas/nlp/parsing/tom/. It is a grammar interpreter in the sense that it's input consists of a grammar and an input file. The result consists of a parse forest and a graph structured stack. I found it quite education to see how small changes to a simple expression grammar can make a glr parser produce a parse forest instead of a parse tree.

Tom is in the public domain. I think that tom could be a good starting point for an implementation of a glr parser (in case you want to add glr parsing to libphorward). I think that using lalr1 parsing as the base for glr parsing (tom uses lr0 parsing) will increase the performance of tom in a huge way. Add to that a bit of precedence 'tinkering ' and other changes (there is lots and lots of room for improvement) and the performance of tom would be close to the performance of a lalr1 parser.

FreeBASIC-programmer commented 6 years ago

I tested the changes and it works fine now. The json parser never crashes. And it parses a json file quite fast. Me likes pbnf.

Totally unrelated to the json parser but perhaps of interest. I tried an alternative for the current hash function as used by libphorward. When hashing a list of round about 255 keywords the number of collisions produced by the current hash function is 71. Using the proposed hashing scheme the number of collisions is 47 (size of the hashtable: 769).

The new hash function is about as trivial as the old one (assuming key is a zero terminated string)

  while (*key) 
 {
    hashval += (hashval << 1) + *key;
    key++;
 }; 

The above function gives better results (less collisions) than the current hash function. Whether a decrease in collisions influences runtime performance in a 'big' way remains to be seen. But less collisions is a good thing.

phorward commented 6 years ago

Hello @FreeBASIC-programmer, thanks again for your feedback, and apologize my later answer. I've just released phorward v0.22.4 a few hours ago, which is now on my opinion the last bugfix release, so 0.23 will be continued.

I tested the changes and it works fine now. The json parser never crashes. And it parses a json file quite fast. Me likes pbnf.

I'm glad to hear that you successfully could test your parser now and it works fine :+1:

Now for something totally unrelated (but maybe of interest to you).

Tanks for the tip with this implementation. Right now, I'm experimenting with parsing combinators, which is also an interesting and relatively young field in parser development. I've found out that there are papers about generalized parsing combinators, so that's like GLL but with a function aspect in terms when talking about "little parsers" that are combined to huger ones, with the ability to handle left-recursion.

But you're right, one of the final targets to phorward would be a GLALR parser first, so I will take a look at it when I'm having some free time and fun on it.

Totally unrelated to the json parser but perhaps of interest.

Hehe, sure, this is a total interest. The hash function from plist is one thing where I really hoped to get assistance with, because the current one is maybe too slow and - as you noticed - causes more collisions. If it's okay, I would like to open a new issue for that and bring it into the lbrary. Thanks for the tip! :+1: