kursjan / petitparser2

A high-performance top-down parser
MIT License

How to keep instance variables in sync? #13

Closed lukego closed 7 years ago

lukego commented 7 years ago

I am really enjoying getting started with PetitParser2. I have a few questions about how to set up my basic development workflow. Here is the first one :).

How should I keep the methods and productions for instance variables in sync? It seems like my parser class must have a 1:1 mapping between instance variables and productions in the grammar. Just now I am maintaining this manually in the class browser. I occasionally mess it up, e.g. when copying a method from one image to another without taking the instance variable along, and the error message is quite obscure in this case.

Tips welcome :).

kursjan commented 7 years ago

Hi, indeed, this is a bit of a problem. There is some effort in Moose called the PetitParser browser (tools->PetitParser2). Some people enjoy using this tool, but I never got used to it. I manage the dependencies manually. I have never found it especially annoying, but if you do, I welcome tips for improvements.

It is not 100% true that you need the 1:1 mapping. The class-side ignoredNames method lets you declare instvars that do not contain parsers but some other stuff (though I would recommend avoiding this if possible and keeping the grammar/parser as simple as possible).

The reason for the mapping is that, during instantiation of PP2CompositeParser, all the instvars are initialized with the values returned by the corresponding methods. The bit of magic is that PP handles cyclic dependencies.
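To make the mechanism above concrete, here is a minimal Python sketch of the idea. It is not PetitParser2's actual implementation: the names Composite, Delegate, Literal, Seq, Choice and the productions tuple are all hypothetical stand-ins. Each declared production first gets a placeholder, then the same-named method is called to build the real parser, so productions can refer to each other in a cycle. A declared production without a method fails with an explicit message.

```python
class Delegate:
    """Placeholder that forwards to the real parser once it is known."""
    def __init__(self, name):
        self.name = name
        self.target = None

    def parse(self, text, pos=0):
        return self.target.parse(text, pos)


class Literal:
    def __init__(self, s):
        self.s = s

    def parse(self, text, pos=0):
        # Return the position after the match, or None on failure.
        return pos + len(self.s) if text.startswith(self.s, pos) else None


class Seq:
    def __init__(self, *parts):
        self.parts = parts

    def parse(self, text, pos=0):
        for part in self.parts:
            pos = part.parse(text, pos)
            if pos is None:
                return None
        return pos


class Choice:
    def __init__(self, *options):
        self.options = options

    def parse(self, text, pos=0):
        for option in self.options:
            end = option.parse(text, pos)
            if end is not None:
                return end
        return None


class Composite:
    productions = ()  # hypothetical analogue of the instance variable list

    def __init__(self):
        # Step 1: install a placeholder for every declared production.
        delegates = {name: Delegate(name) for name in self.productions}
        for name, delegate in delegates.items():
            setattr(self, name, delegate)
        # Step 2: call the same-named method; inside it, self.<name> already
        # resolves to a placeholder, which is what makes cycles work.
        for name, delegate in delegates.items():
            method = getattr(type(self), name, None)
            if method is None:
                raise AttributeError(
                    f"production {name!r} is declared but has no method")
            delegate.target = method(self)


class Nested(Composite):
    productions = ("value", "array")

    def value(self):
        return Choice(Literal("x"), self.array)  # refers to array

    def array(self):
        return Seq(Literal("["), self.value, Literal("]"))  # refers back to value


print(Nested().value.parse("[[x]]"))  # prints 5, the position after the match
```

The placeholder step is the design choice that matters here: without it, value could not mention array before array had been built, and vice versa.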

lukego commented 7 years ago

Thanks for the feedback. I am okay with this workflow although I think that new users would benefit from more explicit error messages. Took me a fair bit of time and frustration to understand that I am supposed to maintain the instance variable declaration and not just write my production methods. This was especially an issue because I started with the PetitParser Browser which does all the instance variables automagically so I had no notion that they were supposed to exist.

lukego commented 7 years ago

Overall, the issue I have is that it is just so nice to work with PetitParser2 directly in the workspace, and then it feels like a drag to create the classes and test methods and so on. Hopefully this will pay off in terms of maintainability and extensibility though.

kursjan commented 7 years ago

Thanks for the feedback, perhaps something like this might be helpful?

lukego commented 7 years ago

I'm mostly ignoring the critics because they have been a bit unhelpful. It actually complains that I have too many instance variables in my class, even though for a PetitParser I need to have one for each production :).

One more thing that's niggling me a bit is how much typing I am doing. For example, to create a simple production called array I end up creating the method array, a method on the test class called array with a sample string, and then a test method called testArray that says self parse: self array rule: #array. Then I've typed the word "array" at least five times... and that's before I create a subclass that is supposed to override these methods to create a real object representation.

Is this normal? Is there a shortcut? Would it be terrible to collapse the whole parser into one method and write it the same way that I would in the Playground?

lukego commented 7 years ago

(I clicked the "this was not helpful" button on a couple of critics and left feedback but I have no idea where that ends up.)

lukego commented 7 years ago

I suspect that I chose the wrong problem for PetitParser2 in this instance actually.

The goal is to extract the internal structure of C programs from their debug info. It starts with the DWARF binary format, but I use readelf to convert that into text like this:

 <1><f9>: Abbrev Number: 16 (DW_TAG_structure_type)
    <fa>   DW_AT_name        : (indexed string: 0xbd1): lua_State
    <fc>   DW_AT_byte_size   : 96
    <fd>   DW_AT_decl_file   : 5
    <fe>   DW_AT_decl_line   : 564
    <100>   DW_AT_sibling     : <0x1b0>
 <2><104>: Abbrev Number: 15 (DW_TAG_member)
    <105>   DW_AT_name        : (indexed string: 0xb1f): nextgc
    <107>   DW_AT_decl_file   : 5
    <108>   DW_AT_decl_line   : 565
    <10a>   DW_AT_type        : <0xefe>
    <10e>   DW_AT_data_member_location: 0

and then I want to extract the definitions of all C structs, typedefs, etc.

I started working in the Playground and found that quite fun. However, when I started to make a Smalltalk class out of my parser I slowly accumulated dozens of methods for the productions and the tests. The code started to feel a bit over-engineered with too much structure.

This morning I tried a new approach of simply massaging the text into YAML format that looks like this:

<f9>:
  tag: structure_type
  name: lua_State
  byte_size: 96
  decl_file: 5
  decl_line: 564
  sibling: <0x1b0>
<104>:
  tag: member
  name: nextgc
  decl_file: 5
  decl_line: 565
  type: <0xefe>
  data_member_location: 0

This seems quite promising: now I should be able to load this into Pharo using the existing PetitParser YAML parser and then construct the data structures that I need from that.
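As a rough illustration of the "load it, then build data structures" step, here is a small Python sketch rather than the Pharo code itself. It assumes the two-level layout shown above (an unindented `<id>:` line followed by indented `key: value` lines) and reads it by hand instead of using a YAML library. The function name load_entries is hypothetical.

```python
SAMPLE = """
<f9>:
  tag: structure_type
  name: lua_State
  byte_size: 96
  decl_file: 5
  decl_line: 564
  sibling: <0x1b0>
<104>:
  tag: member
  name: nextgc
  decl_file: 5
  decl_line: 565
  type: <0xefe>
  data_member_location: 0
"""


def load_entries(text):
    """Return {entry id: {attribute name: value}}, all values kept as strings."""
    entries = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not line.startswith(" "):
            # Unindented line such as "<f9>:" starts a new entry.
            current = {}
            entries[line.rstrip().rstrip(":")] = current
        else:
            if current is None:
                raise ValueError(f"attribute before any entry: {line!r}")
            # Indented line such as "  name: lua_State" adds one attribute.
            key, _, value = line.strip().partition(": ")
            current[key] = value
    return entries


entries = load_entries(SAMPLE)
print(entries["<f9>"]["name"])    # prints lua_State
print(entries["<104>"]["type"])   # prints <0xefe>
```

Note that this flat layout no longer records the nesting depth from the readelf output, so relating a member to its enclosing struct would need extra information, for example the order of entries or the sibling attribute.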

So I'm not sure if I picked the wrong problem for PetitParser, or picked the wrong approach to the code, or just have not acclimatised to Smalltalk's style with lots of small methods yet. Something to ponder.

Here is the dwarf2yml program JFYI:

#!/usr/bin/awk -f
# Requires GNU awk (gawk): the three-argument match() is a gawk extension.
BEGIN { FS=": " }

# From:
#   <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
# To:
#   <b>:
#     tag: compile_unit
/Abbrev Number.*DW_TAG/ {
    match($1, /<[0-9a-f]+><([0-9a-f]+)>/, id);
    # Allow uppercase and digits so tags like DW_TAG_GNU_call_site match too.
    match($NF, "DW_TAG_([a-zA-Z0-9_]+)", tag);
    printf("<%s>:\n", id[1]);
    printf("  tag: %s\n", tag[1]);
}

# From:
#   <47>   DW_AT_byte_size   : 24
# To:
#   byte_size: 24
/DW_AT_/ {
    gsub("\t", " ");            # tabs to spaces
    match($1, "DW_AT_([a-zA-Z0-9_]+)", name);
    printf("  %s: %s\n", name[1], $NF);
}

I'll close the issue shortly and pop back again the next time I have a good problem for PetitParser2. (Just wondering... would PetitParser2 make a good basis for a bytecode decompiler?)