Program flow of bi-tbcnn

Goal: modify the code provided with the paper "Cross-Language Learning for Program Classification".

In order to use the code and modify it to fit our needs, we first have to understand roughly how it works.

Program Flow:

ast2vec

The ast2vec folder contains several scripts that convert the data in the ast folder into a representation from which the neural network can learn the possible children of a parent node. It consists of four main components:

fast_merge_pickles_to_pickle.py takes as input a folder containing serialized (pickled) AST files, a language parameter (cpp/java), and an output path. After execution, the output file contains an array holding every AST file, serialized in pickle format.
fast_pickle_file_to_nodes.py takes as input the output of the script above and an output file location. It creates a new array in which each entry holds a node, its parent, and an array of the node's children. For example:
```
[
{node:'0',parent:'none',children:[30,30,45,33]},
{node:'6',parent:'0',children:[]},
...
]
```
The child '30' shows up with the node value '6' because of the "double mapping": the node values come from input/srcml_node_map.tsv, whereas the children values come from the map file input/maps.java.pkl. For our example, I think we don't need the extra children map; a single node_map should be enough.
train.py takes as input the output of fast_pickle_file_to_nodes.py and an output location for the pretrained vectors.
pickle_files_to_training_trees.py takes as input the merged pickle file from fast_merge_pickles_to_pickle.py and an output file location for a serialized pickle file that contains the training and testing data.
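
The node/parent/children records described for fast_pickle_file_to_nodes.py can be sketched roughly as follows. This is not the script from the paper's repository, only a minimal illustration under assumptions: the input tree is a nested dict with the hypothetical keys `kind` and `children`, and a single node map is used for both nodes and children (the simplification suggested above), so there is no "double mapping".

```python
def tree_to_records(root):
    """Flatten a nested AST into a list of node/parent/children records.

    Assumption: every node is a dict with a "kind" key and an optional
    "children" list. This layout is hypothetical, not the paper's format.
    """
    records = []
    # Each stack entry is (node, kind of its parent); the root has no parent.
    stack = [(root, "none")]
    while stack:
        node, parent_kind = stack.pop()
        children = node.get("children", [])
        records.append({
            "node": node["kind"],
            "parent": parent_kind,
            "children": [child["kind"] for child in children],
        })
        # Push children in reverse so they are visited left to right.
        for child in reversed(children):
            stack.append((child, node["kind"]))
    return records


# Small hypothetical tree: a root of kind "0" with two leaf children.
example = {
    "kind": "0",
    "children": [
        {"kind": "6", "children": []},
        {"kind": "45", "children": []},
    ],
}
records = tree_to_records(example)
```

With a single map, the values in a parent's `children` array are the same values that appear in the `node` field of the child records, which makes the output easier to check by hand.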

As discussed in the meeting on Wednesday, we want to adapt our current AST structure so that we can use the scripts provided with the paper "Cross-Language Learning for Program Classification". The basic AST encoding used in the paper looks like this:

child {
  kind: NAME
  text: "webtechmobile"
  child {
    kind: POSITION
  }
  child {
    kind: XXX
    child {
      kind: X
    }
  }
}

Several changes to the "Declare" node were necessary. The scripts in the paper's repository use only the attributes child and kind, so we will use this notation as well for now. That means that the following AST snippet:

{
   "Kind":"Seq",
   "Left":{
      "Kind":"Declare",
      "Label":"L",
      "Var":"j"
   },
   "Right":{
       ...
   }
}

will be converted into

{
   "Kind":"Seq",
   "Child":{
       "Kind":"Declare",
       "Child":{
           "Kind":"Var",
           "Child":{
               "Kind":"VARNAME"
           }
       },
       "Child":{
           "Kind":"SecurityClass",
           "Child":{
               "Kind":"H/L"
           }
       }
   },
   "Child":{
       .....
   }
}
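
The conversion above could be sketched like this. It is only an illustration under several assumptions: the function name `convert` is hypothetical, the repeated "Child" entries are stored in a list under a `Children` key (a Python dict cannot hold the same key twice), `VARNAME` and `H/L` are read as placeholders for the actual variable name and label, and only the `Seq` and `Declare` nodes shown above are handled.

```python
def convert(node):
    """Rewrite one node of our AST into the Kind/Child notation.

    Assumption: children are collected in a "Children" list because a
    dict cannot contain several "Child" keys.
    """
    if node["Kind"] == "Declare":
        # The Var and Label attributes become child subtrees of their own.
        return {
            "Kind": "Declare",
            "Children": [
                {"Kind": "Var",
                 "Children": [{"Kind": node["Var"], "Children": []}]},
                {"Kind": "SecurityClass",
                 "Children": [{"Kind": node["Label"], "Children": []}]},
            ],
        }
    # Other nodes (here: Seq) keep their kind; Left/Right become children.
    children = [convert(node[key]) for key in ("Left", "Right") if key in node]
    return {"Kind": node["Kind"], "Children": children}


# Hypothetical input; the "Right" branch is made up for illustration.
tree = {
    "Kind": "Seq",
    "Left": {"Kind": "Declare", "Label": "L", "Var": "j"},
    "Right": {"Kind": "Declare", "Label": "H", "Var": "k"},
}
converted = convert(tree)
```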

Every kind string is then replaced by a number according to a map in which every possible node kind gets a unique id.
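
One possible way to build and apply such a map is sketched below. The helper names (`iter_kinds`, `build_kind_map`, `encode`) and the tree layout with `Kind` and a `Children` list are assumptions made for this example, and the ids are simply assigned in alphabetical order of the kinds; the map used by the paper's scripts may be built differently.

```python
def iter_kinds(node):
    """Yield the kind of a node and of all of its descendants."""
    yield node["Kind"]
    for child in node.get("Children", []):
        yield from iter_kinds(child)


def build_kind_map(trees):
    """Assign a unique integer id to every kind that occurs in the trees."""
    kinds = sorted({kind for tree in trees for kind in iter_kinds(tree)})
    return {kind: index for index, kind in enumerate(kinds)}


def encode(node, kind_map):
    """Return a copy of the tree in which each kind is replaced by its id."""
    return {
        "Kind": kind_map[node["Kind"]],
        "Children": [encode(child, kind_map)
                     for child in node.get("Children", [])],
    }


# Hypothetical tree in the Kind/Children layout assumed above.
tree = {
    "Kind": "Seq",
    "Children": [
        {"Kind": "Declare", "Children": [{"Kind": "Var", "Children": []}]},
    ],
}
kind_map = build_kind_map([tree])
encoded = encode(tree, kind_map)
```

Sorting the kinds before numbering keeps the ids stable between runs, which matters if the map is rebuilt after the pretrained vectors have been stored.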
