microsoft / graph-based-code-modelling

Code for "Generative Code Modeling with Graphs" (ICLR'19)
MIT License

Implementation of VarNaming task from ICLR '18 #2

Closed: haighal closed this issue 5 years ago

haighal commented 5 years ago

Hi,

My name is Alex Haigh, and I'm a master's student at Stanford. For a project, I'm working to reproduce (and hopefully extend) some of your results on VarNaming from your '18 ICLR paper. My understanding of your model for that task is that you:

  1. Replace each instance of your target variable with a <SLOT> token
  2. Represent every other variable as the concatenation of (a) the average of the (learnable) embeddings of the subtokens in its name and (b) an embedding of the variable's type (as described at the top of p. 5 here)
  3. Run message passing with a GGNN for 8 timesteps, using the program graph
  4. Average the final representation of every <SLOT> token
  5. Use this as input to a GRU decoder that outputs the variable name as a sequence of subtokens.
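
To make sure I have this straight, here's my mental model of steps 2 and 4-5 in NumPy-flavoured pseudocode (all names and shapes are my own, not from your code base):

```python
import numpy as np

def initial_node_state(subtoken_ids, type_id, subtoken_emb, type_emb):
    # Step 2: average the learnable subtoken embeddings, concatenate the type embedding
    return np.concatenate([subtoken_emb[subtoken_ids].mean(axis=0), type_emb[type_id]])

def slot_representation(final_node_states, slot_node_ids):
    # Steps 4-5: average the final GGNN states of all <SLOT> nodes;
    # this vector then initialises the GRU decoder's hidden state.
    return final_node_states[slot_node_ids].mean(axis=0)
```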

I found the dataset here, and it looks like it's in a format digestible by utils/tensorise.py. Similarly, the model you use for VarNaming seems to be the Graph2SeqModel.

So, is that all I need to do to reproduce the results?

Just wanted to make sure I'm looking in the right place, and I'd also appreciate any other tips you have. Also, what modifications did you make to the model based on Cvitkovic et al.? And is there a way to compare results with/without those modifications?

Thanks!

mmjb commented 5 years ago

Heya,

This sounds mostly right (there may be some patching required in tensorise to look at the variable name instead of the target expression). The code released here is indeed quite similar to what we used for the ICLR'18 model (it's the basis before some refactorings / extensions for the ICLR'19 paper). The Graph2SeqModel is a bit different from what we used, but you should be able to retool it fairly easily.

The only obvious problem I see is that I removed the subtokenization bits at some point (when introducing the CharCNNs for node labels). If you really want to reproduce the original result, you would need to patch this back in (essentially, look for uses of the cg_node_label_embedding_style hyperparameter to find places you need to modify). Apart from that, I see no obvious issue, but I also haven't tried this ... so good luck, and come back with questions? :-)

Re the Cvitkovic et al. extensions: These are controlled by the cg_add_subtoken_nodes hyperparameter (for adding the meta-nodes for each used subtoken) and the cg_node_label_embedding_style hyperparameter (which is CharCNN for their best-performing configuration).

I think we also added residual GGNN connections, which you should be able to turn off with cg_ggnn_residual_connections: {}.
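
To be concrete, I'd expect the relevant overrides to look something like this (from memory, so double-check the keys and accepted values against the model's default hypers; the subtoken embedding style is exactly the part you'd have to patch back in):

```python
# Hypothetical hyperparameter overrides; the 'Subtoken' value is a
# placeholder name for the removed subtokenization code path.
hyperparameter_overrides = {
    'cg_add_subtoken_nodes': False,               # no Cvitkovic et al. subtoken meta-nodes
    'cg_node_label_embedding_style': 'Subtoken',  # instead of 'CharCNN'
    'cg_ggnn_residual_connections': {},           # turn off residual GGNN connections
}
```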

Good luck!

haighal commented 5 years ago

Thanks for your help here! Seriously appreciate it.

Right now, I'm working with tensorise.py to process the data, and I'm getting an error in the _load_metadata_from_sample function in contextgraphmodel.py. It seems like the context graphs in the dataset don't have EdgeValues fields, so the program crashes at the line

for edge_type, values in raw_sample['ContextGraph']['EdgeValues'].items():

After poking around a bit, I'm also unsure what the cg_edge_value_sizes field of the metadata does, so I'm not sure how to patch this up. Any ideas? Did you change the way you stored the graphs between papers? An example of the program graph in the old dataset is here

mmjb commented 5 years ago

Oh, right. We nowadays support values on edges in GNNs as well (i.e., edges not only have a type, such as 'NextLexicalUse', but can also carry a value, such as the distance covered by the edge in lines). It's a purely optional feature (and most edges don't have values anyway), so you can comment this out (or set raw_sample['ContextGraph']['EdgeValues'] = {} just above the line that's crashing).
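
I.e., an (untested) one-line version of that patch in _load_metadata_from_sample:

```python
# The ICLR'18 graphs have no EdgeValues field, so treat it as optional:
raw_sample['ContextGraph'].setdefault('EdgeValues', {})
for edge_type, values in raw_sample['ContextGraph']['EdgeValues'].items():
    pass  # existing edge-value handling, now a no-op for the old data
```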

haighal commented 5 years ago

Awesome. As you mentioned in the first post, the decoder needs to look at the variable name instead of the expression that you're trying to generate. In SeqDecoder.load_metadata_from_sample, it looks at the SymbolLabels field of the sample, which doesn't exist in the graph samples I have.

From the sample graph in the repo, SymbolLabels is just keeping a count of how often each subtoken appears in an answer in order to specify the output vocabulary. The most similar thing in the dataset from ICLR '18 is the field called SymbolCandidates, which per the description is a:

List of type-correct in-scope variables that can be used in the considered slot. Each variable represented by a dictionary in which "SymbolDummyNode" identifies the node in "ContextGraph" corresponding to the candidate, "IsCorrect" marks if the candidate is the correct choice, and "SymbolName" gives the name of the variable candidate [Unneeded for task, used for debugging].

My plan is to just find the correct variable name and split it up into subtokens based on camelCase (as you say in the paper: the name inputStreamBuffer is treated as the sequence [input, stream, buffer]). This seems reasonable, but I want to get your input first to make sure that is in fact what you do. Also, is there any chance you've already written a utility that tokenizes strings into camelCase somewhere?

Looking forward to training time, there's also a Productions field that gets used in SeqDecoder.load_data_from_sample and in collect_token_seq (in model.py). This seems like an artifact of expanding your program graph based on production rules and is unnecessary for the simpler VarNaming task - can I just get rid of that bit and replace it with the actual answer tokens?

Thanks again for the help!

mmjb commented 5 years ago

Oh, I think you'll have to rewrite the data loading in SeqDecoder a bit (but it should be fairly trivial to do so) - the data for the ICLR'19 paper only had expression trees, so the data loading had to do all of this stuff to extract a sequence of tokens. There are two changes required:

  1. In load_metadata_from_sample, use dpu_utils.codeutils.split_identifier_into_parts to do the subtokenization, then use the resulting tokens to update the decoder_token_counter.

  2. In load_data_from_sample, do the same thing instead of calling collect_token_seq (and just remove collect_token_seq).
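
In sketch form (untested; raw_sample/raw_metadata are placeholders for whatever the surrounding code passes in, and I'm pulling the correct name out of SymbolCandidates as documented for your dataset):

```python
from dpu_utils.codeutils import split_identifier_into_parts

# Find the correct candidate's name in the ICLR'18 sample format:
correct_name = next(cand['SymbolName']
                    for cand in raw_sample['SymbolCandidates']
                    if cand['IsCorrect'])
name_subtokens = split_identifier_into_parts(correct_name)  # inputStreamBuffer -> ['input', 'stream', 'buffer']

# (1) In load_metadata_from_sample: count subtokens to build the output vocabulary.
raw_metadata['decoder_token_counter'].update(name_subtokens)

# (2) In load_data_from_sample: use name_subtokens as the target sequence,
#     instead of calling collect_token_seq on the Productions field.
```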

Marc

haighal commented 5 years ago

Just wanted to say thanks again for the help here - we have the pipeline working and were able to overfit a small dataset. Now training on the full dataset!

mmjb commented 5 years ago

Great, thanks for keeping me in the loop. If there's an easy way to merge the changes / some way I should point to your modified version, I'm happy to do so.

On a related note, I realised that the dataset you are using is not quite what we used for VarNaming in the paper. We released the extracted data for VarMisuse, where we have the graph for the full context minus a hole. From what I can see, you are trying to backfill the name for that hole, where the correct name is usually somewhere in the context (in other uses). For VarNaming, we considered a slightly different setting in which all uses of a variable in a method were replaced by a HOLE node, and we would use the averaged representation of these HOLE nodes to kick off the sequence decoder. This is in some ways easier than your task (because the model has access to several usages of the variable to be named), and harder in others (because in your setting, the right name appears somewhere in the context).

If you are keen on having a dataset like we used, I can probably dig out the code for the data extractor and release it here.

haighal commented 5 years ago

Yeah, if you could share that, that'd be great! Although rather than re-compiling the dataset, it might be easier just to traverse the existing graph instance and replace instances of the answer token (if that makes any sense).

My edits are in a fork of this repo, and when I get a chance I'll write a little readme explaining my changes and the steps I took to make it work.

haighal commented 5 years ago

Hey Marc,

Just following up here - would love suggestions on the easiest way to create a dataset like the one you used for VarNaming.

Also: do you remember the hyperparameters you used when training the VarNaming model (namely how many epochs)? The default patience value (5) gives up after ~20 epochs since there isn't any improvement from epochs 15-20 when I run it.

mmjb commented 5 years ago

Heya,

I've just pushed a new branch (dev/mabrocks/varnaming) with the data extractor for variable naming (copy & pasted from our internal repo from back then).

The problem with using the existing data is that the graphs we are extracting are truncated. More precisely, we have the notion of a "hole" (e.g., the removed expression), and we then copy a subgraph around this hole, based on the idea that the GNN will only use K=8 propagation steps anyway. For VarNaming, we have more holes, corresponding to the different uses of the variable, and so the graphs are richer than what you would get from just reusing the dataset we generated for expression generation...
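
The truncation itself is conceptually simple; roughly this (a sketch of the idea, not the actual C# extractor):

```python
from collections import deque

def copy_subgraph_around_holes(adjacency, hole_nodes, max_hops=8):
    # Keep only nodes within K hops of a hole, since the GGNN runs K=8 steps anyway.
    kept = set(hole_nodes)
    frontier = deque((node, 0) for node in hole_nodes)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbour in adjacency[node]:
            if neighbour not in kept:
                kept.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return kept
```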

Re hypers, @mallamanis should know the details...

mallamanis commented 5 years ago

I believe that we used the default hyperparameters here. I am sure that a more exhaustive hyper-optimization could have helped here.

Note that for variable naming we used a GRU decoder that generated the variable name per-subtoken. This isn't included in this repo.

haighal commented 5 years ago

Thanks for all the help (and the clarifications on the nuances of your dataset), guys! I'll check out what you did with the data extractor and look a little more into doing an exhaustive hyperparameter search. How often did you encounter graphs with diameter > 8 where this would make a difference?

@mallamanis - I thought the SeqDecoder in this repo did what you're saying. Did you use something different?

mmjb commented 5 years ago

Graphs with a substantial diameter are the norm -- this truncation is relevant for more or less every sample in the dataset. I'm currently at ICLR and am having a hard time doing proper hacking, but I can try to look into recreating the ICLR'18 VarNaming dataset next week.

haighal commented 5 years ago

Awesome - would be very much appreciated. Have fun at ICLR!

mallamanis commented 5 years ago

@haighal You're right! I missed that SeqDecoder could be used for variable naming too.

haighal commented 5 years ago

Great! Thanks for the confirmation.

I know it's a slightly different task than what you've published (I'm really doing variable naming with only 1 masked instance and thus a different/truncated subgraph), but I'm still hoping to get interesting results with this.

I trained the model pretty quickly, but testing is taking ages: I ran the test script with 4 parallel workers on Friday afternoon and it still isn't complete (it's now Tuesday at 4 pm here in California). Do you guys have any idea why it might be taking so long/where there might be inefficiencies in the test script? I'm not tensorizing the test graphs before running test.py, but I don't think that's the issue because model.test() seems to do that for us.

Alex

haighal commented 5 years ago

Also, after I can get it working with C# code, I'm hoping to apply a similar model to a Python dataset. This obviously limits the scope of edge types I can use, and I was planning to start with only the syntax edges (Child and NextToken) and not use variable type information (since it's unavailable in Python). Two questions on that end:

  1. What's the easiest way to remove the type hierarchy from the model? It seems like we can just set 'cg_node_type_embedding_size' = 0, but is there anywhere else in the tensorisation pipeline we'd need to make substantial modifications?

  2. How do you actually compute the NextToken edges from the AST? I found this in the paper but am still a bit unsure (do you just arbitrarily link the terminals based on the order in which they're traversed?):

We use Child edges to connect nodes according to the AST. As this does not induce an order on children of a syntax node, we additionally add NextToken edges connecting each syntax token to its successor. An example of this is shown in Fig. 2a.

Thanks!

mmjb commented 5 years ago

Getting the NextToken edges right is indeed a bit harder to do with the Python parser (in Roslyn, we can rely on the fact that the source code visitor visits things in the right order; see https://github.com/microsoft/graph-based-code-modelling/blob/a6a984b7d0b5965ff93477e502a8395dd036adf0/DataExtraction/SourceGraphExtractionUtils/Utils/SourceGraph.cs#L273).

However, Patrick, whom Miltos and I worked with on a recent ICLR paper, also released Python2Graph code, so you might want to look into simply reusing that: https://github.com/CoderPat/structured-neural-summarization/blob/master/parsers/sourcecode/barone/ast_graph_generator.py
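
If you do end up rolling your own: yes, the NextToken edges just chain the surface tokens in left-to-right source order (each token is linked to its successor). A minimal stdlib sketch of the idea (mine, not from either repo):

```python
import tokenize
from io import BytesIO

# Token kinds that aren't surface tokens and so shouldn't get NextToken edges:
SKIP = {tokenize.ENCODING, tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def next_token_edges(source: str):
    tokens = [tok.string
              for tok in tokenize.tokenize(BytesIO(source.encode('utf-8')).readline)
              if tok.type not in SKIP]
    return list(zip(tokens, tokens[1:]))  # consecutive pairs = NextToken edges
```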

Re type embeddings: Setting the size to 0 should just work, but it may require making a few things more robust in the tensorisation pipeline (i.e., access to the types would need to become optional). However, this data is just passed through, so the changes required should be fairly minor.

Marc

haighal commented 5 years ago

Perfect, thanks for sharing! And do you have any ideas on why testing might be so slow (two posts above)?

mmjb commented 5 years ago

Oh, sorry, yes. I don't have a clear idea, but I have a hunch. Concretely, I observed in the experiments for the ICLR'19 paper that the sequence decoder would get slower with time; in the end, I think I traced it to the tensorflow session collecting more and more constants or something. This problem resolved itself when I introduced per-file parallelisation (https://github.com/microsoft/graph-based-code-modelling/blob/master/Models/utils/test.py#L53) because that is creating a new model copy (with its own tf.Session) for every file that it's testing on. Now, my test files happened to be quite small; if yours are large, the problem may continue to exist for you...

So, to make that actionable: (1) Run test from the start and track how fast it's making progress. If you see it getting slower: Yay, you're running into this issue. (2) If this is indeed your problem, then instead of a principled fix, you can just take the test files, quickly split them into smaller files, and everything should magically work...
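
For (2), a throwaway helper along these lines should do (mine, not in the repo; assuming your test shards are gzipped JSON-lines, adapt if they're single JSON lists):

```python
import gzip
import itertools

def split_shard(path, lines_per_chunk=500):
    # Write path.part000, path.part001, ... so each file gets its own
    # model copy (and thus a fresh tf.Session) at test time.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        chunks = iter(lambda: list(itertools.islice(f, lines_per_chunk)), [])
        for i, chunk in enumerate(chunks):
            with gzip.open('%s.part%03d' % (path, i), 'wt', encoding='utf-8') as out:
                out.writelines(chunk)
```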

haighal commented 5 years ago

Hey Marc,

Just wanted to say thanks again for your very helpful comments here (that was indeed the issue I was having at test time). We got an implementation working on Python ASTs using only the Syntax edges and got results pretty similar to what you reported in your ablation from the ICLR '18 paper!

Accuracy@1: 34.5671%
Accuracy@3: 42.4864%
Accuracy@5: 45.6902%

The code lives at https://github.com/haighal/graph-based-code-modelling/ and has a thorough summary of both (1) our modifications to your code base and (2) the steps to generate our results. Let me know if you have any questions - my email is haighal [at] cs.stanford.edu

Cheers, Alex