tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Data details #64

Closed VHellendoorn closed 4 years ago

VHellendoorn commented 4 years ago

Hi,

Thanks for sharing your code and dataset. I was hoping to try a few other types of models on the data by reconstructing the dataset from the original Java files, but got a bit stuck trying to replicate the exact values. For instance, I end up with ca. 60K more methods from crawling the (Java) test files' ASTs than appear to have been included in the dataset (no doubt for a good reason, like interface methods).

Any chance you could add more specific details for those wanting to compare results across different preprocessing routes? The exact method FQNs (with file paths) for each sample in the train/valid/test splits would be especially valuable. Currently, the Baselines section does have links to files with the (valid/test) method names & tokens, but it's hard to align these back to the original code. Besides the FQNs, it would also be great to release the exact input and target vocabularies, since the metrics depend strongly on the latter in particular.

Thanks! -Vincent

urialon commented 4 years ago

Hi Vincent, Thank you for your interest in code2seq!

Regarding file paths - I just committed a change to the JavaExtractor that adds a "--json_output" flag to the java process. So if you add this flag to the line that runs a new java process (and then comment out the code2seq-specific preprocessing steps, from this line onwards), you will get an output where every line is a JSON object that represents a single example. Its "name" field is the target method name that should be predicted, its "textContent" field is the code of the method, and "filePath" is the file path. For example:

{"name":"saturate|child|id","textContent":"private Integer METHOD_NAME(int childId) {\n    return Math.min(childId, m_CommandLineValues.MaxChildId);\n}","filePath":"/Users/urialon/PycharmProjects/code2seq/JavaExtractor/JPredict/src/main/java/JavaExtractor/FeatureExtractor.java"}

Does that help? If you need to print additional information, you can propagate it to be a field of the ProgramFeatures class and it will be printed automatically in the JSON.
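For reference, reading that output back in Python is straightforward; something like this untested sketch (the file name is just a placeholder) should do:

import json
from collections import defaultdict

# Read the JavaExtractor output produced with --json_output:
# one JSON object per line, with "name", "textContent" and "filePath" fields.
methods_by_file = defaultdict(list)
with open("java-large.extracted.jsonl") as f:  # placeholder path
    for line in f:
        example = json.loads(line)
        methods_by_file[example["filePath"]].append(example["name"])

# Each value now lists the target method names extracted from that file,
# which you can align against the methods you crawled from the original ASTs.
for path, names in list(methods_by_file.items())[:3]:
    print(path, names)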

Regarding vocabularies - maybe they can be extracted from OpenNMT-py's saved model?

VHellendoorn commented 4 years ago

Hi Uri,

Thanks for adding that, that's definitely helpful; it looks like I can align my extracted methods quite well with yours this way. With regards to the vocabulary, I did download the model, but unpickling it gives me a series of issues related to ONMT + TorchText (specifically, the latter fixed a bug in a version that is not compatible with the former, so there is currently no way to load the model file). Before I go deeper down that rabbit hole, any chance you have the files lying around? I imagine they'd be relevant pretty regularly.

Thanks! -Vincent

urialon commented 4 years ago

Hi Vincent, I just uploaded the vocabulary file that OpenNMT-py creates when building the dataset. I think this is the same one that I used for the seq2seq baselines.

https://code2seq.s3.amazonaws.com/lstm_baseline/java-large.vocab.pt
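If it saves you some trouble, this rough sketch may be enough to inspect it, although it depends on the exact OpenNMT-py and torchtext versions (I haven't tested it against this particular file):

import torch
from collections import Counter

# Depending on the OpenNMT-py version, the .pt file holds either a dict of
# torchtext Fields or a list of (name, field/vocab) pairs for "src" and "tgt".
obj = torch.load("java-large.vocab.pt")
pairs = obj.items() if isinstance(obj, dict) else obj

for name, entry in pairs:
    # A Field keeps its Vocab in .vocab; a bare Vocab is used directly.
    vocab = getattr(entry, "vocab", entry)
    print(name, "vocab size:", len(vocab.itos))
    # vocab.freqs is a Counter of raw token counts, handy for seeing what
    # frequency threshold the size limit corresponds to.
    if isinstance(getattr(vocab, "freqs", None), Counter):
        print(vocab.freqs.most_common(5))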

Does this help? Uri

VHellendoorn commented 4 years ago

Hi Uri,

That's awesome, thanks for adding that. Looks like I'm almost set (the JSON helped me get the right alignment); just one last question: what was the vocabulary count cut-off (especially on the target side) that you used for java-large in your ICLR paper? The file you shared appears to contain the counts of all words, but I assume a cut-off (somewhere in the ~14 count range, by the looks of it) was used. It would also be helpful to know how you calculate accuracy on OOV tokens on the decoder side at inference time.

I appreciate you taking the time for all these questions! -Vincent

urialon commented 4 years ago

Hey, the preprocess.py script of OpenNMT-py that creates this vocabulary file already gets the vocabulary size as input. In other words, the vocabulary file is already cut off. I used a source vocab size of 190K and a target vocab size of 27K, to match the vocabularies of my code2seq model.
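Just to illustrate (this is not OpenNMT-py's actual code), capping the vocabulary at a fixed size implies a minimum-count threshold, which is where an estimate like ~14 comes from:

from collections import Counter

# Toy illustration of how a size cap translates into a count cut-off.
def implied_min_count(freqs: Counter, vocab_size: int) -> int:
    kept = freqs.most_common(vocab_size)
    return kept[-1][1] if kept else 0

# In practice freqs would hold the training-set target subtoken counts
# and vocab_size would be 27000.
freqs = Counter({"get": 900, "set": 700, "name": 40, "foo": 13, "bar": 2})
print(implied_min_count(freqs, 4))  # -> 13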

Regarding accuracy on OOV tokens - I'm not sure exactly what you mean. We simply compared the ground-truth subtoken strings with the predicted subtoken strings and computed precision/recall/F1. I didn't measure accuracy on OOV tokens separately.
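Roughly, the computation is something like this sketch (not the exact evaluation script in this repo; details such as handling duplicate subtokens may differ):

def subtoken_f1(references, predictions):
    # references/predictions: lists of subtoken lists, e.g. ["get", "foo"].
    tp = fp = fn = 0
    for ref, pred in zip(references, predictions):
        ref_set, pred_set = set(ref), set(pred)
        tp += len(pred_set & ref_set)   # predicted subtokens that are correct
        fp += len(pred_set - ref_set)   # predicted subtokens that are wrong
        fn += len(ref_set - pred_set)   # ground-truth subtokens that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Ground truth ["get", "foo"] vs. prediction ["get", "name"]:
print(subtoken_f1([["get", "foo"]], [["get", "name"]]))  # (0.5, 0.5, 0.5)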

By the way, I didn't use a copy mechanism in either the seq2seq baselines or my code2seq model. Adding one could improve the results of both (we actually show the effect of copying in the SLM paper, for the task of code completion). So by definition, my code2seq model couldn't predict OOV tokens. In the seq2seq baselines I did use --replace_unk, which acts as a copy mechanism, but only at test time (and is weaker than a trained copy mechanism).

I hope that helps? If you have additional questions feel free to ask. Uri

VHellendoorn commented 4 years ago

Hi Uri,

Thanks for that information, I'll stick to the same cut-offs. Regarding OOV, I only meant that predicting method names from a closed vocabulary of subtokens inevitably means that some method names at test time are not included; the natural thing to do is to count mispredictions on those as a strike against the model, which I imagine you did too.

With that, I think I've got all the information I needed, thanks again!

-Vincent

urialon commented 4 years ago

Of course, if the ground truth contained OOV subtokens (for example, ["get","foo"] where "foo" is OOV), it is automatically counted as a misprediction of our model. The seq2seq model still had a chance of predicting "foo" correctly using --replace_unk.