src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

[id2vec] Paper submission - Machine Learning for Programming (ML4P) #19

Closed marnovo closed 5 years ago

marnovo commented 6 years ago

Story: "As a source{d} engineer I want to spread the reach of our work and its influence over key influencers in the academic area of source code analysis/machine learning on source code so that we are taken more seriously by the community and become the standard for datasets/tools on this field."

Topics and format at: http://ml4p.org/

Submission site: https://easychair.org/conferences/?conf=mlp2018

Deadlines:

Tasks: Given the dataset (which is enough for testing but not enough for reliable research),

zurk commented 6 years ago

Link to the paper: https://www.overleaf.com/13635304krfvjbpvgwfd
Git link: https://git.overleaf.com/13635304krfvjbpvgwfd (I do not know how it works yet; I will check.)

zurk commented 6 years ago

Raw thoughts about experiments for embedding quality estimation

What do we want to show? We want to show that these embeddings capture some structural/linguistic/etc. information from code. In any case, we should compare our embeddings with others. The candidates are:

  1. Usual code-unrelated embeddings such as GloVe or word2vec.
  2. One-hot-encoding embeddings and other "lazy" approaches.
  3. word2vec trained on sequences of code identifiers; it is a good idea to compare everything with this too (see the sketch below).
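
A minimal sketch of baseline 3, assuming gensim 4.x; the identifier corpus here is made up for illustration:

```python
from gensim.models import Word2Vec

# Hypothetical corpus: one list of identifiers per function/file,
# e.g. extracted from ASTs in source order.
identifier_sequences = [
    ["request", "response", "status_code", "body"],
    ["node", "children", "visit", "depth"],
]

# sg=1 selects skip-gram; the hyperparameters are illustrative only.
model = Word2Vec(identifier_sequences, vector_size=100, window=5,
                 min_count=1, sg=1)
print(model.wv.most_similar("node", topn=3))
```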

Problems that we can consider and measure

  1. Next identifier name prediction.
  2. Predict the identifier's nesting level from the root node and from a leaf (we should be careful with the outcomes, since we use this information, in a sense, in the embedding calculation method).
  3. Predict tree/line/character distance between identifiers.
  4. Predict the role of the node where the identifier is used.
  5. ???

Some of these experiments can be excluded from the paper, but they will show us what this approach can and cannot capture.

We can use perplexity, but first I need to understand how to compute it.
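
For reference, a minimal sketch of the computation, with made-up probabilities: perplexity is the exponentiated average negative log-likelihood that a model assigns to held-out tokens, and lower is better.

```python
import math

# Hypothetical p(token | context) values produced by some language model.
predicted_probs = [0.2, 0.05, 0.4, 0.1]

# Average negative log-likelihood over the held-out tokens...
nll = -sum(math.log(p) for p in predicted_probs) / len(predicted_probs)
# ...and its exponent is the perplexity.
print(math.exp(nll))
```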

Problems that we can consider but not measure

  1. Look through the closest identifiers to a given one (or just cluster them all) and try to find something interesting (see the sketch after this list).
  2. Visualize via t-SNE or anything else and, again, try to find something interesting.
  3. Find outliers among project identifier names.
  4. ???
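
A minimal sketch of items 1 and 2, assuming scikit-learn and matplotlib, with a random matrix standing in for the real id2vec embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
names = [f"id_{i}" for i in range(300)]   # stand-in identifier names
emb = rng.normal(size=(300, 100))         # stand-in id2vec matrix

# Item 1: the closest identifiers to a given one, by cosine similarity.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sims = unit @ unit[0]                     # similarities to names[0]
# 5 nearest neighbors, excluding the query itself (highest similarity).
print([names[i] for i in sims.argsort()[-6:-1]])

# Item 2: 2-D t-SNE projection for visual inspection.
points = TSNE(n_components=2, perplexity=30).fit_transform(emb)
plt.scatter(points[:, 0], points[:, 1], s=5)
plt.savefig("identifier_tsne.png")
```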

Notes

For each experiment, it would be good to understand why it works or not and to explain that in the paper.

P.S.

I will keep this list updated as new ideas appear.

zurk commented 6 years ago

Papers summary [ongoing]

Swivel: Improving Embeddings by Noticing What's Missing

https://arxiv.org/pdf/1602.02215.pdf

Summary:

  1. Swivel stands for Submatrix-wise Vector Embedding Learner; it is a matrix factorization method.
  2. Its main feature is scalability: the co-occurrence matrix is sharded into smaller submatrices, each containing thousands of rows and columns, which are processed in parallel.
  3. About Skip-gram model (skip-gram negative sampling – SGNS):
    1. It implicitly factorizes a matrix whose values are related to the pointwise mutual information (PMI; see the sketch after this summary).
    2. Its training time is proportional to the size of the corpus.
  4. About GloVe approach:
    1. Works from the precomputed corpus co-occurrence statistics.
    2. Really close to the skip-gram model, but with a free bias for each word: they train not only the embedding matrix W but also a bias vector b.
    3. Training time is proportional to the number of observed co-occurrence pairs, allowing it to scale independently of corpus size.
  5. SGNS and GloVe differ only modestly in practice. However, GloVe trains only on the observed co-occurrence statistics (no negative examples), which can result in poor estimates for uncommon features.
  6. Swivel
    1. Works from co-occurrence statistics.
    2. It makes use of the fact that many co-occurrences are unobserved in the corpus.
    3. Each worker trains the part of the embedding vectors that corresponds to its submatrix (shard) of the co-occurrence matrix; the embedding vectors are shared between workers.
  7. Results: Swivel beats the others.
    1. Except for the Google analogy task, SGNS outperforms GloVe.
    2. An interesting point about datasets: in our work we can note that no such evaluation datasets exist for code yet.
    3. Nearest neighbors of some very rare words are used to show that GloVe sometimes gives strange results.
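
A minimal sketch of the PMI statistic from item 3.1, computed densely from a toy co-occurrence count matrix (the whole point of Swivel's sharding is that the real matrix does not fit on one worker):

```python
import numpy as np

# X[i, j] = co-occurrence count of word i with context word j (toy values).
X = np.array([[10.0, 2.0, 0.0],
              [2.0, 8.0, 1.0],
              [0.0, 1.0, 5.0]])

total = X.sum()
row = X.sum(axis=1, keepdims=True)
col = X.sum(axis=0, keepdims=True)

# PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ).  Unobserved pairs (count 0)
# come out as -inf: exactly the cells Swivel treats specially and GloVe
# simply ignores.
with np.errstate(divide="ignore"):
    pmi = np.log((X * total) / (row * col))
print(pmi)
```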

Suggesting Accurate Method and Class Names

http://homepages.inf.ed.ac.uk/csutton/publications/accurate-method-and-class.pdf

Summary:

  1. Relevant, but poorly written. It is understandable, but one needs to be careful in some places.
  2. Use embeddings for name prediction.
  3. Introduce a variant of the model that can propose neologisms, names that have not appeared in the training corpus.
  4. "Indeed, leading industrial experts, including Beck [9], McConnell [34], and Martin [33], have stressed the importance of identifier naming in software." -- read and link to it. extract arguments why it is so important.
  5. "Høst et al. eloquently captured their importance: “Methods are the smallest named units of aggregated behavior in most conventional programming languages and hence the cornerstone of abstraction” [26]."
  6. F1-score model accuracy is around 50-60%.
  7. It can easily guess name prefixes (like get, is, set) and some name keywords (for getPersistentManifoldPool it guesses that Manifold is likely to be included).
  8. "Any suggestion system has the potential to suffer from what we have called the “Clippy effect” [2], in which too many low-quality suggestions alienate the user."
  9. Language models (LMs) are used. An LM is a probability distribution over strings of a language; the n-gram model is an example.
  10. Potentially very useful links:
    1. Need to take a look. [25] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In ICSE, 2012.
    2. [41] T. T. Nguyen, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. A statistical semantic language model for source code. In FSE, 2013.
  11. They use what they call a log-bilinear context model. Log-bilinear models (see the sketch after this list):
    1. The basic log-bilinear model is P(t \mid c) = \mathrm{softmax}_t(s_\theta(t, c)), i.e. P(t \mid c) \propto \exp(s_\theta(t, c)), where t is what we predict and c is the context. In their case s_\theta(t, c) = \hat{r}_c^T q_t + b_t is a linear model, but it can be any function; \hat{r}_c and q_t are the context and target embeddings, and b_t is a per-target bias. If the context contains several words, a weighted sum of their embeddings is used to aggregate them.
    2. They use NCE (noise contrastive estimation) and stochastic gradient descent for all models in this paper, as the skip-gram model does.
    3. The log-bilinear context model:
      1. Long-distance context is a set of feature functions, such as “Whether any variable in the current method is named addCount”, “Whether the return type of the current method is int,” and so on.
      2. The local context includes tokens that occur both before and after t.
      3. To learn an embedding, they assign each feature function its own vector in the continuous space, in the same way as they do for tokens, and then add it to the context vector.
      4. About subtokens: they define the embedding of a target token to be the sum of the embeddings of its subtokens, and they predict the token as a sequence of subtokens with another log-bilinear model.
      5. They employ beam search to find the subtoken sequence with the highest probability.
  12. Data: https://www.githubarchive.org/ is used to collect a Java code dataset.
  13. Quality is measured only when the method actually gives a suggestion (when it is confident), so the score is biased upwards.
  14. t-SNE is used for embedding (dim=50) visualization.
  15. A lot of refs to research about names in code, including a failed one.
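
A minimal sketch of the basic log-bilinear scoring from item 11.1, with made-up dimensions and random weights; the real model adds feature functions, subtokens, and NCE training:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V, D = 1000, 50                    # vocabulary size, embedding dimension
Q = np.random.randn(V, D) * 0.01   # target embeddings q_t
R = np.random.randn(V, D) * 0.01   # context embeddings r_w
b = np.zeros(V)                    # per-target biases b_t

def predict(context_ids, weights):
    # r_hat: weighted sum of the context word embeddings.
    r_hat = sum(w * R[i] for w, i in zip(weights, context_ids))
    scores = Q @ r_hat + b         # s_theta(t, c) = r_hat^T q_t + b_t
    return softmax(scores)         # P(t | c) over the whole vocabulary

probs = predict([3, 17, 42], weights=[0.5, 0.3, 0.2])
print(probs.argmax(), probs.max())
```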

Exploring API embedding for API usages and applications

https://dl.acm.org/citation.cfm?id=3097421 https://sci-hub.tw/https://dl.acm.org/citation.cfm?id=3097421

Summary:

  1. API2vec
  2. Close to the snippet-ranger project.
  3. 3 applications:
    1. a tool that mines pairs of API elements that share the same usage relations;
    2. API2API, a tool to automatically learn the API mappings between Java and C#;
    3. code migration: it is able to migrate equivalent API usages from Java to C# with up to 90.6% recall and 87.2% precision.
  4. The sequences of API elements in API usages are natural, i.e., have high regularity, so we can learn something from them.
  5. It is based on word2vec (CBOW).
  6. API sequences are built from the AST via hand-written key rules.
  7. Testing:
    1. They use a t-test to check whether vector-space distance captures similar surroundings of API calls.
    2. They check that API calls of the same class are closer in the embedding space than calls of different classes. Outcome: within-class distances are significantly smaller.
    3. They also test that similar vector offsets reflect similar relations, and they find interesting relations.
  8. Mining API mappings between Java and C#
    1. They build a transformation matrix between the embeddings, with least-squares error (least in the sense of the closest vector to the transformed one) as the loss function and SGD as the optimizer (see the sketch after this list).
    2. They collect the dataset (Java to C# APIs) by hand via the migration tool Java2CSharp: 860 API mapping pairs.
    3. Top-k accuracy is computed as the result metric.
    4. The method correctly derives the corresponding C# .NET API 53.1% of the time with just a single suggestion; within a list of 5 suggested .NET APIs, the quality is 77.9%.
    5. In comparison with StaMiner, it is 10% better.
    6. It discovers mappings that are new to the Java2CSharp tool.
  9. They include a sensitivity analysis for the parameters.
  10. CBOW gave slightly better accuracy than skip-gram.
  11. API2API cannot handle cases with n-to-1 or 1-to-n mappings.
  12. It might not work for pairs of libraries with very different paradigms (obviously).
  13. A lot of refs about code-generation-like models.
  14. Peng et al. [32] propose a deep learning model to learn vector representations for tree-based source code.
  15. Mou et al. [33] introduce convolutional neural networks over tree structures.
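
A minimal sketch of item 8.1, solving for the transformation in closed form with least squares instead of SGD (the minimized loss is the same); all embeddings here are random stand-ins:

```python
import numpy as np

# Stand-ins for the 860 hand-collected mapping pairs: row i of `java`
# and row i of `csharp` embed APIs that are known to correspond.
rng = np.random.default_rng(0)
java = rng.normal(size=(860, 100))    # Java API embeddings
csharp = rng.normal(size=(860, 100))  # C# API embeddings

# Find W minimizing ||java @ W - csharp||^2.
W, *_ = np.linalg.lstsq(java, csharp, rcond=None)

# To suggest C# APIs for a Java one: transform its vector, then rank the
# C# embeddings by distance to it (top-k accuracy as the metric).
query = java[0] @ W
dists = np.linalg.norm(csharp - query, axis=1)
print(dists.argsort()[:5])  # indices of the 5 closest C# APIs
```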

Building Program Vector Representations for Deep Learning

https://arxiv.org/pdf/1409.3358.pdf

Short summary

  1. "In this paper, we propose a novel “coding criterion” to build program vector representations based on abstract syntax trees (ASTs)" They embed AST Nodes.
  2. They compare the deep approach with logistic regression and SVM on a classification task.
  3. https://sites.google.com/site/learnrepresent/ -- all available here.
  4. The coding criterion for program representation learning:
    1. They map symbols to a vector space. What is a symbol? It can be character-level (improper) or token-level (problems: programmers can declare their own unique identifiers, giving a long tail of infrequent identifiers), which is also improper -- we can criticize this paper on that point. So AST nodes are the best choice.
    2. "Such codes as ab and cd cannot be distinguished between each other." -- so they do not use this information at all.
    3. Statement-level, function-level, or higher granularity also has problems: it cannot be learned directly.
    4. "The idea is that the representation of a node in ASTs should be “coded” by its children’s representations via a single neural layer." They propose a model called the continuous binary tree for weighting the children's representations (see the sketch after this list).
    5. They use a tree network.
    6. Use negative sampling to avoid trivial solutions.
  5. Experiments
    1. They first evaluate the learned representations by nearest-neighbor querying and k-means clustering -- just looking and saying "OK, it is good". Meh...
    2. They feed the representations to a Tree-Based Convolutional Neural Network (TCNN) for program classification. Task: distinguish code for one algorithmic problem from another. The embeddings are used as initialization for the model parameters. Better than standard methods on top of a bag of words.
  6. Overall, it is close to Tim's project (Node2vec?) and can be useful when we continue to work on it.
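
A rough sketch of the coding criterion from item 4.4, under my reading of the paper and with the leaf-count weights simplified to a uniform 1/n: the parent's vector should be reconstructible from its children's vectors through one neural layer, where each child's weight matrix is blended from two learned matrices W_l and W_r according to its position (the "continuous binary tree"):

```python
import numpy as np

D = 30                                # node embedding dimension
rng = np.random.default_rng(0)
W_l = rng.normal(size=(D, D)) * 0.01  # "leftmost child" weight matrix
W_r = rng.normal(size=(D, D)) * 0.01  # "rightmost child" weight matrix
bias = np.zeros(D)

def code_parent(children):
    """Code a parent vector from its children's vectors, left to right."""
    n = len(children)
    out = np.zeros(D)
    for i, child in enumerate(children):
        # Position coefficient: 1.0 for the leftmost child, 0.0 for the
        # rightmost; a lone child gets an even blend.
        alpha = 0.5 if n == 1 else 1.0 - i / (n - 1)
        W = alpha * W_l + (1.0 - alpha) * W_r
        out += (W @ child) / n
    return np.tanh(out + bias)

# Training pushes code_parent(children) toward the parent's own vector on
# real AST fragments, with negative sampling to avoid trivial solutions.
children = [rng.normal(size=D) for _ in range(3)]
print(code_parent(children)[:5])
```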

Convolutional Neural Networks over Tree Structures for Programming Language Processing

https://arxiv.org/pdf/1409.5718.pdf

Using recurrent neural networks to predict next tokens in the Java solutions

http://near.ai/articles/2017-06-01-Code-Completion-Demo/

Learning Program Embeddings to Propagate Feedback on Student Code

http://proceedings.mlr.press/v37/piech15.pdf

Short summary

  1. They have too many students in massive online classes, which can range from thousands to millions, so they need to deal with that somehow.
  2. They "exploit the fact that programs are executable — that we can evaluate any piece of code on an arbitrary input (i.e., the precondition), and observe the state after (the postcondition)." So it is not static analysis.
  3. contributions
    1. a method for computing features of code that capture both functional and stylistic elements.
    2. show how our code features can be useful for automatically propagating instructor feedback to students
  4. Embedding Hoare Triples.
    1. How to represent a program in a fixed-dimensional space so that typical ML methods can be used.
    2. The program (function) embedding is a matrix that converts the embedding of the program's precondition into the embedding of its postcondition; the pre/postconditions are themselves nonlinear embeddings of the state around function f (see the sketch after this list).
    3. They use an encoder-decoder architecture for the embeddings.
    4. A pre/postcondition is the set of all variable values; it is collected for every subtree in the AST.
  5. So they build one model per problem.
  6. Good accuracy for post-conditions.
  7. They learn the feedback model as a classifier, so they have predefined feedback messages.
  8. Overall results differ from problem to problem, but they achieve decent results for such a complicated problem.
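
A minimal sketch of the core idea in item 4.2, with made-up shapes and random weights: the program state (pre/postcondition) is embedded as a vector, and a program is embedded as a matrix trained to map the precondition embedding to the postcondition embedding:

```python
import numpy as np

D = 20                               # state embedding dimension
rng = np.random.default_rng(1)

pre = rng.normal(size=D)             # embedding of the precondition state
M_program = rng.normal(size=(D, D))  # learned matrix embedding the program

# Training fits M_program (and the state encoders) so that this predicted
# vector matches the embedding of the postcondition actually observed
# when the code runs on that input.
post_predicted = np.tanh(M_program @ pre)
print(post_predicted[:5])
```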

Topic modeling of public repositories at scale using names in source code

https://arxiv.org/pdf/1704.00135.pdf This is our own work; we can mine links from it.

Learning Natural Coding Conventions (---)

http://homepages.inf.ed.ac.uk/csutton/publications/naturalize.pdf

Others

Software Defect Prediction via Convolutional Neural Network

https://pdfs.semanticscholar.org/41f6/daaa88c8c80228ac347a46bff8a6635d72d3.pdf

Semantically enhanced software traceability using deep learning techniques

They use embeddings as part of the framework. Super short, unreliable. https://arxiv.org/pdf/1710.03129.pdf#page=31

Neural Attribute Machines for Program Generation

https://arxiv.org/pdf/1705.09231.pdf (skip it, but it is just interesting)

The Duality Between Information Embedding and Source Coding With Side Information and Some Applications

http://allegro.mit.edu/pubs/posted/journal/2003-barron-chen-wornell-it.pdf

Concrete Syntax for Objects. Domain-Specific Language Embedding and Assimilation without Restrictions.

http://sci-hub.tw/https://dl.acm.org/citation.cfm?id=1029007

eiso commented 6 years ago

@vmarkovtsev in the end, is this a project that is being worked on in Q1 or pending for Q2?

vmarkovtsev commented 6 years ago

The deadline is April 15th so the answer is both.

eiso commented 6 years ago

@zurk is there a preliminary title or abstract? Based on the papers you mention, all I can figure out is that this is related to embeddings of identifiers. Is this a paper on id2vec?

zurk commented 6 years ago

@eiso, the abstract is not ready yet. You can find a link to the article plan in the first comment: https://github.com/src-d/backlog/issues/1166#issuecomment-363457661. And yes, it is about id2vec.

eiso commented 6 years ago

Just wanted to add a note here that the main co-chair of this conference is Prodo.ai; they consider us their main competitors and might be biased in the review. Just FYI.

bzz commented 6 years ago

It looks like, among the literature, the paper/results from the Sutton group (http://groups.inf.ed.ac.uk/cup/naturalize) highlighted above, "Suggesting Accurate Method and Class Names", might be the most comparable to the id2vec work.

It showcases something that later became the VarNaming task (variable name prediction) and is generally referred to as the work on "learning distributed representations of variables using all their usages to predict their names".

"Deep Learning Similarities from Different Representations of Source Code" focuses on the clone detection task but might be relevant as well (https://2018.msrconf.org/event/msr-2018-papers-deep-learning-similarities-from-different-representations-of-source-code), as it compares different representations of source code, including identifiers (section 3.2.1) and ASTs (section 3.2.2).

zurk commented 6 years ago

Thank you, @bzz! Good links and material for the introduction part of the paper.

zurk commented 6 years ago

I will attend the meetup and talk about Swivel and id2vec: https://github.com/src-d/backlog/issues/1272. The paper itself is pending for now.

vmarkovtsev commented 5 years ago

So this never happened. We submitted the paper about identifier splitting to ML4P, which was accepted, and we presented it in person with @warenlg. The id2vec paper is still fun to write, though we need to finish some old legacy work first.