src-d / reading-club

Paper reading club at source{d}

Next paper candidates: 18 Oct #83

bzz closed this issue 4 years ago

bzz commented 4 years ago

Next paper candidates

Let's propose papers to study next! All papers mentioned in the comments of this issue will be listed in the next vote.

Last session runner-up(s)

-

sara-02 commented 4 years ago

node2vec: Scalable Feature Learning for Networks Stanford, 2016

Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighbourhoods of nodes. We define a flexible notion of a node's network neighbourhood and design a biased random walk procedure, which efficiently explores diverse neighbourhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighbourhoods, and we argue that the added flexibility in exploring neighbourhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.
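For flavour: the heart of node2vec is the biased second-order random walk, controlled by a return parameter p and an in-out parameter q. A minimal Python sketch of that walk (not the authors' optimized implementation; the adjacency-dict graph and the sampling here are simplified for illustration):

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0):
    """One biased second-order random walk (Grover & Leskovec, 2016).

    adj: dict mapping node -> list of neighbour nodes (unweighted graph).
    p:   return parameter  (small p -> walk revisits the previous node).
    q:   in-out parameter  (small q -> walk explores outward, DFS-like).
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = adj[cur]
        if not neighbours:
            break
        if len(walk) == 1:                      # first step: uniform
            walk.append(random.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:                     # distance 0 from prev
                weights.append(1.0 / p)
            elif nxt in adj[prev]:              # distance 1 from prev
                weights.append(1.0)
            else:                               # distance 2 from prev
                weights.append(1.0 / q)
        walk.append(random.choices(neighbours, weights=weights)[0])
    return walk

# The collected walks are then fed to skip-gram training
# (word2vec style) to produce the node embeddings.
```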

bzz commented 4 years ago

Software Engineering for Machine Learning: A Case Study, by Microsoft, from ICSE'19

Recent advances in machine learning have stimulated widespread interest within the Information Technology sector on integrating AI capabilities into software and services. This goal has forced organizations to evolve their development processes. We report on a study that we conducted on observing software teams at Microsoft as they develop AI-based applications. We consider a nine-stage workflow process informed by prior experiences developing AI applications (e.g., search and NLP) and data science tools (e.g. application diagnostics and bug reporting). We found that various Microsoft teams have united this workflow into preexisting, well-evolved, Agile-like software engineering processes, providing insights about several essential engineering challenges that organizations may face in creating large-scale AI solutions for the marketplace. We collected some best practices from Microsoft teams to address these challenges. In addition, we have identified three aspects of the AI domain that make it fundamentally different from prior software application domains: 1) discovering, managing, and versioning the data needed for machine learning applications is much more complex and difficult than other types of software engineering, 2) model customization and model reuse require very different skills than are typically found in software teams, and 3) AI components are more difficult to handle as distinct modules than traditional software components – models may be “entangled” in complex ways and experience non-monotonic error behavior. We believe that the lessons learned by Microsoft teams will be valuable to other organizations.

ncordon commented 4 years ago

PathMiner: A Library for Mining of Path-Based Representations of Code, presented at MSR'19. The tool has since evolved from PathMiner into astminer.

One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of path-based representation – an approach consisting in representing a snippet of code as a collection of paths from its syntax tree. Such representation efficiently captures the structure of code, which, in turn, carries its semantics and other information. Building the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps build up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code.

In this paper, we present PathMiner – an open-source library for mining path-based representations of code. PathMiner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language. Preprint [https://doi.org/10.5281/zenodo.2595271]; released tool [https://doi.org/10.5281/zenodo.25952
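To make the representation concrete: a path-based representation treats a snippet as the bag of leaf-to-leaf paths through its syntax tree. A toy Python sketch of the extraction (PathMiner/astminer itself is a Kotlin library; the Node class and labels here are illustrative only):

```python
from itertools import combinations

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def leaves(node, path=()):
    """Yield (leaf, ancestor-path-from-root-including-leaf) pairs."""
    path = path + (node,)
    if not node.children:
        yield node, path
    for child in node.children:
        yield from leaves(child, path)

def ast_paths(root):
    """All leaf-to-leaf paths: up from one leaf to the lowest common
    ancestor (LCA), then down to the other leaf."""
    for (_, pa), (_, pb) in combinations(leaves(root), 2):
        lca = 0                      # first index where the paths diverge
        while lca < min(len(pa), len(pb)) and pa[lca] is pb[lca]:
            lca += 1
        up = [n.label for n in reversed(pa[lca - 1:])]   # leaf .. LCA
        down = [n.label for n in pb[lca:]]               # below LCA .. leaf
        yield up + down

# Toy tree for `x = y + 1`:
tree = Node("Assign", [
    Node("x"),
    Node("BinOp", [Node("y"), Node("+"), Node("1")]),
])
for p in ast_paths(tree):
    print(" ".join(p))   # e.g. "x Assign BinOp y"
```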

ncordon commented 4 years ago

A General Path-Based Representation for Predicting Program Properties, published at PLDI 2018 (Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation)

Predicting program properties such as names or expression types has a wide range of applications. It can ease the task of programming, and increase programmer productivity. A major challenge when learning from programs is how to represent programs in a way that facilitates effective learning.

We present a general path-based representation for learning from programs. Our representation is purely syntactic and extracted automatically. The main idea is to represent a program using paths in its abstract syntax tree (AST). This allows a learning model to leverage the structured nature of code rather than treating it as a flat sequence of tokens.

We show that this representation is general and can: (i) cover different prediction tasks, (ii) drive different learning algorithms (for both generative and discriminative models), and (iii) work across different programming languages.

We evaluate our approach on the tasks of predicting variable names, method names, and full types. We use our representation to drive both CRF-based and word2vec-based learning, for programs of four languages: JavaScript, Java, Python and C#. Our evaluation shows that our approach obtains better results than task-specific handcrafted representations across different tasks and programming languages.
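The paper's unit of representation is the path-context: a triple of one leaf's value, the syntactic path between two leaves, and the other leaf's value. A hypothetical sketch of how bags of path-contexts could drive a learner; the paper's actual models are CRFs and word2vec, so this counting scheme is only a stand-in:

```python
from collections import Counter, defaultdict

# A path-context is a triple (left leaf value, syntactic path, right leaf
# value). When predicting a variable's name, occurrences of the unknown
# variable are masked with "?".

def train(examples):
    """examples: list of (set_of_path_contexts, label).
    Returns per-label path-context counts -- a toy stand-in for the
    CRF/word2vec models used in the paper."""
    counts = defaultdict(Counter)
    for contexts, label in examples:
        for ctx in contexts:
            counts[label][ctx] += 1
    return counts

def predict(counts, contexts):
    """Score each label by how often its training path-contexts
    overlap with the query's contexts."""
    return max(counts, key=lambda lbl: sum(counts[lbl][c] for c in contexts))

model = train([
    ({("?", "Name^Assign_Call_Name", "len")}, "count"),
    ({("?", "Name^Assign_BinOp_Name", "price")}, "total"),
])
print(predict(model, {("?", "Name^Assign_Call_Name", "len")}))  # -> "count"
```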

m09 commented 4 years ago

Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices

In real-world machine learning applications, data subsets correspond to especially critical outcomes: vulnerable cyclist detections are safety-critical in an autonomous driving task, and “question” sentences might be important to a dialogue agent’s language understanding for product purposes. While machine learning models can achieve high quality performance on coarse-grained metrics like F1-score and overall accuracy, they may underperform on critical subsets—we define these as slices, the key abstraction in our approach. To address slice-level performance, practitioners often train separate “expert” models on slice subsets or use multi-task hard parameter sharing. We propose Slice-based Learning, a new programming model in which the slicing function (SF), a programming interface, specifies critical data subsets for which the model should commit additional capacity. Any model can leverage SFs to learn slice expert representations, which are combined with an attention mechanism to make slice-aware predictions. We show that our approach maintains a parameter-efficient representation while improving over baselines by up to 19.0 F1 on slices and 4.6 F1 overall on datasets spanning language understanding (e.g. SuperGLUE), computer vision, and production-scale industrial systems.
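The key abstraction is the slicing function (SF): an ordinary predicate that flags the examples belonging to a critical subset. A minimal Python sketch, assuming a dataset of dicts with a "text" field; the heuristic and the per-slice report below are illustrative, not the paper's slice-expert architecture:

```python
import numpy as np

def is_question(example):
    """Slicing function (SF): flags the 'question sentence' slice
    mentioned in the abstract. The heuristic is illustrative only."""
    return example["text"].rstrip().endswith("?")

def slice_accuracy_report(examples, preds, golds, slicing_fns):
    """Overall vs. per-slice accuracy: the gap between the two is what
    the paper's slice-expert + attention architecture aims to close."""
    preds, golds = np.asarray(preds), np.asarray(golds)
    report = {"overall": (preds == golds).mean()}
    for sf in slicing_fns:
        mask = np.array([sf(x) for x in examples], dtype=bool)
        if mask.any():
            report[sf.__name__] = (preds[mask] == golds[mask]).mean()
    return report

# e.g. slice_accuracy_report(dataset, model_preds, labels, [is_question])
```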

bzz commented 4 years ago

Assessing the Generalizability of code2vec Token Embeddings, from the upcoming ASE'19

Many Natural Language Processing (NLP) tasks, such as sentiment analysis or syntactic parsing, have benefited from the development of word embedding models. In particular, regardless of the training algorithms, the learned embeddings have often been shown to be generalizable to different NLP tasks. In contrast, despite recent momentum on word embeddings for source code, the literature lacks evidence of their generalizability beyond the example task they have been trained for. In this experience paper, we identify 3 potential downstream tasks, namely code comments generation, code authorship identification, and code clones detection, that source code token embedding models can be applied to. We empirically assess a recently proposed code token embedding model, namely code2vec's token embeddings. Code2vec was trained on the task of predicting method names, and while there is potential for using the vectors it learns on other tasks, it has not been explored in the literature. Therefore, we fill this gap by focusing on its generalizability for the tasks we have identified. Eventually, we show that source code token embeddings cannot be readily leveraged for the downstream tasks. Our experiments even show that our attempts to use them do not result in any improvements over less sophisticated methods. We call for more research into effective and general use of code embeddings.
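For context, "reusing token embeddings downstream" amounts to something like the sketch below: averaged pre-trained token vectors compared by cosine similarity as a crude clone-detection signal. The embeddings dict is a placeholder for vectors exported from code2vec; none of this is the paper's actual pipeline:

```python
import numpy as np

def snippet_vector(tokens, embeddings, dim=128):
    """Average the pre-trained vectors of a snippet's tokens
    (out-of-vocabulary tokens are skipped)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def clone_score(tokens_a, tokens_b, embeddings):
    """Cosine similarity of averaged snippet vectors: a simple way to
    reuse token embeddings for clone detection."""
    a = snippet_vector(tokens_a, embeddings)
    b = snippet_vector(tokens_b, embeddings)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```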

bzz commented 4 years ago

Going to post a poll today

bzz commented 4 years ago

the vote is up

bzz commented 4 years ago

Assessing the Generalizability of code2vec Token Embeddings was chosen 🎉