src-d / reading-club

Paper reading club at source{d}
Creative Commons Attribution Share Alike 4.0 International

Next paper candidates: 1 Nov #91

Closed: bzz closed this issue 4 years ago

bzz commented 4 years ago

Next paper candidates

Let's propose papers to study next! All papers mentioned in the comments of this issue will be listed in the next vote.

Last session runner-up(s)

bzz commented 4 years ago

Software Engineering for Machine Learning: A Case Study, by Microsoft, from ICSE'19

Recent advances in machine learning have stimulated widespread interest within the Information Technology sector on integrating AI capabilities into software and services. This goal has forced organizations to evolve their development processes. We report on a study that we conducted observing software teams at Microsoft as they develop AI-based applications. We consider a nine-stage workflow process informed by prior experiences developing AI applications (e.g., search and NLP) and data science tools (e.g., application diagnostics and bug reporting). We found that various Microsoft teams have united this workflow into preexisting, well-evolved, Agile-like software engineering processes, providing insights about several essential engineering challenges that organizations may face in creating large-scale AI solutions for the marketplace. We collected some best practices from Microsoft teams to address these challenges. In addition, we have identified three aspects of the AI domain that make it fundamentally different from prior software application domains: 1) discovering, managing, and versioning the data needed for machine learning applications is much more complex and difficult than other types of software engineering, 2) model customization and model reuse require very different skills than are typically found in software teams, and 3) AI components are more difficult to handle as distinct modules than traditional software components: models may be “entangled” in complex ways and experience non-monotonic error behavior. We believe that the lessons learned by Microsoft teams will be valuable to other organizations.

m09 commented 4 years ago

Deep Learning Anti-patterns from Code Metrics History, preprint of ICSME 2019.

Anti-patterns are poor solutions to recurring design problems. A number of empirical studies have highlighted the negative impact of anti-patterns on software maintenance, which motivated the development of various detection techniques. Most of these approaches rely on structural metrics of software systems to identify affected components, while others exploit historical information by analyzing co-changes occurring between code components. By relying solely on one aspect of software systems (i.e., structural or historical), existing approaches miss some precious information, which limits their performance. In this paper, we propose CAME (Convolutional Analysis of code Metrics Evolution), a deep-learning-based approach that relies on both structural and historical information to detect anti-patterns. Our approach exploits historical values of structural code metrics mined from version control systems and uses a Convolutional Neural Network classifier to infer the presence of anti-patterns from this information. We apply our approach to the widely known God Class anti-pattern and evaluate its performance on three software systems. With the results of our study, we show that: (1) using historical values of source code metrics increases precision; (2) CAME outperforms existing static machine-learning classifiers; and (3) CAME outperforms existing detection tools.
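For context, here is a minimal sketch (not the authors' implementation) of the general idea behind CAME: treat each code component's per-release metric values as a multi-channel time series and run a small 1D convolutional classifier over it. The metric count, layer sizes, and history length below are illustrative assumptions, not values from the paper.

```python
# Sketch, assuming PyTorch: classify a component as God Class / not God Class
# from the history of its structural code metrics.
import torch
import torch.nn as nn

class MetricHistoryCNN(nn.Module):
    def __init__(self, n_metrics: int):
        super().__init__()
        # Each structural metric (e.g. LOC, WMC, coupling) is one input channel;
        # the convolution slides over the version-history axis.
        self.conv = nn.Sequential(
            nn.Conv1d(n_metrics, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.classifier = nn.Linear(32, 1)  # one logit: anti-pattern or not

    def forward(self, x):
        # x: (batch, n_metrics, history_len) -- metric values mined per release
        h = self.conv(x).squeeze(-1)        # (batch, 32)
        return self.classifier(h)           # (batch, 1) raw logits

# Hypothetical usage: 8 metrics tracked over 20 releases, 4 example components.
model = MetricHistoryCNN(n_metrics=8)
logits = model(torch.randn(4, 8, 20))
```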

m09 commented 4 years ago

When Deep Learning Met Code Search

There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet. Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform with common training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique. Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective than more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus. The evaluation dataset is now available at arXiv:1908.09804.
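As a rough illustration of the retrieval step these approaches share (not code from the paper): once the query and the candidate snippets have been embedded into vectors by whatever unsupervised or supervised encoder, search reduces to ranking snippets by vector similarity. The 128-dimensional size, the random placeholder embeddings, and the `cosine_rank` helper below are assumptions made for the sketch.

```python
# Sketch, assuming embeddings are already computed by some encoder.
import numpy as np

def cosine_rank(query_vec: np.ndarray, code_vecs: np.ndarray) -> np.ndarray:
    """Return indices of code snippets, most similar to the query first."""
    q = query_vec / np.linalg.norm(query_vec)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity per snippet
    return np.argsort(-scores)     # descending order of similarity

# Hypothetical usage with placeholder 128-d embeddings standing in for the
# encoder's output on a corpus of 1000 snippets and one natural-language query.
rng = np.random.default_rng(0)
corpus_embeddings = rng.normal(size=(1000, 128))
query_embedding = rng.normal(size=128)
print(cosine_rank(query_embedding, corpus_embeddings)[:10])  # top-10 candidates
```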

m09 commented 4 years ago

Results are out. We'll study When Deep Learning Met Code Search :tada: