Fix code style faults using 🤖
Overview • How To Use • Science • Contributions • License
This is a collection of analyzers for Lookout - the open source framework for code review intelligence. You can run them directly on your Git repositories, but most likely you don't want that and would rather use the upcoming code review product from source{d}. Overall, this project is a mix of research ideas and their applications to solving real problems. Consider it an experiment at this stage.
Currently, the "format" analyzer is working and the "typos" analyzer is incubating. All current and future analyzers are based on machine learning and never contain any hidden domain knowledge such as static code analysis rules or human-written pattern matchers.
- lookout.style.format - mine "white box" code formatting rules with machine learning and validate new code against them.
- lookout.style.typos - find typos in identifier names, using the dataset of 60 million identifiers already present in open source repositories on GitHub.

The "format" analyzer supports only JavaScript for now, though it is not tied to that language and is based on the language-agnostic Babelfish parser. Everything is written in Python.
There are several ways to run style-analyzer:
The implemented analyzers are driven by bleeding-edge research. One day we will write papers about them, but first we want to focus on making them work. Below are brief descriptions of how the analyzers are designed.
The core of the format analyzer is a language model: we learn without labeled data, just by modeling the existing format in a repository given the current code at a given point in a file. We then check whether the proposed changes follow those learnt formatting conventions. The training algorithm is summarized below.
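As a rough intuition only, and not the project's actual training algorithm, the idea of learning formatting conventions from existing code and then checking new code against them can be sketched with a simple majority vote. Everything here is an assumption made for illustration: the names `learn_rules` and `check`, the `(previous token, next token)` context, and the `min_support` and `min_confidence` thresholds are hypothetical, and the majority vote stands in for the real learned model.

```python
from collections import Counter, defaultdict


def learn_rules(samples, min_support=2, min_confidence=0.8):
    """Learn one formatting rule per context from unlabeled observations.

    samples: iterable of (context, formatting) pairs, where context is a
    hypothetical (previous token, next token) tuple and formatting is the
    whitespace observed between the two tokens in the existing code.
    """
    counts = defaultdict(Counter)
    for context, formatting in samples:
        counts[context][formatting] += 1

    rules = {}
    for context, counter in counts.items():
        formatting, hits = counter.most_common(1)[0]
        total = sum(counter.values())
        # Keep only conventions that are both frequent and consistent.
        if total >= min_support and hits / total >= min_confidence:
            rules[context] = formatting
    return rules


def check(rules, samples):
    """Return (context, found, expected) for every sample that breaks a rule."""
    return [
        (context, formatting, rules[context])
        for context, formatting in samples
        if context in rules and rules[context] != formatting
    ]


# Observations taken from the existing code of an imaginary repository.
existing = (
    [(("if", "("), " ")] * 9 + [(("if", "("), "")]   # 90% "if (" with a space
    + [((")", "{"), " ")] * 5                         # always ") {"
    + [((",", "x"), " "), ((",", "x"), "")]           # no clear convention
)
rules = learn_rules(existing)

# Observations taken from a proposed change.
proposed = [(("if", "("), ""), ((")", "{"), " "), ((",", "x"), "")]
violations = check(rules, proposed)
```

In this toy, only `if(` written without a space is reported, because it is the only sample that contradicts a convention the existing code follows consistently. Contexts without a clear convention produce no rule and therefore no report.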
The application algorithm is much simpler - we take the rules and apply them. However, there are several quirks:
We take the dataset of identifiers extracted from Public Git Archive and split them into "atoms" (a blog post about this is pending in early November). Each "atom" comes with its frequency, so we treat the most frequent ones as ground truth. For each checked "atom", we take its embedding computed with fastText, refine it with a deep fully-connected neural network, generate candidates with SymSpell, and rank them with XGBoost.
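The candidate generation and ranking steps can be illustrated with a small, dependency-free sketch. It is a simplification under stated assumptions: the delete-based index imitates the symmetric-delete idea behind SymSpell for a single deletion only, ranking by raw frequency stands in for the XGBoost ranker, and the embedding and neural network steps are omitted entirely. The function names and the frequency table are hypothetical.

```python
from collections import defaultdict


def deletes(word):
    """All strings obtained by removing exactly one character from word."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(frequencies):
    """Map every ground-truth atom and each of its one-character deletions
    back to the atoms they came from."""
    index = defaultdict(set)
    for atom in frequencies:
        index[atom].add(atom)
        for variant in deletes(atom):
            index[variant].add(atom)
    return index


def suggest(atom, frequencies, index, top=3):
    """Return up to `top` correction candidates, most frequent first.

    Atoms already in the ground truth are considered correct. Candidates
    are not verified with a true edit distance, unlike a full implementation.
    """
    if atom in frequencies:
        return []
    candidates = set()
    for key in {atom} | deletes(atom):
        candidates |= index.get(key, set())
    # Stand-in ranker: higher frequency first, ties broken alphabetically.
    return sorted(candidates, key=lambda c: (-frequencies[c], c))[:top]


# Hypothetical frequencies of the most common atoms (the "ground truth").
frequencies = {"value": 900, "values": 400, "index": 700, "length": 500}
index = build_index(frequencies)

print(suggest("lenght", frequencies, index))
print(suggest("valus", frequencies, index))
print(suggest("index", frequencies, index))
```

Indexing deletions of the vocabulary once keeps lookups cheap: a query only needs its own handful of deletions rather than every possible insertion or substitution.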
Contributions are very welcome and desired! Please follow the code of conduct and read the contribution guidelines. If you want to add a new cool style fixer backed by machine learning, it is always a good idea to discuss it on Slack.
AGPL-3.0, see LICENSE.md.