[PROPOSAL] Source code transformations

bzz commented 6 years ago

Idea comes from https://github.com/src-d/blog/pull/233#discussion_r209758185

Title: Source code match/traverse/transform APIs
Author(s): Alex, ?
Short description: An overview of existing approaches to match/traverse/transform source code
Categories: language analysis
Deadlines: no

Management

This section will be filled by @campoy.

State: (proposed | writing | written | published)
Scheduled:
Link to post:

Social Media

Wording for tweet:
Hashtags:
Subreddits:

NOTE Please write in short lines so the review is easier to do.

Preliminary content comes from the previous blog post:

```
{{% center %}} … {{% /center %}}

## Technical details

Based on the internal success story of the C++ [ClangMR tool](https://research.google.com/pubs/pub41342.html)
for matching, traversing, and transforming Abstract Syntax Trees (ASTs) at scale,
similar tooling was built for Java.

{{% youtube ZpvvmvITOrk %}}

Project [Error-Prone](https://github.com/google/error-prone) is a compiler extension
that can perform arbitrary analysis on the *fully typed AST*.
Note that such input cannot be obtained from a parser alone,
even one as advanced as [babelfish](https://doc.bblf.sh/).
Running a full build is required for things like symbol resolution.
In the end, after running a number of checker plugins,
Error-Prone outputs simple text replacements with suggested code fixes.

The project is open source and well documented
in a [number](https://research.google.com/pubs/pub38275.html)
of [papers](https://research.google.com/pubs/pub41876.html).

Another, closed-source tool called JavacFlume was built
to scale the application of those fixes to the whole codebase.
I would guess it looks something like an Apache Spark job
that applies patches in some generic format.

Here is an example of how the full pipeline looks for C++:

{{% grid %}}
{{% caption src="https://cdn-images-1.medium.com/max/4224/1*KpJ5fj4njR1HTDfzhLCQkg.png" title="ClangMR processing pipeline illustration"%}}
“Large-Scale Automated Refactoring Using ClangMR” by
[Hyrum Wright](https://research.google.com/pubs/HyrumWright.html),
Daniel Jasper, Manuel Klimek,
[Chandler Carruth](https://research.google.com/pubs/ChandlerCarruth.html),
Zhanyong Wan
{{% /caption %}}
{{% /grid %}}

Although it is not disclosed, an attentive reader might have noticed
that the **Compilation Index** part of the pipeline is very similar to
a [Compilation Database](https://kythe.io/docs/kythe-compilation-database.html)
in the open source Kythe project.

It might be interesting to take a closer look at an example of an API
for AST query and transformation in C++.

### C++ Example

> *rename all calls to Foo::Bar with 1 argument to Foo::Baz, independent of the name of the instance variable,
> or whether it is called directly or by pointer or reference*

{{% grid %}}
{{% grid-cell %}}
![API example: invoke a callback function on call to Foo::Bar](https://cdn-images-1.medium.com/max/2000/1*vOYemTlJ2QZyzXvizSy5Og.png)
{{% /grid-cell %}}
{{% grid-cell %}}
This fragment invokes a callback function on every call to *Foo::Bar*
with a single argument.
{{% /grid-cell %}}
{{% /grid %}}

{{% grid %}}
{{% grid-cell %}}
![API example: replace matching text of the function name with the "Baz"](https://cdn-images-1.medium.com/max/2116/1*JiUgO-gimsIi2JpRB9LYeg.png)
{{% /grid-cell %}}
{{% grid-cell %}}
This callback generates a code transformation:
for each matched node it replaces the text of the function name with “Baz”.

Regarding code transformations in Java,
**Error-Prone** has a similar low-level [patching API](http://errorprone.info/docs/patching)
that is very close to the native AST manipulation API.
It was found to have a steep learning curve, similar to Clang's,
and thus poses a high entry barrier:
even an experienced engineer would need a few weeks
before becoming productive at creating fix suggestions or refactorings.
{{% /grid-cell %}}
{{% /grid %}}

That is why a higher-level API was built for Java:
first as the separate [Refaster](https://research.google.com/pubs/pub41876.html) project,
later [integrated into Error-Prone](http://errorprone.info/docs/refaster).

A usual workflow looks like this:
after running all the checks and emitting a collection of suggested fixes,
shard the diffs into smaller patches,
run all the tests over the changes
and, if they pass, submit the patches for code review.

{{% center %}} … {{% /center %}}

{{% center %}}
##### Thank you for reading, stay tuned and keep your codebase healthy!
{{% /center %}}
```
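
For readers without a C++ toolchain, the match-then-rewrite idea from the C++ example in the draft can be approximated with Python's standard `ast` module. This is only an illustrative sketch, not the ClangMR API: the class name `RenameBarToBaz` and the helper `rename_calls` are made up for this example, and because Python's AST is untyped, the receiver cannot be checked to really be a `Foo` (which echoes the draft's point about needing a fully typed AST).

```python
import ast


class RenameBarToBaz(ast.NodeTransformer):
    """Rename method calls `<expr>.Bar(arg)` with exactly one argument to `<expr>.Baz(arg)`."""

    def visit_Call(self, node: ast.Call) -> ast.Call:
        # Visit children first so that nested calls are rewritten too.
        self.generic_visit(node)
        if (
            isinstance(node.func, ast.Attribute)  # a method-style call: <expr>.name(...)
            and node.func.attr == "Bar"
            and len(node.args) == 1
            and not node.keywords
        ):
            # Assumption: any receiver counts. An untyped AST cannot tell
            # whether the receiver is actually an instance of Foo.
            node.func.attr = "Baz"
        return node


def rename_calls(source: str) -> str:
    """Parse the source, apply the rename, and print the tree back as code."""
    tree = RenameBarToBaz().visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+
```

Note that `ast.unparse` regenerates the whole file and drops comments and formatting, which is one reason tools like ClangMR and Error-Prone emit targeted text replacements instead.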

campoy commented 5 years ago

Hey Alex, maybe I'm lacking knowledge here but the title doesn't mean anything to me. Could you make it more beginner friendly?

bzz commented 5 years ago

Thank you for the feedback, Francesc! It's totally WIP, as I'm just gaining confidence with the existing tools in this field.

The plan is basically to cover some "state of the art" tools for AST transformation (AKA refactoring), so the learnings could be applied to Bblfsh UAST manipulation API.

How about a title along the lines of "Source code transformations"?

OSS:

Golang: go fix / gofmt -r
C++: clang-tidy
C: Coccinelle
Java: JTransformer
Example-based refactorings: Java: Error-Prone / Golang: eg
Python: Bowler, google/pasta
Multi-language: https://comby.dev

Proprietary/from talks or papers (material)

ClangMR/JavacFlume
Semmle QL (only query)

campoy commented 5 years ago

Source code transformations makes it much more clear to me, yeah. Let me know when you have a draft of the blog so I can review.

I'd be curious to see if we can make it so the blog doesn't feel like a series of tools, and instead there's a story tying everything up.

bzz commented 5 years ago

> I'd be curious to see if we can make it so the blog doesn't feel like a series of tools, and instead there's a story tying everything up.

That is very useful feedback, thank you. Please let me think more about it. I expect that even an initial draft will take some time, but I will post it here ASAP.

Thanks again.

bzz commented 5 years ago

@campoy One story I can think of is:

take simple-but-educational example(s) of some issue in the code as motivation, and then go through implementing, in each of those systems:

code to detect it
code to suggest a fix for it

Due to differences in host languages it could be hard to pick a single example, so it can be adjusted a bit for each specific language, keeping it sufficiently high-level.
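
To make the detect-then-suggest-a-fix structure concrete, here is a minimal sketch using only Python's standard `ast` module. The motivating issue (comparing to `None` with `==` or `!=`) and the names `Suggestion` and `find_none_comparisons` are chosen purely for illustration; none of the tools listed above is used here, and each of them would express this differently.

```python
import ast
from typing import List, NamedTuple


class Suggestion(NamedTuple):
    line: int          # 1-based line of the offending expression
    original: str      # source text that was matched
    replacement: str   # suggested replacement text


def find_none_comparisons(source: str) -> List[Suggestion]:
    suggestions = []
    for node in ast.walk(ast.parse(source)):
        # Detect: a single `==` or `!=` whose right-hand side is the literal None.
        if not (isinstance(node, ast.Compare) and len(node.ops) == 1):
            continue
        op, right = node.ops[0], node.comparators[0]
        if not isinstance(op, (ast.Eq, ast.NotEq)):
            continue
        if not (isinstance(right, ast.Constant) and right.value is None):
            continue
        # Suggest: the same comparison written with `is` / `is not`.
        new_op = ast.Is() if isinstance(op, ast.Eq) else ast.IsNot()
        fixed = ast.Compare(left=node.left, ops=[new_op], comparators=[right])
        suggestions.append(
            Suggestion(
                line=node.lineno,
                original=ast.get_source_segment(source, node),
                replacement=ast.unparse(fixed),  # requires Python 3.9+
            )
        )
    return sorted(suggestions)
```

The result is a list of plain text replacements rather than a rewritten file, so a separate step (or a reviewer) can decide whether to apply each one.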

A nice 🍒 on top could be finishing with a link to a blog post on "how to wrap it as a lookout analyzer" from #249.

WDYT?

campoy commented 5 years ago

I like it. Even if we find an example that only works for a specific language, it should be easy to get people from other language communities to understand the point of the article.

kuba-- commented 5 years ago

Refactoring Prolog code: https://pdfs.semanticscholar.org/b48b/bc30427ef7429db83e190f91a579442121b6.pdf

vcoisne commented 5 years ago

@bzz did you get a chance to start a draft ?

bzz commented 5 years ago

Very preliminary: this is fairly ambitious and requires a lot of research. I would expect a shareable draft early next year.

vcoisne commented 5 years ago

@bzz Trying to plan our blog schedule for the upcoming weeks. Did you get a chance to work on this draft ?

bzz commented 5 years ago

@vcoisne I made some progress on the research, but I'm not there yet. I will ping you as soon as I have some results to share!

bzz commented 4 years ago

This is still in my backlog.

Two more interesting contenders added to the description:

https://github.com/google/pasta for Python
https://comby.dev for assembly, Bash, C/C++, C#, Clojure, CSS, Dart, Elm, Elixir, Erlang, Fortran, F#, Go, Haskell, HTML/XML, Java, JavaScript/TypeScript, JSON, Julia, LaTeX, Lisp, OCaml, Pascal, PHP, Python, Ruby, Rust, Scala, SQL, Swift, Text

src-d / blog

[PROPOSAL] Source code transformations #241
