
A neural source code (re)formatter #289


KOLANICH commented 3 years ago

Project description

clang-format feels like a piece of shit: it is not fully adjustable. For example, I haven't found any way to configure several aspects of formatting, such as disabling the insertion of a space between `)` and `{`. So we need a reformatter that:

  1. Would be fully configurable.
  2. Would be able to infer a formatting style from examples, store it, and reuse it later.
  3. Would be introspectable.
  4. Would not require much human effort to encode ways of formatting source code.

It is proposed to use a neural network for that.

  1. Source code is parsed into a CST (concrete syntax tree, a tree that combines the high-level info of an AST with the low-level info of a parse tree) with some tool;
  2. The CST is walked and transformed into a sequence of tokens, keeping the associated info (token kind);
  3. The sequence is postprocessed (see the tokenization sketch after this list):
    1. Each whitespace (space, tab) in the sequence of tokens is converted into its own token type.
    2. User-customizable tokens (identifiers, literals - the ones where a user can type an arbitrary component; we call that component a payload) are each assigned a sequence number, and the payload is moved into a separate array.
    3. All token types are encoded in a semantically meaningful way into vectors of features. A vector should contain at least the following info: whether the token is user-customizable, whether a whitespace is required after the token, and whether a whitespace is required before it.
    4. The sequence of parser-specific tokens is thus converted into 2 sequences:
      • a sequence of token types
      • a sequence of payloads
  4. A sequence encoding the meaning is created. It is an algorithmically normalized version of the original sequence:
    1. All whitespace is stripped.
    2. All tokens that encode meaning but may in some cases be omitted are inserted, so the normalized form is canonical.
  5. A neural model encoding and decoding style is trained on the sequences of raw token types (see the model sketch after this list).
    1. The encoder:
      1. the input is the sequence of token types in the original source;
      2. the output is the normalized sequence and the style vector.
    2. The decoder:
      1. the input is the style vector and the normalized sequence;
      2. the output is a sequence of token types.
    3. The loss is the sum of the following components:
      1. a loss between the original sequence and the output sequence of the decoder;
      2. a loss between the encoder's normalized-sequence output and the algorithmically normalized sequence;
      3. L1 regularization on the style feature vector;
      4. an ICA-like independence penalty on the style feature vector.
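
As a rough illustration of steps 1-4, here is a minimal sketch in Python. The stdlib `tokenize` module stands in for a real CST parser (a real implementation would rather use something like tree-sitter or libclang, which keep the tree structure), and all the names here (`to_sequences`, `normalize`, the `STYLE_ONLY` set) are illustrative assumptions, not a fixed design:

```python
# Sketch of steps 1-4: token-kind sequence, payload array, normalization.
# Python's tokenize module stands in for a real CST parser here.
import io
import keyword
import tokenize

PAYLOAD_KINDS = {tokenize.NAME, tokenize.NUMBER, tokenize.STRING}

def to_sequences(source: str):
    """Steps 2-3: flatten source into a token-kind sequence + payload array."""
    kinds, payloads = [], []
    prev_end = (1, 0)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Step 3.1: materialize inter-token whitespace as its own token type.
        if tok.start[0] == prev_end[0] and tok.start[1] > prev_end[1]:
            kinds.append("WHITESPACE")
        if tok.type in PAYLOAD_KINDS and not keyword.iskeyword(tok.string):
            # Step 3.2: the user-typed payload goes into a separate array;
            # the kind sequence keeps only the type and a sequence number.
            kinds.append((tokenize.tok_name[tok.type], len(payloads)))
            payloads.append(tok.string)
        else:
            # Keywords keep their text; punctuation uses its exact type
            # (e.g. LPAR, RPAR) so no meaning is lost.
            if keyword.iskeyword(tok.string):
                kinds.append(tok.string)
            else:
                kinds.append(tokenize.tok_name[tok.exact_type])
        prev_end = tok.end
    return kinds, payloads

# Step 4.1: strip style-only tokens to get the "meaning" sequence.
# (Step 4.2, inserting optional meaning-carrying tokens, needs
# language-specific knowledge and is omitted here.)
STYLE_ONLY = {"WHITESPACE", "NL", "COMMENT"}

def normalize(kinds):
    return [k for k in kinds if k not in STYLE_ONLY]
```

For example, `to_sequences("x = foo(1)\n")` yields the payload array `['x', 'foo', '1']` and the kind sequence `[('NAME', 0), 'WHITESPACE', 'EQUAL', 'WHITESPACE', ('NAME', 1), 'LPAR', ('NUMBER', 2), 'RPAR', 'NEWLINE', 'ENDMARKER']`.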
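And a hypothetical PyTorch sketch of the model and loss in step 5. Everything here is an assumption: the GRU architecture, layer sizes, and loss weights are placeholders; both sequences are assumed padded to one common length so the shapes line up; and the ICA term is approximated by an off-diagonal covariance penalty, which only enforces decorrelation, not full statistical independence:

```python
# Hypothetical sketch of step 5: a style autoencoder over token-type IDs.
# Assumes raw_seq and norm_seq are LongTensors of shape (batch, T),
# padded to the same length T; all hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleCodec(nn.Module):
    def __init__(self, n_types: int, d_model: int = 128, d_style: int = 32):
        super().__init__()
        self.embed = nn.Embedding(n_types, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_style = nn.Linear(d_model, d_style)    # 5.1: style vector
        self.to_norm = nn.Linear(d_model, n_types)     # 5.1: normalized seq
        self.decoder = nn.GRU(d_model + d_style, d_model, batch_first=True)
        self.to_tokens = nn.Linear(d_model, n_types)   # 5.2: styled tokens

    def forward(self, raw_seq, norm_seq):
        # 5.1: encode the raw (styled) token types.
        h, last = self.encoder(self.embed(raw_seq))
        style = self.to_style(last[-1])        # one style vector per sample
        norm_logits = self.to_norm(h)          # predicted normalized sequence
        # 5.2: decode the normalized sequence conditioned on the style vector.
        style_rep = style.unsqueeze(1).expand(-1, norm_seq.size(1), -1)
        out, _ = self.decoder(torch.cat([self.embed(norm_seq), style_rep], -1))
        return self.to_tokens(out), norm_logits, style

def loss_fn(tok_logits, norm_logits, style, raw_seq, norm_seq,
            l1_w=1e-3, ica_w=1e-2):
    # 5.3.1: reconstruct the original styled sequence from the decoder.
    rec = F.cross_entropy(tok_logits.transpose(1, 2), raw_seq)
    # 5.3.2: encoder's normalized output vs. the algorithmic normalization.
    norm = F.cross_entropy(norm_logits.transpose(1, 2), norm_seq)
    # 5.3.3: L1 regularization keeps the style vector sparse.
    sparsity = style.abs().mean()
    # 5.3.4: stand-in for the ICA term -- penalize off-diagonal covariance
    # of the style features across the batch (decorrelation only).
    centered = style - style.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / max(style.size(0) - 1, 1)
    ica = (cov - torch.diag(torch.diag(cov))).pow(2).sum()
    return rec + norm + l1_w * sparsity + ica_w * ica
```

At inference time, the style vector extracted from one codebase could then be paired with the normalized sequence of another file to restyle it, which is what requirement 2 asks for.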

Relevant Technology

Complexity and required time

Complexity

Required time (ETA)

Categories