
A neural source code (re)formatter #289


KOLANICH commented 3 years ago

Project description

clang-format feels like a piece of shit: it is not fully adjustable. For example, I haven't found any way to configure several aspects of formatting, such as disabling the insertion of a space between `)` and `{`. So we need a reformatter that:

  1. Would be fully configurable.
  2. Would be able to infer a formatting style from examples, store it, and reuse it later.
  3. Would be introspectable.
  4. Would not require much human effort to encode ways of formatting source code.

It is proposed to use a neural network for that.

  1. Source code is parsed into a CST (concrete syntax tree, a tree that combines the high-level info of an AST with the low-level info of a parse tree) with some tool;
  2. The CST is walked and transformed into a sequence of tokens, keeping the associated info (token kind);
  3. The sequence is postprocessed (see the tokenization sketch after this list):
    1. Each whitespace (space, tab) in the sequence of tokens is converted into its own token type.
    2. User-customizable tokens (identifiers, literals - the ones where a user can type an arbitrary component; we call that component a payload) are each assigned a sequence number, and the payload is moved into a separate array.
    3. All token types are encoded in a semantically meaningful way into vectors of features. A vector should contain at least the following info: whether the token is user-customizable, whether a whitespace is required after the token, and whether a whitespace is required before it.
    4. The sequence of parser-specific tokens is thus converted into 2 sequences:
      • a sequence of token types
      • a sequence of payloads
  4. A sequence encoding the meaning is created. It is an algorithmically normalized version of the original sequence:
    1. All whitespace is stripped.
    2. All tokens that encode meaning but may in some cases be omitted are inserted, so the normalized form is canonical.
  5. A neural model encoding and decoding style is trained on the sequences of raw token types (see the model sketch after this list).
    1. The encoder:
      1. the input is the sequence of token types in the original source;
      2. the output is the normalized sequence and the style vector.
    2. The decoder:
      1. the input is the style vector and the normalized sequence;
      2. the output is a sequence of token types.
    3. The loss is the sum of the following components:
      1. a loss between the original sequence and the output sequence of the decoder;
      2. a loss between the encoder's normalized-sequence output and the algorithmically normalized sequence;
      3. L1 regularization on the style feature vector;
      4. an ICA-like independence penalty on the style feature vector.
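
As a rough illustration of steps 1-4, here is a minimal sketch in Python. The stdlib `tokenize` module stands in for a real CST parser (a real implementation would rather use something like tree-sitter or libclang, which keep the tree structure), and all the names here (`to_sequences`, `normalize`, the `STYLE_ONLY` set) are illustrative assumptions, not a fixed design:

```python
# Sketch of steps 1-4: token-kind sequence, payload array, normalization.
# Python's tokenize module stands in for a real CST parser here.
import io
import keyword
import tokenize

PAYLOAD_KINDS = {tokenize.NAME, tokenize.NUMBER, tokenize.STRING}

def to_sequences(source: str):
    """Steps 2-3: flatten source into a token-kind sequence + payload array."""
    kinds, payloads = [], []
    prev_end = (1, 0)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Step 3.1: materialize inter-token whitespace as its own token type.
        if tok.start[0] == prev_end[0] and tok.start[1] > prev_end[1]:
            kinds.append("WHITESPACE")
        if tok.type in PAYLOAD_KINDS and not keyword.iskeyword(tok.string):
            # Step 3.2: the user-typed payload goes into a separate array;
            # the kind sequence keeps only the type and a sequence number.
            kinds.append((tokenize.tok_name[tok.type], len(payloads)))
            payloads.append(tok.string)
        else:
            # Keywords keep their text; punctuation uses its exact type
            # (e.g. LPAR, RPAR) so no meaning is lost.
            if keyword.iskeyword(tok.string):
                kinds.append(tok.string)
            else:
                kinds.append(tokenize.tok_name[tok.exact_type])
        prev_end = tok.end
    return kinds, payloads

# Step 4.1: strip style-only tokens to get the "meaning" sequence.
# (Step 4.2, inserting optional meaning-carrying tokens, needs
# language-specific knowledge and is omitted here.)
STYLE_ONLY = {"WHITESPACE", "NL", "COMMENT"}

def normalize(kinds):
    return [k for k in kinds if k not in STYLE_ONLY]
```

For example, `to_sequences("x = foo(1)\n")` yields the payload array `['x', 'foo', '1']` and the kind sequence `[('NAME', 0), 'WHITESPACE', 'EQUAL', 'WHITESPACE', ('NAME', 1), 'LPAR', ('NUMBER', 2), 'RPAR', 'NEWLINE', 'ENDMARKER']`.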
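And a hypothetical PyTorch sketch of the model and loss in step 5. Everything here is an assumption: the GRU architecture, layer sizes, and loss weights are placeholders; both sequences are assumed padded to one common length so the shapes line up; and the ICA term is approximated by an off-diagonal covariance penalty, which only enforces decorrelation, not full statistical independence:

```python
# Hypothetical sketch of step 5: a style autoencoder over token-type IDs.
# Assumes raw_seq and norm_seq are LongTensors of shape (batch, T),
# padded to the same length T; all hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleCodec(nn.Module):
    def __init__(self, n_types: int, d_model: int = 128, d_style: int = 32):
        super().__init__()
        self.embed = nn.Embedding(n_types, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_style = nn.Linear(d_model, d_style)    # 5.1: style vector
        self.to_norm = nn.Linear(d_model, n_types)     # 5.1: normalized seq
        self.decoder = nn.GRU(d_model + d_style, d_model, batch_first=True)
        self.to_tokens = nn.Linear(d_model, n_types)   # 5.2: styled tokens

    def forward(self, raw_seq, norm_seq):
        # 5.1: encode the raw (styled) token types.
        h, last = self.encoder(self.embed(raw_seq))
        style = self.to_style(last[-1])        # one style vector per sample
        norm_logits = self.to_norm(h)          # predicted normalized sequence
        # 5.2: decode the normalized sequence conditioned on the style vector.
        style_rep = style.unsqueeze(1).expand(-1, norm_seq.size(1), -1)
        out, _ = self.decoder(torch.cat([self.embed(norm_seq), style_rep], -1))
        return self.to_tokens(out), norm_logits, style

def loss_fn(tok_logits, norm_logits, style, raw_seq, norm_seq,
            l1_w=1e-3, ica_w=1e-2):
    # 5.3.1: reconstruct the original styled sequence from the decoder.
    rec = F.cross_entropy(tok_logits.transpose(1, 2), raw_seq)
    # 5.3.2: encoder's normalized output vs. the algorithmic normalization.
    norm = F.cross_entropy(norm_logits.transpose(1, 2), norm_seq)
    # 5.3.3: L1 regularization keeps the style vector sparse.
    sparsity = style.abs().mean()
    # 5.3.4: stand-in for the ICA term -- penalize off-diagonal covariance
    # of the style features across the batch (decorrelation only).
    centered = style - style.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / max(style.size(0) - 1, 1)
    ica = (cov - torch.diag(torch.diag(cov))).pow(2).sum()
    return rec + norm + l1_w * sparsity + ica_w * ica
```

At inference time, the style vector extracted from one codebase could then be paired with the normalized sequence of another file to restyle it, which is what requirement 2 asks for.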

Relevant Technology

Complexity and required time

Complexity

Required time (ETA)

Categories