Purpose

The purpose of this project is to use supervised machine learning techniques to determine which features are most predictive of the authorship of ABAP code.

Code authorship identification is not a new problem domain in machine learning. Techniques and experiments have been designed to extract relevant style and structure features from Java, C-family, and Python code. This article provides an overview of the problem-solving methodology applied in various experiments over the years.

From the article:

Many software products that solve problems associated with the style of writing code are based on the use of various methods of machine learning. Traditional methodology for obtaining software for the task, used in this area, usually involves the following steps:

Extracting software metrics that could define an author’s style

Filtering metrics and highlighting the really significant ones

Choosing a machine learning model for classifying and training the model using selected metrics

The application of the model is based on the selection of an already filtered set of metrics.

(p.3 of pdf)
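
The four steps quoted above can be sketched end to end in a few lines of Python. This is only an illustration under my own assumptions: the metric set in `extract_metrics`, the variance threshold in `select_metrics`, and the nearest-centroid classifier are all hypothetical choices, not something taken from the article, and the two "authors" are made-up toy samples.

```python
import math
import re
from statistics import mean, pvariance


def extract_metrics(source):
    """Step 1: compute a few style metrics (this metric set is hypothetical)."""
    lines = source.splitlines() or [""]
    non_empty = [ln for ln in lines if ln.strip()]
    words = re.findall(r"[A-Za-z_]\w*", source)
    # ABAP full-line comments start with "*" in column 1; '"' starts an inline comment.
    comments = [ln for ln in non_empty
                if ln.startswith("*") or ln.lstrip().startswith('"')]
    return {
        "avg_line_length": mean(len(ln) for ln in non_empty) if non_empty else 0.0,
        "comment_ratio": len(comments) / max(len(non_empty), 1),
        "uppercase_word_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        "blank_line_ratio": (len(lines) - len(non_empty)) / len(lines),
    }


def select_metrics(rows, min_variance=1e-6):
    """Step 2: drop metrics that barely vary across the training samples."""
    return [name for name in rows[0]
            if pvariance([row[name] for row in rows]) > min_variance]


def train_centroids(rows, labels, selected):
    """Step 3: 'train' a nearest-centroid model, one mean vector per author."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append([row[name] for name in selected])
    return {label: [mean(col) for col in zip(*vectors)]
            for label, vectors in grouped.items()}


def predict(row, centroids, selected):
    """Step 4: apply the model using only the already filtered metrics."""
    vector = [row[name] for name in selected]
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Toy training data; the authors and snippets are invented for illustration.
samples = [
    ("alice", "DATA lv_a TYPE i.\nlv_a = 1.\nWRITE lv_a.\n"),
    ("alice", "DATA lv_b TYPE i.\nlv_b = 2.\nWRITE lv_b.\n"),
    ("bob", "data lv_counter_value type i. lv_counter_value = 1. write lv_counter_value.\n"),
    ("bob", "data lv_total_amount type i. lv_total_amount = 2. write lv_total_amount.\n"),
]
rows = [extract_metrics(source) for _, source in samples]
labels = [author for author, _ in samples]
selected = select_metrics(rows)
centroids = train_centroids(rows, labels, selected)

unknown = "DATA lv_c TYPE i.\nlv_c = 3.\nWRITE lv_c.\n"
print(predict(extract_metrics(unknown), centroids, selected))
```

The metrics are not rescaled here, so a metric with a large range (such as line length) dominates the distance. A real experiment would likely standardize the metrics and use a stronger classifier.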

Methodology

The methods involve extracting features from the input "text" (i.e., the code).

This project explores feature extraction from:

Embeddings on raw ABAP code
Concrete Syntax Trees derived from a Chevrotain-based parser

Embeddings on Raw Text

I will follow the approach described in this article, which applies TF-IDF vectorization to text lists of ingredients to predict cuisine.

An alternative to TF-IDF is Word2Vec. Both tools turn raw text into vectorized inputs for a model. TF-IDF weights each term by how distinctive it is for a given document, while Word2Vec learns dense vectors that capture relationships between and among words. Hypothetically, I could use these vectors as input features to a model that classifies a program file's author.

This article provides an introductory tutorial to TF-IDF.
And this one describes Word2Vec.
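
To make the TF-IDF idea concrete, here is a minimal sketch in plain Python, assuming a naive identifier tokenizer and a 1-nearest-neighbour comparison by cosine similarity. The helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`), the smoothed IDF formula, and the toy snippets are my own assumptions for illustration; this is not the code from the linked article.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Naive tokenizer (an assumption): lowercase identifiers and keywords only."""
    return re.findall(r"[a-z_][a-z0-9_]*", source.lower())


def fit_idf(documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse, L2-normalised TF-IDF vector; terms unseen in training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Toy corpus; the authors and snippets are invented for illustration.
corpus = [
    ("alice", "DATA lv_count TYPE i. LOOP AT lt_items INTO ls_item. "
              "lv_count = lv_count + 1. ENDLOOP."),
    ("bob", "SELECT * FROM mara INTO TABLE gt_mara. IF sy-subrc = 0. "
            "WRITE 'ok'. ENDIF."),
]
tokenized = [tokenize(source) for _, source in corpus]
idf = fit_idf(tokenized)
train_vectors = [(author, tfidf_vector(tokens, idf))
                 for (author, _), tokens in zip(corpus, tokenized)]

unknown = "LOOP AT lt_items INTO ls_item. lv_count = lv_count + 2. ENDLOOP."
unknown_vector = tfidf_vector(tokenize(unknown), idf)
predicted = max(train_vectors, key=lambda pair: cosine(unknown_vector, pair[1]))[0]
print(predicted)
```

Lowercasing is a design choice worth revisiting: keyword casing may itself be an authorship signal, so a case-preserving tokenizer could turn out to be more useful for this problem.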

Concrete Syntax Trees

In an attempt to parse raw text into syntax trees, I am exploring Chevrotain, a JavaScript-based parsing toolkit. The reasoning for exploring this avenue rather than relying only on word embeddings is that, while ABAP's grammar is very English-like, it is still a programming language. Syntax trees abstract the structural elements of an ABAP program, which can be collected, measured, and provided as input features to a classifier network.

Initial work can be found in the branch chevrotain-tokens.
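
To illustrate what "collected and measured" structural features could look like, here is a small Python stand-in. It is not the Chevrotain parser and does not build a full concrete syntax tree: it only nests statements inside a handful of block keywords, and it assumes that every statement ends with a period followed by whitespace (string literals containing ". " would break it). The `Node` class, the `ASSIGN` label, and the feature names are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# A small, incomplete subset of ABAP block keywords (assumption for this sketch).
BLOCKS = {"IF": "ENDIF", "LOOP": "ENDLOOP", "DO": "ENDDO", "WHILE": "ENDWHILE",
          "CASE": "ENDCASE", "FORM": "ENDFORM", "METHOD": "ENDMETHOD"}
CLOSERS = set(BLOCKS.values())


@dataclass
class Node:
    keyword: str
    children: list = field(default_factory=list)


def parse(source):
    """Build a nesting tree of statements keyed by their leading keyword."""
    # Drop full-line comments, which start with "*" in column 1.
    code = "\n".join(ln for ln in source.splitlines() if not ln.startswith("*"))
    root = Node("PROGRAM")
    stack = [root]
    for statement in re.split(r"\.(?:\s+|$)", code):
        words = statement.split()
        if not words:
            continue
        keyword = words[0].upper()
        if len(words) > 1 and words[1] == "=":
            keyword = "ASSIGN"  # hypothetical label for assignment statements
        if keyword in CLOSERS:
            # Close the innermost block only if it matches this closer.
            if len(stack) > 1 and BLOCKS.get(stack[-1].keyword) == keyword:
                stack.pop()
            continue
        node = Node(keyword)
        stack[-1].children.append(node)
        if keyword in BLOCKS:
            stack.append(node)
    return root


def tree_features(root):
    """Measure the tree: nesting depth, size, and relative statement frequencies."""
    counts = Counter()
    max_depth = 0

    def walk(node, depth):
        nonlocal max_depth
        max_depth = max(max_depth, depth)
        for child in node.children:
            counts[child.keyword] += 1
            walk(child, depth + 1)

    walk(root, 0)
    total = sum(counts.values())
    features = {"max_depth": float(max_depth), "statement_count": float(total)}
    for keyword, count in counts.items():
        features["freq_" + keyword] = count / total
    return features


example = """DATA lv_count TYPE i.
LOOP AT lt_items INTO ls_item.
  IF ls_item-qty > 0.
    lv_count = lv_count + 1.
  ENDIF.
ENDLOOP.
WRITE lv_count.
"""
features = tree_features(parse(example))
print(features)
```

The resulting dictionary can be fed to a classifier in the same way as any other numeric feature vector. A real concrete syntax tree would retain every token, which opens up finer-grained measurements than this keyword-level approximation.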

txross1993 / abap-authorship-classifier
