This repository contains the code for "Distinguishing AI and Human-Generated Code: a Case Study", a code stylometry tool that parses C/C++ code into an Abstract Syntax Tree (AST), computes n-gram frequencies/counts on the AST, extracts lexical features from the code files, and uses those as feature vectors for binary classification with traditional machine learning classifiers.
pip install -r requirements.txt
cd ASTanalysis
mkdir vendor && cd vendor
git clone https://github.com/tree-sitter/tree-sitter-cpp
cd ..
python builder.py
# my-languages.so* file will be created under the folder ASTanalysis/build
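For reference, builder.py presumably boils down to a single `Language.build_library` call from py-tree-sitter (the pre-0.22 API that still ships `build_library`); a minimal sketch:

```python
from tree_sitter import Language

# Compile the vendored grammar(s) into one shared library.
Language.build_library(
    "build/my-languages.so",     # output path
    ["vendor/tree-sitter-cpp"],  # grammar cloned in the previous step
)
```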
docker run -d -p 127.0.0.1:27017:27017 -v <path-to>/db:/data/db mongo:6
This file parses each C/C++ code file in the "Dataset" folder into its Abstract Syntax Tree and saves it in XML format to a destination folder and to a collection in the MongoDB database.
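As a rough illustration of the parsing step (not ASTParser.py's exact logic; the snippet and setup are assumptions), loading the grammar built earlier and dumping an AST looks like this:

```python
from tree_sitter import Language, Parser

CPP = Language("build/my-languages.so", "cpp")
parser = Parser()
parser.set_language(CPP)

tree = parser.parse(b"int main() { return 0; }")
print(tree.root_node.sexp())  # S-expression view of the AST
```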
This file executes the ASTParser.py file that loops through the dataset.
This will create pickle files with all the node types extracted from the AST of each code file in the dataset folder. It will also create a text file with all the node types found across all code files in the dataset folder.
This file creates the empty dictionaries for bi-, tri-, and quad-grams and saves them to a pickle file in the desired destination folder.
This file creates a dictionary whose keys are all the node types found within the code and whose values are the generalized node types. You can change the generalized node types from within this file.
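For illustration (these particular mappings are made up; the real ones are defined in the file), the dictionary looks like:

```python
# Hypothetical entries mapping concrete node types to generalized ones
dict_of_nodes = {
    "if_statement": "statement",
    "for_statement": "statement",
    "binary_expression": "expression",
    "call_expression": "expression",
}
```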
This will update the empty dictionary for each code file in the source dataset based on the n-gram length (2, 3, or 4).
These files will create the normalized vectors for each file from the dictionaries produced by the (Bigram/Trigram/Quadgram)DictUpdate.py files.
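Conceptually, the update-and-normalize stage counts n-gram occurrences over the sequence of AST node types and divides by the total; a minimal sketch under that assumption (the function and variable names are illustrative, not the repo's):

```python
def ngram_frequencies(node_types, empty_dict, n=2):
    """Count n-grams over a sequence of AST node types, then normalize."""
    counts = dict(empty_dict)  # start from the pre-built empty dictionary
    for gram in zip(*(node_types[i:] for i in range(n))):
        if gram in counts:
            counts[gram] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return {gram: c / total for gram, c in counts.items()}
```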
These files will apply stratified binary classification using RFC, KNN, XGB & SVM on the normalized vectors for syntactic features.
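The classifier line-up maps directly onto scikit-learn and xgboost; a hedged sketch of the cross-validation loop (hyperparameters and the `load_vectors` helper are placeholders, not the repo's actual settings):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

X, y = load_vectors()  # placeholder: normalized vectors and AI/human labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in {
    "RFC": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "XGB": XGBClassifier(),
}.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```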
This file loops through the source dataset files and executes the LexicalParse.py for each file.
This file parses the source code file and creates features based on the lexical elements within the file.
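As an example of what a lexical feature vector can look like (these specific features are illustrative, not necessarily the ones LexicalParse.py computes):

```python
def lexical_features(source: str) -> list[float]:
    """Toy lexical features computed from raw C/C++ source text."""
    lines = source.splitlines() or [""]
    n = len(lines)
    return [
        sum(len(l) for l in lines) / n,                  # mean line length
        sum(l.count("\t") for l in lines) / n,           # tabs per line
        float(source.count("//") + source.count("/*")),  # comment markers
        sum(1 for l in lines if not l.strip()) / n,      # blank-line ratio
    ]
```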
This file will apply stratified binary classification using RFC, KNN, XGB & SVM on the normalized vectors for lexical features.
This file loops through the source dataset files and executes the ASTandLexicalParse.py
This file parses the source code files into syntactic and lexical feature vectors and saves those vectors in MongoDB.
This file applies stratified binary classification using RFC, KNN, XGB & SVM on the normalized vectors for combined syntactic and lexical features.
Run the following command, which will execute the ASTParser.py file, by providing two absolute paths as inputs (the Dataset folder and the destination folder):
python3 ASTLooper.py /absolute/path/to/ASTanalysis/Dataset /absolute/path/to/destination/folder
This will have 3 outputs:
Follow these steps to double-check MongoDB:
docker ps
docker exec -it CONTAINER_NAME_OR_ID bash
You are now inside the container; run the following commands to inspect the DB:
mongosh
use admin
show dbs # you should see the created DB CodeStylometry
use CodeStylometry
show collections
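The same inspection can be scripted with pymongo (the port matches the docker run command above; the collection name is one the later steps create):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1:27017")
db = client["CodeStylometry"]
print(db.list_collection_names())
# After step 9, for example, the ASTBigramVector collection should exist:
print(db["ASTBigramVector"].count_documents({}))
```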
Takes a destination folder path as an input argument and saves the node types in a pickle file in the destination folder. Create a new folder NodeType in the Processed folder (see prerequisites).
To run the file:
python3 CreateNodeTypeSet.py /path/to/NodeType
The following files/folders will be created inside the NodeType directory: Autopilot, buffer.txt, Control, finalNodeTypes.txt
Run the file by providing a destination folder path: create a new folder EmptyDictionaries in the Processed folder and provide the path to this new folder as the input argument.
To run the file:
python3 NGramEmptyPair.py /path/to/EmptyDictionaries
Outputs 3 empty dictionary files (dictOfBigram.pickle, dictofTrigram.pickle, dictofQuadgram.pickle), one for each of the bi-, tri-, and quad-gram combinations.
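Conceptually (file handling and key format are assumptions), the empty dictionaries are every n-fold combination of the collected node types, initialized to zero:

```python
import itertools
import pickle

with open("finalNodeTypes.txt") as f:  # produced by CreateNodeTypeSet.py
    node_types = [line.strip() for line in f if line.strip()]

# Note: the quadgram dictionary grows as |node_types|**4
for n, name in [(2, "dictOfBigram"), (3, "dictofTrigram"), (4, "dictofQuadgram")]:
    empty = {gram: 0 for gram in itertools.product(node_types, repeat=n)}
    with open(f"{name}.pickle", "wb") as out:
        pickle.dump(empty, out)
```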
Takes a destination folder path as an input argument. Create the directory DictOfNodes inside the Processed folder.
To run the file:
python3 DictOfNodes.py /path/to/DictOfNodes
The file DictOfNodes.pickle will be created inside the DictOfNodes directory.
Takes 2 inputs to create a bigram dictionary with the frequency of each pair combination in the AST: (i) the path to the root folder containing the ASTs created in step # 3 (i.e., NodeType), and (ii) the path to the destination folder.
You need to change the pathToEmptyDict variable on line 99 of the code and provide the absolute path to the empty bigram dictionary pickle (dictOfBigram.pickle) created earlier. You also need to change the DictOfNodes.py path on line 116 of the code.
Create a new folder ASTDictionaries in the Processed folder and provide its path as the input argument.
To run the file:
python3 BigramDictUpdate.py /path/to/NodeType /path/to/ASTDictionaries
This will output a folder named Bigram containing pickle files with the updated frequency of each pair combination.
Repeat the same steps as in step # 6.
Provide the path to the Trigram dictionary dictOfTrigram.pickle
This will output a folder named Trigram containing pickle files with the updated frequency of each triple combination.
Repeat the same steps as in step # 6.
Provide the path to the Quadgram dictionary dictOfQuadgram.pickle
This will output a folder named Quadgram containing pickle files with the updated frequency of each quadruple combination.
This will execute the ASTandLexicalParse.py file.
Run the file by providing the two absolute paths below as inputs:
To run the file:
python3 ASTandLexLooper.py /absolute/path/of/dataset/folder /absolute/path/of/dictionary
Based on the dictionary input file, it will create a new collection in MongoDB named CombinedVectorWith<Bigram/Trigram/Quadgram>.
Run the file again for n-gram lengths 3 and 4.
Uncomment the appropriate line (line 27, 28, or 29) to retrieve the vectors from the required MongoDB collection.
# mycol = mydb["CombinedVectorWithBigram"]
# mycol = mydb["CombinedVectorWithTrigram"]
# mycol = mydb["CombinedVectorWithQuadgram"]
To run the file:
python3 ASTandLexClassification.py
Takes one input argument to create the normalized vectors for the bigram combination: the path to the Bigram source AST folder created in step # 6.
This will output the vectors and save them in MongoDB in the ASTBigramVector collection.
To run the file:
python3 BigramVectorCreation.py /path/to/bigram/AST/folder
Repeat the same steps as in step # 9.
Provide the path to the Trigram source AST folder.
This will output the vectors and save them in MongoDB in the ASTTrigramVector collection.
Repeat the same steps as in step # 9.
Provide the path to the Quadgram source AST folder.
This will output the vectors and save them in MongoDB in the ASTQuadgramVector collection.
This will retrieve data from the MongoDB bigram vector collection and run the classification models on the feature vectors.
To run the file:
python3 BiStratClassification.py
To run the file:
python3 TriStratClassification.py
To run the file:
python3 QuadStratClassification.py
Run the file by providing the source path to the Dataset folder.
To run the file:
python3 LexicalLooper.py /absolute/path/of/dataset/folder
This will execute the LexicalParse.py file and create a collection named LexicalVectors in MongoDB.