ldklab / scored23_release


Distinguishing AI and Human-Generated Code: a Case Study

This repository contains the code for Distinguishing AI and Human-Generated Code: a Case Study, a code stylometry tool that parses C/C++ source files into Abstract Syntax Trees (ASTs), computes n-gram counts/frequencies over the ASTs, extracts lexical features from the code files, and uses the results as feature vectors for binary classification with traditional machine learning classifiers.

Features:

  1. Parsing C/C++ code into ASTs
  2. Computing n-gram vectors
  3. Building n-gram dictionaries
  4. Binary classification based on syntactic features
  5. Binary classification based on lexical features
  6. Binary classification based on syntactic and lexical (combined) features

Prerequisites

  1. Download required packages using pip install -r requirements.txt
  2. Clone and build the "TreeSitter" package using the following steps:
    cd ASTanalysis
    mkdir vendor && cd vendor
    git clone https://github.com/tree-sitter/tree-sitter-cpp
    cd ..
    python builder.py
    # a my-languages.so* file will be created under the folder ASTanalysis/build
  3. Create a new folder "Processed" in the same directory as the "ASTanalysis" folder.
  4. Download and install a local MongoDB instance, e.g. docker run -d -p 127.0.0.1:27017:27017 -v <path-to>/db:/data/db mongo:6
  5. Download the "ASTanalysis" folder. It contains the code files and the "Dataset" folder.

Files:

1. ASTParser.py

This file parses each C/C++ code file in the "Dataset" folder into its Abstract Syntax Tree and saves it in XML format to a destination folder and to a collection in a MongoDB database.
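The actual parser relies on TreeSitter and MongoDB; the following is a minimal self-contained sketch of the AST-to-XML serialization step only. The nested-tuple AST and the `tree_to_xml` helper are hypothetical stand-ins for real tree-sitter nodes (the node type names do follow tree-sitter-cpp conventions):

```python
import xml.etree.ElementTree as ET

def tree_to_xml(node):
    """Recursively convert a (type, children) tuple into an XML element."""
    elem = ET.Element(node[0])
    for child in node[1]:
        elem.append(tree_to_xml(child))
    return elem

# Toy AST roughly corresponding to `int main() { return 0; }`
ast = ("translation_unit", [
    ("function_definition", [
        ("primitive_type", []),
        ("function_declarator", []),
        ("compound_statement", [
            ("return_statement", []),
        ]),
    ]),
])

xml_str = ET.tostring(tree_to_xml(ast), encoding="unicode")
print(xml_str)
```

The real ASTParser.py additionally writes the same tree into a MongoDB collection.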

2. ASTLooper.py

This file loops through the dataset and executes ASTParser.py on each file.

3. CreateNodeTypeSet.py

This file creates pickle files with the node types extracted from the AST of each code file in the "Dataset" folder. It also creates a text file listing all node types found across the dataset.
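The core of this step is a tree walk that accumulates node types into a set. A minimal sketch, again assuming the hypothetical nested-tuple AST representation rather than real tree-sitter nodes:

```python
import io
import pickle

def collect_node_types(node, types=None):
    """Walk a (type, children) tree and collect every node type seen."""
    if types is None:
        types = set()
    types.add(node[0])
    for child in node[1]:
        collect_node_types(child, types)
    return types

ast = ("translation_unit", [
    ("function_definition", [
        ("compound_statement", [("return_statement", [])]),
    ]),
])

node_types = collect_node_types(ast)
buf = io.BytesIO()
pickle.dump(sorted(node_types), buf)  # the repo writes this to a .pkl file per code file
```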

4. NGramEmptyPair.py

This file creates empty dictionaries for bigrams, trigrams, and quadgrams and saves them as pickle files in the desired destination folder.
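An empty n-gram dictionary can be built from the node type set with a Cartesian product, one key per possible n-gram, all counts zeroed. A sketch with a three-type toy set (names hypothetical):

```python
from itertools import product

node_types = ["compound_statement", "if_statement", "return_statement"]

bigrams   = {pair: 0 for pair in product(node_types, repeat=2)}
trigrams  = {trip: 0 for trip in product(node_types, repeat=3)}
quadgrams = {quad: 0 for quad in product(node_types, repeat=4)}

print(len(bigrams), len(trigrams), len(quadgrams))  # 9 27 81
```

Note that the dictionary size grows as |types|^n, which is why the repo generalizes node types first (see DictOfNodes.py).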

5. DictOfNodes.py

This file creates a dictionary whose keys are all the node types found in the code and whose values are the corresponding generalized node types. The generalized node types can be changed within this file.

6. BigramDictUpdate.py & TrigramDictUpdate.py & QuadgramDictUpdate.py

These files update the empty dictionary for each code file in the source dataset according to the n-gram size (2, 3, or 4).
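The update itself amounts to sliding a window of size n over a sequence of node types and incrementing the matching dictionary entries. A sketch, assuming (hypothetically) that n-grams are taken over root-to-leaf paths of node types:

```python
def update_ngram_counts(seq, counts, n):
    """Increment counts for every length-n window of seq present in counts."""
    for i in range(len(seq) - n + 1):
        gram = tuple(seq[i:i + n])
        if gram in counts:
            counts[gram] += 1
    return counts

path = ["translation_unit", "function_definition",
        "compound_statement", "return_statement"]

# Seed the dictionary (normally loaded from the NGramEmptyPair pickle).
counts = {}
for i in range(len(path) - 1):
    counts[(path[i], path[i + 1])] = 0

update_ngram_counts(path, counts, n=2)
```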

7. BigramVectorCreation.py & TrigramVectorCreation.py & QuadgramVectorCreation.py

These files create the normalized vectors for each file produced by the (Bigram/Trigram/Quadgram)DictUpdate.py scripts.
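Normalization here plausibly means converting raw n-gram counts into relative frequencies so that files of different sizes are comparable. A sketch (the exact normalization used by the repo is an assumption):

```python
def normalize(counts):
    """Divide each count by the total so the values sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: v / total for k, v in counts.items()}

counts = {("a", "b"): 3, ("b", "c"): 1, ("c", "a"): 0}
vec = normalize(counts)
```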

8. BiStratClassification.py & TriStratClassification.py & QuadStratClassification.py

These files apply stratified binary classification using RFC, KNN, XGB, and SVM to the normalized syntactic feature vectors.
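Stratified cross-validated classification with these model families can be sketched with scikit-learn as below. Synthetic data stands in for the real feature vectors, and XGB is omitted here since it requires the separate xgboost package; the hyperparameters are defaults, not the repo's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the normalized n-gram vectors and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Stratified folds preserve the class balance in every train/test split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for name, clf in [("RFC", RandomForestClassifier(random_state=0)),
                  ("KNN", KNeighborsClassifier()),
                  ("SVM", SVC())]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```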

9. LexicalLooper.py

This file loops through the source dataset files and executes LexicalParse.py on each file.

10. LexicalParse.py

This file parses a source code file and creates features based on its lexical elements.
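Lexical features are typically simple token- and layout-level statistics. The exact features used by the repo aren't listed here, so the following is an illustrative sketch with a few common code-stylometry measures (the feature names and keyword set are hypothetical):

```python
import re

CPP_KEYWORDS = {"if", "else", "for", "while", "return", "int", "void"}

def lexical_features(source):
    """Compute a small dictionary of lexical statistics for a source string."""
    lines = source.splitlines()
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    return {
        "num_lines": len(lines),
        "avg_line_length": sum(len(l) for l in lines) / max(len(lines), 1),
        "keyword_count": sum(1 for t in tokens if t in CPP_KEYWORDS),
        "comment_lines": sum(1 for l in lines if l.lstrip().startswith("//")),
    }

code = "// add two ints\nint add(int a, int b) {\n    return a + b;\n}\n"
feats = lexical_features(code)
```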

11. LexicalClassification.py

This file applies stratified binary classification using RFC, KNN, XGB, and SVM to the normalized lexical feature vectors.

12. ASTandLexicalLooper.py

This file loops through the source dataset files and executes ASTandLexicalParse.py on each file.

13. ASTandLexicalParse.py

This file parses the source code files into syntactic and lexical features and saves the resulting vectors in MongoDB.

14. ASTandLexicalClassification.py

This file applies stratified binary classification using RFC, KNN, XGB, and SVM to the normalized combined syntactic and lexical feature vectors.
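The combined setting concatenates the syntactic n-gram frequencies and the lexical statistics into a single flat vector per file. A minimal sketch (sorting keys is an assumption made here to keep the feature order reproducible):

```python
def combine_features(syntactic, lexical):
    """Concatenate syntactic and lexical feature dicts into one flat vector,
    using a fixed (sorted) key order so every file's vector is aligned."""
    combined = []
    for key in sorted(syntactic):
        combined.append(syntactic[key])
    for key in sorted(lexical):
        combined.append(lexical[key])
    return combined

syn = {("if_statement", "compound_statement"): 0.4,
       ("for_statement", "compound_statement"): 0.6}
lex = {"avg_line_length": 18.5, "keyword_count": 7}
vec = combine_features(syn, lex)
```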

How To Run:

1. ASTParser.py

2. ASTLooper.py

3. CreateNodeTypeSet.py

4. NGramEmptyPair.py

5. DictOfNodes.py

6. BigramDictUpdate.py

7. TrigramDictUpdate.py

8. QuadgramDictUpdate.py

For classification based on syntactic + lexical features, follow the steps below after step 8:

9. ASTandLexicalParse.py

10. ASTandLexicalLooper.py

11. ASTandLexicalClassification.py

For classification based on syntactic features only, follow the steps below after step 8:

9. BigramVectorCreation.py

10. TrigramVectorCreation.py

11. QuadgramVectorCreation.py

12. BiStratClassification.py

13. TriStratClassification.py

To run the file:

    python3 TriStratClassification.py

14. QuadStratClassification.py

To run the file:

    python3 QuadStratClassification.py

For classification based on lexical features only, follow the steps below (no need to run steps 1-8):

1. LexicalParse.py

2. LexicalLooper.py

3. LexicalClassification.py

External Tools:

  1. TreeSitter
  2. NetworkX
  3. MongoDB