TypeEvalPy

A Micro-benchmarking Framework for Python Type Inference Tools
📌 Features:

[New] TypeEvalPy Autogen: generate an extended version of the benchmark covering many more Python types (see "Running TypeEvalPy Autogen" below)

🛠️ Supported Tools

| Supported :white_check_mark: | In-progress :wrench: | Planned :bulb: |
| --- | --- | --- |
| HeaderGen | Intellij PSI | MonkeyType |
| Jedi | Pyre | Pyannotate |
| Pyright | PySonar2 | |
| HiTyper | Pytype | |
| Scalpel | TypeT5 | |
| Type4Py | | |
| GPT-4 | | |
| Ollama | | |

🏆 TypeEvalPy Leaderboard

Below is a comparison of exact matches across tools; for ML-based tools, results are also reported for top_n predictions.

| Rank | 🛠️ Tool | Top-n | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | HeaderGen | 1 | 186 | 56 | 322 | 564 |
| 2 | Jedi | 1 | 122 | 0 | 293 | 415 |
| 3 | Pyright | 1 | 100 | 8 | 297 | 405 |
| 4 | HiTyper | 1<br>3<br>5 | 163<br>173<br>175 | 27<br>37<br>37 | 179<br>225<br>229 | 369<br>435<br>441 |
| 5 | HiTyper (static) | 1 | 141 | 7 | 102 | 250 |
| 6 | Scalpel | 1 | 155 | 32 | 6 | 193 |
| 7 | Type4Py | 1<br>3<br>5 | 39<br>103<br>109 | 19<br>31<br>31 | 99<br>167<br>174 | 157<br>301<br>314 |

(Auto-generated based on the analysis run on 20 Oct 2023)
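Here, an exact match counts a prediction only if it is identical to the ground-truth annotation, and the top_n rows credit ML-based tools when the correct type appears among their n highest-ranked predictions. The snippet below is a minimal, hypothetical illustration of that counting scheme; the function and data are invented for this example and are not TypeEvalPy's actual evaluation code:

```python
# Hypothetical illustration of top-n exact-match counting.
# This is NOT TypeEvalPy's evaluation code; names and data are made up.

def top_n_exact_matches(predictions, ground_truth, n=1):
    """Count type slots whose ground-truth type appears in the top-n predictions."""
    matches = 0
    for slot, expected in ground_truth.items():
        ranked = predictions.get(slot, [])
        if expected in ranked[:n]:
            matches += 1
    return matches

# Example: two type slots, with ranked predictions per slot (hypothetical data).
ground_truth = {"foo.return": "int", "foo.x": "str"}
predictions = {"foo.return": ["int", "float"], "foo.x": ["bytes", "str"]}

print(top_n_exact_matches(predictions, ground_truth, n=1))  # 1
print(top_n_exact_matches(predictions, ground_truth, n=3))  # 2
```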


🏆🤖 TypeEvalPy LLM Leaderboard

Below is a comparison showcasing exact matches for LLMs.

| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 | 225 | 85 | 465 | 775 |
| 2 | Finetuned:GPT 3.5 | 209 | 85 | 436 | 730 |
| 3 | codellama:13b-instruct | 199 | 75 | 425 | 699 |
| 4 | GPT 3.5 Turbo | 188 | 73 | 429 | 690 |
| 5 | codellama:34b-instruct | 190 | 52 | 425 | 667 |
| 6 | phind-codellama:34b-v2 | 182 | 60 | 399 | 641 |
| 7 | codellama:7b-instruct | 171 | 72 | 384 | 627 |
| 8 | dolphin-mistral | 184 | 76 | 356 | 616 |
| 9 | codebooga | 186 | 56 | 354 | 596 |
| 10 | llama2:70b | 168 | 55 | 342 | 565 |
| 11 | HeaderGen | 186 | 56 | 321 | 563 |
| 12 | wizardcoder:13b-python | 170 | 74 | 317 | 561 |
| 13 | llama2:13b | 153 | 40 | 283 | 476 |
| 14 | mistral:instruct | 155 | 45 | 250 | 450 |
| 15 | mistral:v0.2 | 155 | 45 | 248 | 448 |
| 16 | vicuna:13b | 153 | 35 | 260 | 448 |
| 17 | vicuna:33b | 133 | 29 | 267 | 429 |
| 18 | Jedi | 122 | 0 | 293 | 415 |
| 19 | Pyright | 100 | 8 | 297 | 405 |
| 19 | wizardcoder:7b-python | 103 | 48 | 254 | 405 |
| 20 | llama2:7b | 140 | 34 | 216 | 390 |
| 21 | HiTyper | 163 | 27 | 179 | 369 |
| 22 | wizardcoder:34b-python | 140 | 43 | 178 | 361 |
| 23 | orca2:7b | 117 | 27 | 184 | 328 |
| 24 | vicuna:7b | 131 | 17 | 172 | 320 |
| 25 | orca2:13b | 113 | 19 | 166 | 298 |
| 26 | Scalpel | 155 | 32 | 6 | 193 |
| 27 | Type4Py | 39 | 19 | 99 | 157 |
| 28 | tinyllama | 3 | 0 | 23 | 26 |
| 29 | phind-codellama:34b-python | 5 | 0 | 15 | 20 |
| 30 | codellama:13b-python | 0 | 0 | 0 | 0 |
| 31 | codellama:34b-python | 0 | 0 | 0 | 0 |
| 32 | codellama:7b-python | 0 | 0 | 0 | 0 |

(Auto-generated based on the analysis run on 14 Jan 2024)


:whale: Running with Docker

1️⃣ Clone the repo

```bash
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
```

2️⃣ Build Docker image

```bash
docker build -t typeevalpy .
```

3️⃣ Run TypeEvalPy

🕒 The first run takes about 30 minutes to build the Docker containers.

📂 Results will be generated in the results folder within the root directory of the repository. Each results folder will have a timestamp, allowing you to easily track and compare different runs.

Correlation of CSV Files Generated to Tables in the ICSE Paper

Here is how the auto-generated CSV tables relate to the paper's tables:

- **Table 1** in the paper is derived from three auto-generated CSV tables:
  - `paper_table_1.csv` - exact matches by type category.
  - `paper_table_2.csv` - exact matches for the 18 micro-benchmark categories.
  - `paper_table_3.csv` - sound and complete values for tools.
- **Table 2** in the paper is based on the following CSV table:
  - `paper_table_5.csv` - exact matches with top_n values for machine-learning tools.

Additionally, there are CSV tables that are *not* included in the paper:

- `paper_table_4.csv` - sound and complete values for the 18 micro-benchmark categories.
- `paper_table_6.csv` - sensitivity analysis.

A short sketch for inspecting these CSVs follows the run commands below.
```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy
```

🔧 Optionally, run analysis on specific tools:

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel
```

🛠️ Available options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl
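Once a run completes, the timestamped results folder can be inspected programmatically. The following is a minimal, illustrative sketch, assuming pandas is installed and that the run produced the `paper_table_*.csv` files described above; it is not part of TypeEvalPy itself:

```python
# Illustrative sketch (not part of TypeEvalPy): locate the newest timestamped
# results folder and preview one of the auto-generated CSV tables.
from pathlib import Path

import pandas as pd  # assumption: pandas is installed in your environment

results_root = Path("results")

# Assumption: each run creates a subfolder whose name sorts chronologically.
latest_run = max((p for p in results_root.iterdir() if p.is_dir()),
                 key=lambda p: p.name)

# paper_table_1.csv lists exact matches by type category (see the CSV notes above).
csv_path = next(latest_run.rglob("paper_table_1.csv"))
print(f"Latest run: {latest_run.name}")
print(pd.read_csv(csv_path).head())
```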

🤖 Running TypeEvalPy with LLMs

TypeEvalPy integrates with LLMs through Ollama, which streamlines model management. Begin by setting up your environment:

Configure the required settings in config.yaml.

With the config.yaml configured, run the following command:

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama
```

Running From Source

## 1. 📥 Installation

1. **Clone the repo**

   ```bash
   git clone https://github.com/secure-software-engineering/TypeEvalPy.git
   ```

2. **Install Dependencies and Set Up Virtual Environment**

   Run the following commands to set up and activate the virtual environment:

   ```bash
   python3 -m venv .env
   source .env/bin/activate
   pip install -r requirements.txt
   ```

---

## 2. 🚀 Usage: Running the Analysis

1. **Navigate to the `src` Directory**

   ```bash
   cd src
   ```

2. **Execute the Analyzer**

   Run the following command to start the benchmarking process on all tools:

   ```bash
   python main_runner.py
   ```

   Or run the analysis on specific tools only:

   ```bash
   python main_runner.py --runners headergen scalpel
   ```

Running TypeEvalPy Autogen

To generate an extended version of the original TypeEvalPy benchmark that covers many more Python types, run the following commands:

1. **Navigate to the `autogen` Directory**

   ```bash
   cd autogen
   ```

2. **Execute the Generation Script**

   Run the following command to start the generation process:

   ```bash
   python generate_typeevalpy_dataset.py
   ```

This will generate a folder in the repo root containing the autogen benchmark, named with the current date.


🤝 Contributing

Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.

To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md


⭐️ Show Your Support

Give a ⭐️ if this project helped you!