TypeEvalPy

A Micro-benchmarking Framework for Python Type Inference Tools
📌 Features:

[New] TypeEvalPy Autogen: generate an extended version of the benchmark covering many more Python types (see "Running TypeEvalPy Autogen" below)

🛠️ Supported Tools

| Supported :white_check_mark: | In-progress :wrench: | Planned :bulb: |
| --- | --- | --- |
| HeaderGen | Intellij PSI | MonkeyType |
| Jedi | Pyre | Pyannotate |
| Pyright | PySonar2 | |
| HiTyper | Pytype | |
| Scalpel | TypeT5 | |
| Type4Py | | |
| GPT-4 | | |
| Ollama | | |

🏆 TypeEvalPy Leaderboard

Below is a comparison of exact matches across tools; for ML-based tools, results are also reported for top_n predictions.

| Rank | 🛠️ Tool | Top-n | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | HeaderGen | 1 | 186 | 56 | 322 | 564 |
| 2 | Jedi | 1 | 122 | 0 | 293 | 415 |
| 3 | Pyright | 1 | 100 | 8 | 297 | 405 |
| 4 | HiTyper | 1<br>3<br>5 | 163<br>173<br>175 | 27<br>37<br>37 | 179<br>225<br>229 | 369<br>435<br>441 |
| 5 | HiTyper (static) | 1 | 141 | 7 | 102 | 250 |
| 6 | Scalpel | 1 | 155 | 32 | 6 | 193 |
| 7 | Type4Py | 1<br>3<br>5 | 39<br>103<br>109 | 19<br>31<br>31 | 99<br>167<br>174 | 157<br>301<br>314 |

(Auto-generated based on the analysis run on 20 Oct 2023)
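Here, an exact match counts a prediction only if it is identical to the ground-truth annotation, and the top_n rows credit ML-based tools when the correct type appears among their n highest-ranked predictions. The snippet below is a minimal, hypothetical illustration of that counting scheme; the function and data are invented for this example and are not TypeEvalPy's actual evaluation code:

```python
# Hypothetical illustration of top-n exact-match counting.
# This is NOT TypeEvalPy's evaluation code; names and data are made up.

def top_n_exact_matches(predictions, ground_truth, n=1):
    """Count type slots whose ground-truth type appears in the top-n predictions."""
    matches = 0
    for slot, expected in ground_truth.items():
        ranked = predictions.get(slot, [])
        if expected in ranked[:n]:
            matches += 1
    return matches

# Example: two type slots, with ranked predictions per slot (hypothetical data).
ground_truth = {"foo.return": "int", "foo.x": "str"}
predictions = {"foo.return": ["int", "float"], "foo.x": ["bytes", "str"]}

print(top_n_exact_matches(predictions, ground_truth, n=1))  # 1
print(top_n_exact_matches(predictions, ground_truth, n=3))  # 2
```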


🏆🤖 TypeEvalPy LLM Leaderboard

Below is a comparison showcasing exact matches for LLMs.

| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 | 225 | 85 | 465 | 775 |
| 2 | Finetuned:GPT 3.5 | 209 | 85 | 436 | 730 |
| 3 | codellama:13b-instruct | 199 | 75 | 425 | 699 |
| 4 | GPT 3.5 Turbo | 188 | 73 | 429 | 690 |
| 5 | codellama:34b-instruct | 190 | 52 | 425 | 667 |
| 6 | phind-codellama:34b-v2 | 182 | 60 | 399 | 641 |
| 7 | codellama:7b-instruct | 171 | 72 | 384 | 627 |
| 8 | dolphin-mistral | 184 | 76 | 356 | 616 |
| 9 | codebooga | 186 | 56 | 354 | 596 |
| 10 | llama2:70b | 168 | 55 | 342 | 565 |
| 11 | HeaderGen | 186 | 56 | 321 | 563 |
| 12 | wizardcoder:13b-python | 170 | 74 | 317 | 561 |
| 13 | llama2:13b | 153 | 40 | 283 | 476 |
| 14 | mistral:instruct | 155 | 45 | 250 | 450 |
| 15 | mistral:v0.2 | 155 | 45 | 248 | 448 |
| 16 | vicuna:13b | 153 | 35 | 260 | 448 |
| 17 | vicuna:33b | 133 | 29 | 267 | 429 |
| 18 | Jedi | 122 | 0 | 293 | 415 |
| 19 | Pyright | 100 | 8 | 297 | 405 |
| 19 | wizardcoder:7b-python | 103 | 48 | 254 | 405 |
| 20 | llama2:7b | 140 | 34 | 216 | 390 |
| 21 | HiTyper | 163 | 27 | 179 | 369 |
| 22 | wizardcoder:34b-python | 140 | 43 | 178 | 361 |
| 23 | orca2:7b | 117 | 27 | 184 | 328 |
| 24 | vicuna:7b | 131 | 17 | 172 | 320 |
| 25 | orca2:13b | 113 | 19 | 166 | 298 |
| 26 | Scalpel | 155 | 32 | 6 | 193 |
| 27 | Type4Py | 39 | 19 | 99 | 157 |
| 28 | tinyllama | 3 | 0 | 23 | 26 |
| 29 | phind-codellama:34b-python | 5 | 0 | 15 | 20 |
| 30 | codellama:13b-python | 0 | 0 | 0 | 0 |
| 31 | codellama:34b-python | 0 | 0 | 0 | 0 |
| 32 | codellama:7b-python | 0 | 0 | 0 | 0 |

(Auto-generated based on the analysis run on 14 Jan 2024)


:whale: Running with Docker

1️⃣ Clone the repo

```bash
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
```

2️⃣ Build Docker image

```bash
docker build -t typeevalpy .
```

3️⃣ Run TypeEvalPy

🕒 The first run takes about 30 minutes to build the Docker containers.

📂 Results will be generated in the results folder within the root directory of the repository. Each results folder will have a timestamp, allowing you to easily track and compare different runs.

Correlation of CSV Files Generated to Tables in the ICSE Paper

Here is how the auto-generated CSV tables relate to the paper's tables:

- **Table 1** in the paper is derived from three auto-generated CSV tables:
  - `paper_table_1.csv` - exact matches by type category.
  - `paper_table_2.csv` - exact matches for the 18 micro-benchmark categories.
  - `paper_table_3.csv` - sound and complete values for tools.
- **Table 2** in the paper is based on the following CSV table:
  - `paper_table_5.csv` - exact matches with top_n values for machine-learning tools.

Additionally, there are CSV tables that are *not* included in the paper:

- `paper_table_4.csv` - sound and complete values for the 18 micro-benchmark categories.
- `paper_table_6.csv` - sensitivity analysis.

A short sketch for inspecting these CSVs follows the run commands below.
```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy
```

🔧 Optionally, run analysis on specific tools:

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel
```

🛠️ Available options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl
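Once a run completes, the timestamped results folder can be inspected programmatically. The following is a minimal, illustrative sketch, assuming pandas is installed and that the run produced the `paper_table_*.csv` files described above; it is not part of TypeEvalPy itself:

```python
# Illustrative sketch (not part of TypeEvalPy): locate the newest timestamped
# results folder and preview one of the auto-generated CSV tables.
from pathlib import Path

import pandas as pd  # assumption: pandas is installed in your environment

results_root = Path("results")

# Assumption: each run creates a subfolder whose name sorts chronologically.
latest_run = max((p for p in results_root.iterdir() if p.is_dir()),
                 key=lambda p: p.name)

# paper_table_1.csv lists exact matches by type category (see the CSV notes above).
csv_path = next(latest_run.rglob("paper_table_1.csv"))
print(f"Latest run: {latest_run.name}")
print(pd.read_csv(csv_path).head())
```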

🤖 Running TypeEvalPy with LLMs

TypeEvalPy integrates with LLMs through Ollama, which streamlines model management. Begin by setting up your environment:

Configure the required settings in config.yaml.

With the config.yaml configured, run the following command:

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama
```

Running From Source

## 1. 📥 Installation

1. **Clone the repo**

   ```bash
   git clone https://github.com/secure-software-engineering/TypeEvalPy.git
   ```

2. **Install Dependencies and Set Up Virtual Environment**

   Run the following commands to set up and activate the virtual environment:

   ```bash
   python3 -m venv .env
   source .env/bin/activate
   pip install -r requirements.txt
   ```

---

## 2. 🚀 Usage: Running the Analysis

1. **Navigate to the `src` Directory**

   ```bash
   cd src
   ```

2. **Execute the Analyzer**

   Run the following command to start the benchmarking process on all tools:

   ```bash
   python main_runner.py
   ```

   Or run the analysis on specific tools only:

   ```bash
   python main_runner.py --runners headergen scalpel
   ```

Running TypeEvalPy Autogen

To generate an extended version of the original TypeEvalPy benchmark that covers many more Python types, run the following commands:

1. **Navigate to the `autogen` Directory**

   ```bash
   cd autogen
   ```

2. **Execute the Generation Script**

   Run the following command to start the generation process:

   ```bash
   python generate_typeevalpy_dataset.py
   ```

This will generate a folder in the repo root containing the autogen benchmark, named with the current date.


🤝 Contributing

Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.

To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md


⭐️ Show Your Support

Give a ⭐️ if this project helped you!