StarCraft II is a challenging benchmark for AI agents because it demands both micro-level control and macro-level strategic awareness. Previous works such as AlphaStar and SCC achieve impressive performance on StarCraft II, yet they still exhibit deficiencies in long-term strategic planning and strategy interpretability. Emerging large language model (LLM) agents show immense potential for solving intricate tasks.
Motivated by this, we aim to validate the capabilities of LLMs on StarCraft II. We first develop a textual StarCraft II environment, called TextStarCraft II. Second, we propose a Chain of Summarization (CoS) method, comprising single-frame summarization for processing raw observations and multi-frame summarization for analyzing game information, providing command recommendations, and generating strategic decisions. Our experiments demonstrate that LLM agents can defeat the built-in AI at the Harder (Lv5) difficulty level.
Work | AlphaStar | SCC | HierNet-SC2 | AlphaStar Unplugged | ROA-Star | Ours |
---|---|---|---|---|---|---|
Method | SL+RL+self-play | SL+RL+self-play | data mining + RL | offline RL | SL+RL+self-play | prompting + rule-based script |
Compute resources | 12,000 CPU cores, 384 TPUs | Linear | 4 GPUs, 48 CPU cores | not clear | 2x 64 V100 | 1 GPU, 1 CPU (home computer) |
Replays required | 971,000 | 4,638 | 608 | 20,000,000 (20M) | 120,938 | 0 |
Best result (strongest opponent defeated) | Serral (one of the best pro gamers in the world) | Time (IEM 2023 champion) | built-in AI Lv-10 | AlphaStar BC agent | Hero (GSL champion) | built-in AI Lv-5 |
Strategy interpretability | ✖ | ✖ | ✖ | ✖ | ✖ | ✔ |
Extensibility (adapts to the latest game version and other races) | ✖ | ✖ | ✖ | ✖ | ✖ | ✔ |
Our paper:
Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach https://arxiv.org/abs/2312.11865
Our website:
Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach
Our demo video:
Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach
Comparing models using either the full CoS or CoS without CoT.
Model | Method | Win Rate | PBR | RUR | APU | TR |
---|---|---|---|---|---|---|
Using Full CoS | | | | | | |
GPT3.5-Turbo-16k | Full CoS | 5/10 | 0.0781 | 7875 | 0.7608 | 0.4476 |
GPT4-Turbo | Full CoS | 3/6 | 0.0337 | 8306 | 0.7194 | 0.3452 |
Gemini-Pro | Full CoS | 2/10 | 0.0318 | 9284 | 0.6611 | 0.3571 |
GLM4 | Full CoS | 2/10 | 0.0327 | 3131 | 0.6644 | 0.2904 |
Llama2 70B | Full CoS | / | / | / | / | / |
Claude2.1 | Full CoS | 2/9 | 0.0219 | 10867 | 0.6599 | 0.4312 |
Using CoS without CoT | | | | | | |
Finetune-ChatGlm3 6b | CoS w/o CoT | 2/10 | 0.0528 | 30356 | 0.6547 | 0.1714 |
Finetune-Qwen 1.8b | CoS w/o CoT | 6/10 | 0.0384 | 12826 | 0.7506 | 0.2095 |
Finetune-Qwen 7b | CoS w/o CoT | 6/12 | 0.0421 | 12276 | 0.7234 | 0.3214 |
Finetune-Llama2 7b | CoS w/o CoT | 0/12 | 0.0469 | 12295 | 0.5752 | 0.0853 |
Prompt | LV1 | LV2 | LV3 | LV4 | LV5 | LV6 |
---|---|---|---|---|---|---|
Prompt1 | 7/8 | 6/9 | 2/8 | 1/8 | 0/8 | 0/8 |
Prompt2 | 8/8 | 9/9 | 8/8 | 21/25 | 7/14 | 0/12 |
StarCraft II is a classic game developed by Blizzard, with professional leagues such as IEM and WTL. You can download the Battle.net client from https://us.shop.battle.net/en-us or from https://www.blizzard.com/zh-tw/.
If you are in China: the CN servers are no longer available, so you will need to install StarCraft II by following this video: video, or search the internet for instructions.
First, use StarCraft II Editor.exe to download the newest ladder maps.
When you open the editor, log in with your Blizzard account and search for the maps you want. Then place the maps in your StarCraft II installation under StarCraft II\Maps (if the 'Maps' folder does not exist, create it).
Or you can download maps in here:
- python: 3.10
- cuda: 12.1
- torch: 2.1.0
- openai: 0.27.9 (very important: versions above 0.28 have altered API functionality)
Install all necessary packages with pip install -r requirements.txt
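As a quick sanity check, you can verify the pinned openai version before running anything. This snippet is a minimal sketch and not part of the repo; it only reflects the 0.27.9 pin noted above.

```python
# Minimal sanity check (not part of the repo): confirm the installed openai
# package is on the pinned 0.27.x line, since 0.28+ changed the API this project uses.
from importlib.metadata import version

openai_version = version("openai")
if not openai_version.startswith("0.27"):
    raise RuntimeError(
        f"openai=={openai_version} detected; this project expects openai==0.27.9."
    )
print("openai version OK:", openai_version)
```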
- `burnysc2`: our core package, offering an easy-to-use API for project development. Find more information here: Python-sc2
- `chromadb`: we use the Chroma vector database. Due to package conflicts, install `chromadb` first, followed by `burnysc2`.
- Huggingface and `sentence-transformers`: we use the embedding model `sentence-transformers/all-mpnet-base-v2`. In our GitHub version it is downloaded automatically. We also provide a release zip, which you can download and unzip directly (it includes the embedding model).

- Agent vs Bot AI: test with `test_the_env.py` and `multiprocess_test.py`.
- Human vs Agent: try our `Human_LLM_agent_test.py`.
- Agent vs Agent: try our `2agent_test.py`.
You can run `test_the_env.py` to try our agent. Here are the parameters you need to set:

- `player_race`: currently, only `Protoss` is supported; `Zerg` and `Terran` are under development.
- `opposite_race`: typically set to `Zerg`, but `Terran` and `Protoss` are also compatible.
- `difficulty`: we offer 10 difficulty levels, ranging from Level 1 (`VeryEasy`) to Level 10 (`CheatInsane`). Note that these names differ from those in the StarCraft II client, but the AI difficulty is unchanged.
Level | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Blizzard difficulty | VeryEasy | Easy | Medium | Hard | Harder | Very Hard | Elite | CheatVision | CheatMoney | CheatInsane |
python-sc2 difficulty | VeryEasy | Easy | Medium | MediumHard | Hard | Harder | VeryHard | CheatVision | CheatMoney | CheatInsane |
- `replay_folder`: the folder for saving demo replays.
- `LLM_model_name`: we used `gpt-3.5-turbo-16k` in our experiments.
- `LLM_temperature`: set between 0 and 1 as you prefer.
- `LLM_api_key`: your API key.
- `LLM_api_base`: your API base URL.

Note: using an LLM to play StarCraft II can take approximately 7 hours for a single game. To save time, you can run multiple demos simultaneously using `multiprocess_test.py`. Configure the following parameter:

- `num_processes`: the number of processes to spawn. The other parameters are the same as in the single-process setup.
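For reference, here is a minimal launch sketch showing how these parameters map onto a python-sc2 (`burnysc2`) match setup. The `LLMProtossBot` class, its constructor arguments, and its import path are placeholders for illustration; the real entry point is `test_the_env.py`.

```python
# Hypothetical launch sketch; LLMProtossBot and its import path are placeholders,
# the actual entry point is test_the_env.py.
from sc2 import maps
from sc2.main import run_game
from sc2.data import Race, Difficulty
from sc2.player import Bot, Computer

from TextStarCraft2_2.env.bot.Protoss_bot import LLMProtossBot  # hypothetical import

llm_bot = LLMProtossBot(
    model_name="gpt-3.5-turbo-16k",         # LLM_model_name
    temperature=0.1,                        # LLM_temperature
    api_key="YOUR_API_KEY",                 # LLM_api_key
    api_base="https://api.openai.com/v1",   # LLM_api_base
)

run_game(
    maps.get("AutomatonLE"),                    # any ladder map placed in StarCraft II\Maps
    [
        Bot(Race.Protoss, llm_bot),             # player_race: Protoss only for now
        Computer(Race.Zerg, Difficulty.Hard),   # opposite_race; python-sc2 "Hard" = client "Harder" (Lv5)
    ],
    realtime=False,
    save_replay_as="replays/demo.SC2Replay",    # replay_folder
)
```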
In our experiments we added some additional settings; for several reasons, these settings are coming soon:

- `num_agents`: the environment will support LLM agent vs LLM agent, or vs an RL agent.
- `env_type`: the environment will support Text or MultiModal observations.
- `opposite_type`: the environment will support several human-designed bot AIs.

If you want to use another LLM to build your own LLM agent, here is what you should know.
- LLM: in our repo, you request the LLM through the `ChatBot_SingleTurn` function in `TextStarCraft2_2/LLM/gpt_test`.
- L1_summarize: our level-1 summarization method is `generate_summarize_L1` in `TextStarCraft2_2/summarize/L1_summarize.py`.
- L2_summarize: our level-2 summarization method is `L2_summary` in `TextStarCraft2_2/summarize/gpt_test/L2_summarize.py`.
- action dict: the actions the LLM agent can use are defined in `TextStarCraft2_2/utils/action_info.py`; you can add more actions for the LLM agent there.
- action extractor: decisions are extracted by `TextStarCraft2_2/utils/action_extractor.py`.
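To make the data flow concrete, here is a minimal sketch of how one decision step could wire these pieces together. The signatures and arguments of `generate_summarize_L1`, `L2_summary`, `ChatBot_SingleTurn`, and the extractor function name are assumptions for illustration; consult the files above for the real interfaces.

```python
# Hypothetical Chain of Summarization step; the signatures below are assumptions,
# not the repo's exact API.
from TextStarCraft2_2.summarize.L1_summarize import generate_summarize_L1     # single-frame summary
from TextStarCraft2_2.summarize.gpt_test.L2_summarize import L2_summary       # multi-frame summary
from TextStarCraft2_2.LLM.gpt_test import ChatBot_SingleTurn                  # single-turn LLM call
from TextStarCraft2_2.utils.action_extractor import extract_actions           # hypothetical function name

def cos_step(raw_observations, history_summaries):
    """One decision step: raw obs -> L1 summaries -> L2 analysis -> LLM decision -> actions."""
    # 1. Single-frame summarization: compress each raw observation into text.
    l1_texts = [generate_summarize_L1(obs) for obs in raw_observations]

    # 2. Multi-frame summarization: analyze recent frames and build the prompt.
    l2_prompt = L2_summary(l1_texts, history_summaries)

    # 3. Query the LLM for analysis, command recommendations, and a strategic decision.
    llm_reply = ChatBot_SingleTurn(l2_prompt)

    # 4. Parse the reply into executable in-game actions.
    return extract_actions(llm_reply)
```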
The core of our TextStarCraft II env is `TextStarCraft2_2/env/bot`. Here you can add more settings for the environment, so if you want to implement Terran and Zerg bots, modify the code in this directory.

- State: in `Protoss_bot.py`, the state of the environment is generated by the `get_information` function. This is what we call the Obs-to-Text adaptor.
- Action: in `Protoss_bot.py`, the action space of the agent is defined by the `handle_action` functions. This is what we call the Text-to-Action adaptor.
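For orientation, here is a minimal sketch of what the two adaptors look like inside a python-sc2 bot. The class and method bodies below are illustrative assumptions; the real `get_information` and `handle_action` implementations in `Protoss_bot.py` are far more detailed.

```python
# Illustrative adaptor sketch inside a python-sc2 BotAI subclass (not the repo's code).
from sc2.bot_ai import BotAI
from sc2.ids.unit_typeid import UnitTypeId

class MinimalProtossBot(BotAI):
    def get_information(self) -> str:
        """Obs-to-Text adaptor: turn structured game state into a textual observation."""
        return (
            f"time={self.time:.0f}s, minerals={self.minerals}, gas={self.vespene}, "
            f"supply={self.supply_used}/{self.supply_cap}, "
            f"probes={self.workers.amount}, nexuses={self.townhalls.amount}"
        )

    async def handle_action(self, action: str) -> None:
        """Text-to-Action adaptor: map a textual decision back onto game commands."""
        if action == "TRAIN PROBE" and self.townhalls and self.can_afford(UnitTypeId.PROBE):
            self.townhalls.random.train(UnitTypeId.PROBE)
        elif action == "BUILD PYLON" and self.townhalls and self.can_afford(UnitTypeId.PYLON):
            await self.build(UnitTypeId.PYLON, near=self.townhalls.random)

    async def on_step(self, iteration: int) -> None:
        obs_text = self.get_information()        # text that would be summarized for the LLM
        await self.handle_action("TRAIN PROBE")  # placeholder for the LLM's decision
```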
We have tested several LLMs in our experiments; usage examples are in the `sc2_rl_agent/starcraftenv_test/LLM` folder.
- Online LLM: GPT3.5-Turbo, GLM4, Gemini-Pro, Claude2.
- Local LLM: GLM3, QWEN, QWEN1.5.
.Our framework in TextStarCraft II extends traditional StarCraft II analytics to evaluate LLM agents’ strategies with metrics tailored for AI gameplay performance:
Win Rate: Reflects the agent's performance, calculated as the percentage of games won out of total games played.
Population Block Ratio (PBR): Indicates macro-management effectiveness, focusing on resource allocation and population growth. A higher PBR suggests less effective macro-strategy due to more time spent at population cap.
Resource Utilization Ratio (RUR): Measures how efficiently the agent manages resources throughout the game. Higher RUR indicates underutilization of resources.
Average Population Utilization (APU): Assesses efficiency in utilizing population capacity. Higher APU indicates better macro-management.
Technology Rate (TR): Evaluates the agent's use of the technology tree, showing the proportion of technologies and buildings completed. It reflects the agent’s technological advancement.
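To make these definitions concrete, here is a minimal sketch of how the metrics could be computed from per-frame statistics. The field names and exact formulas are assumptions derived from the qualitative descriptions above (e.g., PBR as the fraction of frames spent supply-blocked), not the paper's evaluation code.

```python
# Illustrative metric sketch based on the qualitative definitions above;
# field names and exact formulas are assumptions, not the paper's evaluation code.
from dataclasses import dataclass

@dataclass
class FrameStats:
    supply_used: int
    supply_cap: int
    minerals: int   # unspent minerals at this frame
    vespene: int    # unspent gas at this frame

def population_block_ratio(frames: list[FrameStats]) -> float:
    """PBR: fraction of frames spent at the population cap (higher = worse macro)."""
    return sum(1 for f in frames if f.supply_used >= f.supply_cap) / len(frames)

def resource_utilization_ratio(frames: list[FrameStats]) -> float:
    """RUR: average unspent resources per frame (higher = more underutilization)."""
    return sum(f.minerals + f.vespene for f in frames) / len(frames)

def average_population_utilization(frames: list[FrameStats]) -> float:
    """APU: average supply_used / supply_cap (higher = better macro-management)."""
    return sum(f.supply_used / max(f.supply_cap, 1) for f in frames) / len(frames)

def technology_rate(completed: int, total: int) -> float:
    """TR: proportion of the technology tree (techs and buildings) completed."""
    return completed / total
```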