| Research preview | Paper | Website |
Latest News 🔥
- You can now install our package with `pip install knowledge-storm`!
- We add `VectorRM` to support grounding on user-provided documents, complementing the existing support of search engines (`YouRM`, `BingSearch`). (check out #58)
- We now configure the article generation part in our demo using the GPT-4o model.
- We release the refactored pipeline code (see `src/storm_wiki`) to demonstrate how to instantiate the pipeline, and provide an API to support customization of different language models and retrieval/search integration.
STORM is an LLM system that writes Wikipedia-like articles from scratch based on Internet search.
While the system cannot produce publication-ready articles, which often require a significant number of further edits, experienced Wikipedia editors have found it helpful in their pre-writing stage.
Try out our live research preview to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!
STORM breaks down generating long articles with citations into two steps:
1. Pre-writing stage: the system conducts Internet-based research to collect references and generates an outline.
2. Writing stage: the system uses the outline and references to produce the full-length article with citations.
STORM identifies the core of automating the research process as automatically coming up with good questions to ask. Directly prompting a language model to ask questions does not work well. To improve the depth and breadth of the questions, STORM adopts two strategies:
1. Perspective-guided question asking: given the input topic, STORM discovers different perspectives by surveying existing articles on similar topics and uses them to control the question-asking process.
2. Simulated conversation: STORM simulates a conversation between a Wikipedia writer and a topic expert grounded in Internet sources, which lets the language model update its understanding of the topic and ask follow-up questions.
Based on the separation of the two stages, STORM is implemented in a highly modular way using dspy.
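As a rough illustration of how one such step can be expressed as a dspy component, the sketch below defines a hypothetical signature for perspective-guided question asking; the class and field names are illustrative only, not the ones used in knowledge_storm:

```python
import dspy

# Hypothetical signature for perspective-guided question asking; the real
# prompts and signatures live in knowledge_storm/storm_wiki/modules/.
class AskQuestionWithPersona(dspy.Signature):
    """Ask an insightful question about the topic from the given perspective."""
    topic = dspy.InputField(desc="topic being researched")
    persona = dspy.InputField(desc="perspective the Wikipedia writer adopts")
    dialogue = dspy.InputField(desc="conversation so far with the topic expert")
    question = dspy.OutputField(desc="a single follow-up question")

# Each pipeline step is a small dspy component like this, so swapping prompts
# or models only touches one module.
ask_question = dspy.ChainOfThought(AskQuestionWithPersona)
# Calling ask_question(...) requires an LM to be configured first,
# e.g. dspy.settings.configure(lm=...).
```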
To install the knowledge storm library, use `pip install knowledge-storm`.
You can also install from the source code, which allows you to modify the behavior of the STORM engine directly.
Clone the git repository.
```bash
git clone https://github.com/stanford-oval/storm.git
cd storm
```
Install the required packages.
```bash
conda create -n storm python=3.11
conda activate storm
pip install -r requirements.txt
```
The STORM knowledge curation engine is defined as a simple Python `STORMWikiRunner` class.
As STORM works in the information curation layer, you need to set up the information retrieval module and the language model module to create a `STORMWikiRunner` instance. Here is an example using the You.com search engine and OpenAI models.
```python
import os
from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from knowledge_storm.lm import OpenAIModel
from knowledge_storm.rm import YouRM
lm_configs = STORMWikiLMConfigs()
openai_kwargs = {
'api_key': os.getenv("OPENAI_API_KEY"),
'temperature': 1.0,
'top_p': 0.9,
}
# STORM is an LM system, so different components can be powered by different models
# to reach a good balance between cost and quality.
# As good practice, choose a cheaper/faster model for `conv_simulator_lm`, which is used
# to split queries and synthesize answers in the conversation.
# Choose a more powerful model for `article_gen_lm` to generate verifiable text with citations.
gpt_35 = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
gpt_4 = OpenAIModel(model='gpt-4o', max_tokens=3000, **openai_kwargs)
lm_configs.set_conv_simulator_lm(gpt_35)
lm_configs.set_question_asker_lm(gpt_35)
lm_configs.set_outline_gen_lm(gpt_4)
lm_configs.set_article_gen_lm(gpt_4)
lm_configs.set_article_polish_lm(gpt_4)
# Check out the STORMWikiRunnerArguments class for more configurations.
engine_args = STORMWikiRunnerArguments(...)
rm = YouRM(ydc_api_key=os.getenv('YDC_API_KEY'), k=engine_args.search_top_k)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
```
Currently, our package supports:
- `OpenAIModel`, `AzureOpenAIModel`, `ClaudeModel`, `VLLMClient`, `TGIClient`, `TogetherClient`, `OllamaClient` as language model components
- `YouRM`, `BingSearch`, `VectorRM` as retrieval module components

:star2: PRs for integrating more language models into knowledge_storm/lm.py and search engines/retrievers into knowledge_storm/rm.py are highly appreciated!
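Swapping one of these components only changes how that module is constructed. For example, grounding on Bing instead of You.com might look like the sketch below, reusing `engine_args` and `lm_configs` from the example above; the `BingSearch` constructor arguments shown are assumptions, so check knowledge_storm/rm.py for the exact signature:

```python
import os
from knowledge_storm.rm import BingSearch

# Assumed constructor arguments; verify against knowledge_storm/rm.py.
rm = BingSearch(bing_search_api_key=os.getenv('BING_SEARCH_API_KEY'),
                k=engine_args.search_top_k)
runner = STORMWikiRunner(engine_args, lm_configs, rm)
```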
The `STORMWikiRunner` instance can be invoked with the simple `run` method:
```python
topic = input('Topic: ')

runner.run(
    topic=topic,
    do_research=True,
    do_generate_outline=True,
    do_generate_article=True,
    do_polish_article=True,
)
runner.post_run()
runner.summary()
```
- `do_research`: if True, simulate conversations with different perspectives to collect information about the topic; otherwise, load the results.
- `do_generate_outline`: if True, generate an outline for the topic; otherwise, load the results.
- `do_generate_article`: if True, generate an article for the topic based on the outline and the collected information; otherwise, load the results.
- `do_polish_article`: if True, polish the article by adding a summarization section and (optionally) removing duplicate content; otherwise, load the results.
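Because each stage can be skipped and its cached output loaded instead, a later run on the same topic can reuse earlier work. For example, to keep the research and outline from a previous full run and only regenerate and polish the article (a minimal sketch using the `runner` and `topic` from above):

```python
# Reuse cached research and outline from a previous full run on this topic;
# only article generation and polishing are executed.
runner.run(
    topic=topic,
    do_research=False,
    do_generate_outline=False,
    do_generate_article=True,
    do_polish_article=True,
)
runner.post_run()
runner.summary()
```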
We provide scripts in our examples folder as a quick start to run STORM with different configurations.

To run STORM with `gpt` family models with default configurations:
We use `secrets.toml` to set up the API keys. Create a file `secrets.toml` under the root directory and add the following content:
```toml
# Set up OpenAI API key.
OPENAI_API_KEY="your_openai_api_key"
# If you are using the API service provided by OpenAI, include the following line:
OPENAI_API_TYPE="openai"
# If you are using the API service provided by Microsoft Azure, include the following lines:
OPENAI_API_TYPE="azure"
AZURE_API_BASE="your_azure_api_base_url"
AZURE_API_VERSION="your_azure_api_version"
# Set up You.com search API key.
YDC_API_KEY="your_youcom_api_key"
```
Then run the following command:

```bash
python examples/run_storm_wiki_gpt.py \
    --output-dir $OUTPUT_DIR \
    --retriever you \
    --do-research \
    --do-generate-outline \
    --do-generate-article \
    --do-polish-article
```
To run STORM with your favorite language models or to ground it on your own corpus, check out examples/README.md.
If you have installed the source code, you can customize STORM based on your own use case. The STORM engine consists of 4 modules:
1. Knowledge curation module: collects a broad coverage of information about the given topic.
2. Outline generation module: organizes the collected information into a hierarchical outline.
3. Article generation module: populates the generated outline with the collected information.
4. Article polishing module: refines and enhances the written article for better presentation.
The interface for each module is defined in `knowledge_storm/interface.py`, while their implementations are instantiated in `knowledge_storm/storm_wiki/modules/*`. These modules can be customized according to your specific requirements (e.g., generating sections in bullet point format instead of full paragraphs).
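As a rough sketch of the kind of customization this enables, the hypothetical dspy signature below asks for bullet-point sections instead of full paragraphs; it is illustrative only and not the actual signature used in knowledge_storm/storm_wiki/modules/:

```python
import dspy

# Hypothetical signature: swap the section-writing prompt so that sections
# come back as bullet points rather than full paragraphs.
class WriteBulletSection(dspy.Signature):
    """Write a Wikipedia-style section as concise bullet points with inline citations."""
    topic = dspy.InputField(desc="topic of the article")
    section_title = dspy.InputField(desc="section heading from the generated outline")
    references = dspy.InputField(desc="numbered snippets collected during research")
    section = dspy.OutputField(desc="bullet-point section citing sources as [1], [2], ...")

write_section = dspy.ChainOfThought(WriteBulletSection)
```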
To reproduce the results reported in our NAACL 2024 paper, please switch to the branch NAACL-2024-code-backup.
Our team is actively working on:
If you have any questions or suggestions, please feel free to open an issue or pull request. We welcome contributions to improve the system and the codebase!
Contact persons: Yijia Shao and Yucheng Jiang
We would like to thank Wikipedia for their excellent open-source content. The FreshWiki dataset is sourced from Wikipedia, licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
We are very grateful to Michelle Lam for designing the logo for this project and Dekun Ma for leading the UI development.
Please cite our paper if you use this code or part of it in your work:
```bibtex
@inproceedings{shao2024assisting,
  title={{Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models}},
  author={Yijia Shao and Yucheng Jiang and Theodore A. Kanell and Peter Xu and Omar Khattab and Monica S. Lam},
  year={2024},
  booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)}
}
```