Empowering knowledge exploration and generation through LlamaIndex and RAG on the full Wikipedia knowledge base.
https://github.com/lyyf2002/RAG-WIKI/assets/57008530/acbac812-6924-4d6d-9e4f-309871dfd211
Clone this repo: `git clone https://github.com/lyyf2002/RAG-WIKI`
Download the subset of Wikipedia preprocessed by the author, which contains only about 200 MB of text.
Ensure the files follow the hierarchy below (`storage` is the directory where the index is stored):
ROOT
├── wiki
├── storage
└── RAG-WIKI
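The layout above can be verified with a short stdlib check before running the app. This is just a sketch; the directory names come from the tree above, and `check_layout` is a hypothetical helper, not part of the repo:

```python
from pathlib import Path

# Directory names taken from the tree above.
EXPECTED = ["wiki", "storage", "RAG-WIKI"]

def check_layout(root):
    """Return the expected subdirectories that are missing under ROOT."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

If `check_layout(".")` returns an empty list when run from `ROOT`, the hierarchy matches.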
To process the full Wikipedia dump or other data of your own, follow steps 5-7.
Download the full Wikipedia dump you want from the Wikipedia database backup dumps, e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for English.
Use wikiextractor to extract cleaned text from the dump: `wikiextractor -o wiki --json --no-templates enwiki-latest-pages-articles.xml.bz2`
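With `--json`, wikiextractor writes one JSON document per line into the `wiki` directory. A minimal stdlib sketch of reading that output (the field names `id`, `url`, `title`, `text` are what wikiextractor emits, assumed here rather than taken from this repo's code):

```python
import json

def iter_articles(lines):
    """Yield (title, text) pairs from wikiextractor --json output.

    Each input line is one JSON object describing one article.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        yield doc["title"], doc["text"]

# Illustrative line in the wikiextractor --json format:
sample = '{"id": "12", "url": "https://en.wikipedia.org/wiki?curid=12", "title": "Anarchism", "text": "Anarchism is a political philosophy."}'
articles = list(iter_articles([sample]))
```

In practice you would pass the open file objects from `wiki/AA/wiki_00` and its siblings instead of an inline sample.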
The `storage` directory will be created by `app.py` the first time you run it. You can change the path to load a different previously stored index.
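The load-or-build behaviour described above can be sketched with a stdlib stand-in. In the repo itself, LlamaIndex presumably handles persistence; the JSON file and `build_fn` below are purely illustrative:

```python
import json
from pathlib import Path

def load_or_build_index(storage_dir, build_fn):
    """Reuse a persisted index from storage_dir if present, else build and persist it.

    A stand-in for the pattern app.py follows: build_fn and the JSON
    persistence here are illustrative, not the repo's actual code.
    """
    index_file = Path(storage_dir) / "index.json"
    if index_file.exists():
        return json.loads(index_file.read_text())
    index = build_fn()                      # expensive: only runs on first use
    index_file.parent.mkdir(parents=True, exist_ok=True)
    index_file.write_text(json.dumps(index))
    return index
```

Pointing `storage_dir` at a different path therefore selects a different persisted index, matching the note above.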
`cd RAG-WIKI`
Update `api_base` and `api_key` in `app.py`. You can get a free key for testing at https://github.com/chatanywhere/GPT_API_free
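OpenAI-compatible gateways such as GPT_API_free are usually wired up by pointing a base URL and a key at the proxy instead of api.openai.com. One common way is via the environment variables recognised by the openai/langchain clients — an assumption about typical setups, not necessarily this repo's mechanism (the values below are placeholders):

```python
import os

# Hypothetical values — substitute your gateway's base URL and your own key.
os.environ["OPENAI_API_BASE"] = "https://example-gateway/v1"  # corresponds to api_base in app.py
os.environ["OPENAI_API_KEY"] = "sk-placeholder"               # corresponds to api_key in app.py
```

Editing the two variables directly in `app.py`, as the step above says, has the same effect.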
Install the dependencies: `pip install streamlit llama-index langchain`
Run the app: `streamlit run app.py`