Empowering knowledge exploration and generation through LlamaIndex and RAG on the full Wikipedia knowledge base.
https://github.com/lyyf2002/RAG-WIKI/assets/57008530/acbac812-6924-4d6d-9e4f-309871dfd211
Clone this repo: `git clone https://github.com/lyyf2002/RAG-WIKI`
Download the subset of Wikipedia preprocessed by the author, which contains only about 200 MB of text.
Ensure the files follow the hierarchy below (`storage` is the directory where the index is stored):
ROOT
├── wiki
├── storage
└── RAG-WIKI
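The layout above can be verified with a short stdlib check before running the app. This is just a sketch; the directory names come from the tree above, and `check_layout` is a hypothetical helper, not part of the repo:

```python
from pathlib import Path

# Directory names taken from the tree above.
EXPECTED = ["wiki", "storage", "RAG-WIKI"]

def check_layout(root):
    """Return the expected subdirectories that are missing under ROOT."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

If `check_layout(".")` returns an empty list when run from `ROOT`, the hierarchy matches.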
To process the full Wikipedia dump or other data of your own, follow steps 5-7.
Download the full Wikipedia dump you want from the Wikipedia database backup dumps, e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for English.
Use wikiextractor to extract cleaned text from the dump: `wikiextractor -o wiki --json --no-templates enwiki-latest-pages-articles.xml.bz2`
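With `--json`, wikiextractor writes one JSON document per line into the `wiki` directory. A minimal stdlib sketch of reading that output (the field names `id`, `url`, `title`, `text` are what wikiextractor emits, assumed here rather than taken from this repo's code):

```python
import json

def iter_articles(lines):
    """Yield (title, text) pairs from wikiextractor --json output.

    Each input line is one JSON object describing one article.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        yield doc["title"], doc["text"]

# Illustrative line in the wikiextractor --json format:
sample = '{"id": "12", "url": "https://en.wikipedia.org/wiki?curid=12", "title": "Anarchism", "text": "Anarchism is a political philosophy."}'
articles = list(iter_articles([sample]))
```

In practice you would pass the open file objects from `wiki/AA/wiki_00` and its siblings instead of an inline sample.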
The `storage` directory will be created by `app.py` the first time you run it. You can change the path to load a different previously stored index.
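The load-or-build behaviour described above can be sketched with a stdlib stand-in. In the repo itself, LlamaIndex presumably handles persistence; the JSON file and `build_fn` below are purely illustrative:

```python
import json
from pathlib import Path

def load_or_build_index(storage_dir, build_fn):
    """Reuse a persisted index from storage_dir if present, else build and persist it.

    A stand-in for the pattern app.py follows: build_fn and the JSON
    persistence here are illustrative, not the repo's actual code.
    """
    index_file = Path(storage_dir) / "index.json"
    if index_file.exists():
        return json.loads(index_file.read_text())
    index = build_fn()                      # expensive: only runs on first use
    index_file.parent.mkdir(parents=True, exist_ok=True)
    index_file.write_text(json.dumps(index))
    return index
```

Pointing `storage_dir` at a different path therefore selects a different persisted index, matching the note above.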
`cd RAG-WIKI`
Update `api_base` and `api_key` in `app.py`. You can get a free key for testing at https://github.com/chatanywhere/GPT_API_free
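OpenAI-compatible gateways such as GPT_API_free are usually wired up by pointing a base URL and a key at the proxy instead of api.openai.com. One common way is via the environment variables recognised by the openai/langchain clients — an assumption about typical setups, not necessarily this repo's mechanism (the values below are placeholders):

```python
import os

# Hypothetical values — substitute your gateway's base URL and your own key.
os.environ["OPENAI_API_BASE"] = "https://example-gateway/v1"  # corresponds to api_base in app.py
os.environ["OPENAI_API_KEY"] = "sk-placeholder"               # corresponds to api_key in app.py
```

Editing the two variables directly in `app.py`, as the step above says, has the same effect.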
Install the dependencies: `pip install streamlit llama-index langchain`
Run the app: `streamlit run app.py`