harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/
MIT License

WARC-GPT

WARC + AI: Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

More info:

https://github.com/harvard-lil/warc-gpt/assets/625889/8ea3da4a-62a1-4ffa-a510-ef3e35699237



Installation

WARC-GPT requires the following machine-level dependencies to be installed: Python 3.11 and Poetry.

Use the following commands to clone the project and install its dependencies:

git clone https://github.com/harvard-lil/warc-gpt.git
cd warc-gpt
poetry env use 3.11
poetry install

If you don't want to use Poetry, or are in some context where that doesn't work, you can clone the repo, create a virtual environment, and install the dependencies like this:

git clone https://github.com/harvard-lil/warc-gpt.git
cd warc-gpt
python3 -m venv env
. env/bin/activate
pip install .

If you choose this method, remove the `poetry run` prefix from the commands below.



Configuring the application

This program uses environment variables to handle settings. Copy .env.example into a new .env file and edit it as needed.

cp .env.example .env

See details for individual settings in .env.example.



Ingesting WARCs

Place the WARC files you would like to explore with WARC-GPT under ./warc and run the following command to ingest them:

poetry run flask ingest

# May help with performance in certain cases: only ingest 1 chunk of text at a time.
poetry run flask ingest --batch-size 1

Note: Running ingest clears the ./chromadb folder.
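For context, ingestion extracts text from each WARC record, splits it into chunks, embeds them, and stores the vectors under ./chromadb. A minimal sketch of the overlapping-chunking step (the `chunk_text` helper and the `chunk_size`/`overlap` values are illustrative, not WARC-GPT's actual code; its chunking is configured via .env):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap slightly, so a
    sentence straddling a boundary appears whole in at least one chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Each chunk is then embedded and written to the vector store,
# one batch at a time (cf. the --batch-size option above).
pieces = chunk_text("word " * 300)
```

Smaller chunks tend to retrieve more precisely; larger ones preserve more surrounding context.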



Starting the server

The following command will start WARC-GPT's server on port 5000.

poetry run flask run
# Note: Use --port to use a different port



Interacting with the web UI

Once the server is started, the application's web UI should be available on http://localhost:5000.

Unless RAG search is disabled in settings, the system will try to find relevant excerpts in its knowledge base - populated ahead of time using WARC files and the ingest command - to answer the questions it is asked.

The interface also automatically keeps a basic chat history, allowing for few-shot / chain-of-thought prompting.
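Conceptually, the RAG step amounts to prepending the retrieved excerpts to the user's question before asking the LLM for a completion. A minimal sketch (the prompt wording and the `build_rag_prompt` helper are illustrative, not WARC-GPT's actual template; the excerpt fields mirror the /api/search output):

```python
def build_rag_prompt(question: str, excerpts: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved WARC excerpts."""
    # Each excerpt dict is shaped like an /api/search result object.
    context = "\n\n".join(
        f"[{e['warc_record_target_uri']}]\n{e['warc_record_text']}"
        for e in excerpts
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

When RAG search is disabled, the question is sent to the model without this retrieved context.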



Interacting with the API

[GET] /api/models

Returns a list of available models as JSON.

[POST] /api/search

Performs a similarity search against the vector store for a given message.

Accepts a JSON body with the following properties:
- `message`: User prompt (required)

Returns a JSON array of objects containing the following properties:
- `[].warc_filename`: Filename of the WARC file the excerpt comes from.
- `[].warc_record_content_type`: Content type of the record; starts with either `text/html` or `application/pdf`.
- `[].warc_record_id`: Identifier of the WARC record within the WARC file.
- `[].warc_record_date`: Date at which the WARC record was created.
- `[].warc_record_target_uri`: Target URI of the WARC record.
- `[].warc_record_text`: Text excerpt.

[POST] /api/complete

Uses an LLM to generate a text completion.

Accepts a JSON body with the following properties:
- `model`: One of the models `/api/models` lists (required)
- `message`: User prompt (required)
- `temperature`: Defaults to 0.0
- `max_tokens`: If provided, caps the number of tokens generated in the response.
- `search_results`: Array, output of `/api/search`.
- `history`: A list of chat completion objects representing the chat history. Each object must contain `role` and `content`.

Returns a raw text stream as output.
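The two endpoints chain naturally: pass the output of `/api/search` as `search_results` to `/api/complete`. A minimal client sketch using only the standard library (the `build_payload` and `post` helpers are ours for illustration, not part of WARC-GPT):

```python
import json
from urllib import request

BASE_URL = "http://localhost:5000"  # default port used by `flask run`

def build_payload(**fields) -> bytes:
    """Serialize a JSON request body for the WARC-GPT API."""
    return json.dumps(fields).encode("utf-8")

def post(endpoint: str, body: bytes) -> request.Request:
    """Build a POST request with a JSON content type."""
    return request.Request(
        f"{BASE_URL}{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def search(message: str) -> list:
    """POST /api/search: returns a JSON array of excerpt objects."""
    with request.urlopen(post("/api/search", build_payload(message=message))) as resp:
        return json.load(resp)

def complete(model: str, message: str, search_results: list) -> str:
    """POST /api/complete: returns a raw text stream, read here to the end."""
    body = build_payload(model=model, message=message, search_results=search_results)
    with request.urlopen(post("/api/complete", body)) as resp:
        return resp.read().decode("utf-8")
```

For long completions you may prefer to read the response stream incrementally rather than all at once.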



Visualizing embeddings

WARC-GPT can generate basic interactive t-SNE 2D scatter plots of the vector stores it creates.

Use the visualize command to do so:

poetry run flask visualize

visualize takes a --questions option, which lets you place questions on the plot:

poetry run flask visualize --questions="Who am I?;Who are you?"



Disclaimer

The Library Innovation Lab is an organization based at the Harvard Law School Library. We are a cross-functional group of software developers, librarians, lawyers, and researchers doing work at the edges of technology and digital information.

Our work is rooted in library principles including longevity, authenticity, reliability, and privacy. Any work that we produce takes these principles as a primary lens. However due to the nature of exploration and a desire to prototype our work with real users, we do not guarantee service or performance at the level of a production-grade platform for all of our releases. This includes WARC-GPT, which is an experimental boilerplate released under MIT License.

Successful experimentation hinges on user feedback, so we encourage anyone interested in trying out our work to do so. It is all open-source and available on GitHub.
