
factgenie


Visualize and annotate errors in LLM outputs.

🚧 The project is a work in progress; use at your own risk. 🚧

[Screenshot: main screen]

Intro

Outputs from large language models (LLMs) may contain errors: semantic, factual, and lexical.

With factgenie, you can have the errors highlighted 🌈:

How does factgenie help with that?

  1. It helps you create a user-friendly website for collecting annotations from human crowdworkers.
  2. It helps you with LLM API calls for collecting equivalent annotations from LLM-based evaluators.
  3. It provides you with a visualization interface for inspecting the annotated outputs.

What factgenie does not help with is collecting the data or model outputs (we assume you already have these), starting the crowdsourcing campaign (for that, you need a service such as Prolific.com), or running the LLM evaluators (for that, you need a local framework such as Ollama or a proprietary API).


This project is a framework and template for you, dear researcher. Help us improve it! :wink:

Quickstart

Make sure you have Python 3 installed (the project is tested with Python 3.10).

The following commands install the package, start the web server, and open the front page in the browser:

pip install -e .
factgenie run --host=127.0.0.1 --port 5000
xdg-open http://127.0.0.1:5000  # on Linux, this opens the page in your browser

Step-by-step guide

Each project is unique. That is why this framework is partially DIY: we assume that it will be customized for a particular use case.

0) Setup Dependencies

Factgenie uses Ollama for running local LLMs and the openai-python API for querying OpenAI LLMs to gather annotations. For crowdsourcing campaigns, it was designed to integrate easily into the Prolific.com workflow.

Read their documentation to set them up; we will prepare step-by-step guides in the future.

For setting up the LLMs for annotation, factgenie needs just the Ollama or OpenAI URL to connect to. From Prolific, you need to obtain a completion code, which will be displayed to the annotators as proof of completed work.
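
As a quick sanity check of the LLM side, you can verify that your backend is reachable before configuring factgenie. Below is a minimal sketch, assuming a local Ollama server at its default URL (http://localhost:11434) and the requests package; for OpenAI, you would instead set OPENAI_API_KEY and use the openai-python client.

# Sanity check: is the Ollama server reachable? (assumes the default URL;
# adjust OLLAMA_URL to the endpoint you will later put into factgenie's config)
import requests

OLLAMA_URL = "http://localhost:11434"

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)  # lists the pulled models
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Ollama is up, available models:", models)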

1) Gather your inputs and outputs

Make sure you have input data and corresponding model outputs from the language model.

By input data, we mean anything that will help the annotators with assessing the factual accuracy of the output.

See the factgenie/data folder for example inputs and the factgenie/outputs folder for example model outputs.

The input data can have any format visualizable in the web interface - anything from plain text to advanced charts. The model outputs should be in plain text.
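
Purely for illustration (the file names and formats below are hypothetical, not a layout required by factgenie), the pairing between inputs and outputs could look like this:

# Hypothetical example of paired inputs and outputs; the file names and formats
# are illustrative only -- see factgenie/data and factgenie/outputs for the
# actual examples shipped with the repository.
import json

with open("my_inputs.json") as f:    # structured input data, one record per example
    inputs = json.load(f)

with open("my_outputs.json") as f:   # plain-text LLM outputs, one string per input
    outputs = json.load(f)

# Each model output should correspond to exactly one input example.
assert len(inputs) == len(outputs)
print(f"{len(inputs)} input-output pairs ready for annotation")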

2) Prepare a data loader

Write a data loader class for your dataset. The class needs to subclass the Dataset class in factgenie/loaders/dataset.py and implement its methods.

Notably, you need to implement the methods that load your data and render the inputs in the web interface.

You can get inspired by the example datasets in factgenie/loaders/dataset.py.
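
Below is a rough sketch of what such a subclass can look like. The method names (load_examples, render) are illustrative assumptions, not factgenie's actual interface; the Dataset base class in factgenie/loaders/dataset.py defines the methods you really need to override.

# Illustrative sketch only: the method names below are assumptions, not
# factgenie's actual API. Check the Dataset base class in
# factgenie/loaders/dataset.py for the real interface.
import json

from factgenie.loaders.dataset import Dataset


class MyDataset(Dataset):
    def load_examples(self, split, data_path):
        # Hypothetical hook: read the raw inputs for the given split.
        with open(f"{data_path}/{split}.json") as f:
            return json.load(f)

    def render(self, example):
        # Hypothetical hook: return the HTML shown to annotators for one example.
        return f"<pre>{json.dumps(example, indent=2)}</pre>"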

3) Run the web interface

To check that everything works as expected, fire up the web interface 🔥

First, install the Python package (the project is tested with Python 3.10):

pip install -e .

Start the local web server:

factgenie run --host=127.0.0.1 --port 8890

After opening the page http://127.0.0.1:8890 in your browser, you should be able to see the front page:

[Screenshot: main screen]

Go to /browse. Make sure that you can select your dataset in the navigation bar and browse through the examples.

4) Annotate the outputs with LLMs

For collecting the annotations from an LLM, you will first need to get access to one. The options we recommend are Ollama for local open models and the OpenAI API for proprietary models.

In general, you can integrate factgenie with any API that allows decoding responses as JSON (or any API as long as you can get a JSON by postprocessing the response).
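
If your API cannot be forced to return JSON directly, the usual postprocessing trick is to pull the first JSON object out of the raw response text. Here is a minimal sketch (generic Python, not a factgenie utility; the example response string is made up):

# Extract a JSON object from a free-form model response.
import json
import re

raw = 'Sure! Here are the errors: {"errors": [{"text": "in 2021", "type": 0}]}'

match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # outermost {...} in the text
if match is None:
    raise ValueError("no JSON object found in the model response")

annotations = json.loads(match.group(0))
print(annotations["errors"])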

You also need to customize the YAML configuration file in factgenie/llm-eval by setting the model prompt, optionally along with the system message, model parameters, etc. Keep in mind that the prompt needs to ask the model to produce JSON outputs in the following format:

{
  "errors": [
    {
      "text": [TEXT_SPAN],
      "type": [ERROR_CATEGORY]
    },
    ...
  ]
}

The provided examples should help you with setting up the prompt.
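
Before launching a full campaign, it can be worth checking that your prompt reliably produces the format above. The helper below is our own illustration (not part of factgenie) and simply verifies the structure shown in the snippet:

# Verify that a parsed response matches the expected annotation format:
# {"errors": [{"text": ..., "type": ...}, ...]}
def check_annotation_format(response: dict) -> None:
    assert isinstance(response.get("errors"), list), "missing 'errors' list"
    for i, error in enumerate(response["errors"]):
        assert "text" in error, f"error #{i} is missing the 'text' span"
        assert "type" in error, f"error #{i} is missing the 'type' category"


check_annotation_format({"errors": [{"text": "in 2021", "type": 0}]})  # passes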

Once you have the configuration file ready, create a new LLM evaluation campaign in the web interface.

Your eval should appear in the list:

[Screenshot]

Now you need to go to the campaign details and run the evaluation. The annotated examples will be marked as finished:

[Screenshot]

5) Annotate the outputs with human crowd workers

For collecting the annotations from human crowd workers, you typically need to build an annotation interface, host it online, distribute the outputs among the annotators, and collect the results.

👉️ With factgenie, you will hardly need to spend any time on any of these!

Starting a campaign

First, we will start a new crowdsourcing campaign from the web interface.

Your campaign should appear in the list:

[Screenshot]

You can now preview the annotation page by clicking on the 👁️‍🗨️ icon. If a crowd worker opens this page, the corresponding batch of examples will be assigned to them.

Since we are using the dummy PROLIFIC_PID parameter (test), we can preview the page and submit annotations without having this particular batch assigned.

Customizing the annotation page

And now it's your turn. To customize the annotation page, go to factgenie/templates/campaigns/<your_campaign_id> and modify the annotate.html file.

You will typically need to write custom instructions for the crowd workers, include JavaScript libraries necessary for rendering your inputs, or write custom JavaScript code.

You can get inspired by the example campaign in factgenie/templates/campaigns/.

Submit the annotations from the Preview page (and delete the resulting files) to ensure that everything works from your point of view.

[Screenshot]

Launch the crowdsourcing campaign

By clicking on the Details button, you can get the link that you can paste on Prolific. At this point, you need to run the server on a public URL so that it is accessible to the crowd workers.

On the details page, you can monitor how individual batches get assigned and completed.

6) View the results

Once the annotations are collected, you can view them on the /browse page. The annotations from each campaign can be selected in the drop-down menu above the model outputs.

[Screenshot]

Core Developers

Optional use of Git Large File Storage (git lfs)