We discussed this on the 2024-07-25 GAML call.
A couple of things came up that we’ll need to figure out as we get stuck into the project, and no doubt more will arise. We’ll use this project to document these questions, our thoughts around them, and the answers or solutions we come up with.
Guardrails: We’re satisfied that, given the collaboration context, we don’t need to consider prompt injection via user-generated dataset questions. We should make it clear in the UI where we’re presenting experimental AI-generated content, and ensure CE UK drive the message home in volunteer comms. The scorecards project has multiple layers of data quality checking.
Real Time: There was some discussion about whether the suggestions would be generated in real time. This is up for debate, but I was imagining some sort of background job – either triggered by an appropriate event (on receiving a response, on successful classification, etc) or on a timer (nightly, every few hours, etc).
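For the timer-driven variant, a rough sketch of the shape of such a job (the task-fetching and generation functions are placeholders, and in Alaveteli this would more likely be a queued background job or cron task than a standalone script):

```python
import time

POLL_INTERVAL_SECONDS = 4 * 60 * 60  # the "every few hours" option; a nightly run would also work

def pending_extract_tasks():
    """Placeholder: return dataset questions for responses received (or
    successfully classified) since the last run."""
    return []

def generate_suggestions_for(task):
    """Placeholder: send the question plus the relevant response text to the
    model and persist the resulting suggestion."""
    pass

if __name__ == "__main__":
    while True:
        for task in pending_extract_tasks():
            generate_suggestions_for(task)
        time.sleep(POLL_INTERVAL_SECONDS)
```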
Hosting: We’re only aiming to run any models for the data development and extraction period of the scorecards project (a month or two). A 128GB RAM VM wouldn’t be out of the question for the experiment.
General toolchain: To what extent can we develop a toolchain that locks generated responses down to the specific input content we provide? Kip’s toolchain generated vector embeddings. What do they add? Do we need them?
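One reading of what the embeddings add: they let us retrieve only the chunks of a response that are relevant to a given question, so the model only ever sees (and can only quote from) text we supplied. A rough sketch, assuming a local OpenAI-compatible server that exposes an embeddings model; the endpoint URL and model name are placeholders:

```python
import numpy as np
from openai import OpenAI

# Placeholder endpoint/model: any local OpenAI-compatible server exposing an
# embeddings model would do.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def embed(texts):
    response = client.embeddings.create(model="local-embedding-model", input=texts)
    return np.array([item.embedding for item in response.data])

def top_chunks(question, chunks, k=3):
    """Return the k chunks of the response most similar to the question."""
    chunk_vectors = embed(chunks)
    question_vector = embed([question])[0]
    scores = chunk_vectors @ question_vector / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
    )
    return [chunks[i] for i in scores.argsort()[::-1][:k]]
```

Building the prompt only from the retrieved chunks also gives us a natural Excerpt and Source to show alongside the Answer.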
Sending input data: What’s the right balance between a larger context window and pre-processing our input with traditional code or other tools? Do we send each question as a separate prompt, or all of the questions at once?
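A rough sketch of the two prompting shapes under discussion; the wording and layout are illustrative only:

```python
def prompt_for_one_question(question, source_text):
    """One prompt per question: simpler to parse, but re-sends the source text each time."""
    return (
        "Answer the question using only the text below. "
        "If the answer is not in the text, reply 'Unknown'.\n\n"
        f"Text:\n{source_text}\n\nQuestion: {question}\nAnswer:"
    )

def prompt_for_all_questions(questions, source_text):
    """All questions at once: sends the source text only once, but needs a
    larger context window and more careful output parsing."""
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return (
        "Answer each numbered question using only the text below. "
        "Reply with the question number followed by the answer, one per line, "
        "using 'Unknown' where the text does not contain the answer.\n\n"
        f"Text:\n{source_text}\n\nQuestions:\n{numbered}\nAnswers:"
    )
```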
Dealing with output: Getting the Answer into the right format might take a few attempts depending on the model. How do we minimise the resources needed to do this?
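One way to keep this cheap is to ask for a constrained format up front and only re-prompt when parsing fails, with a hard cap on attempts. A rough sketch, assuming a local OpenAI-compatible chat endpoint; the URL and model name are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

FORMAT_INSTRUCTION = (
    'Reply with JSON only, e.g. {"answer": "Yes", "excerpt": "..."}, '
    "with no commentary."
)

def suggest(question, source_text, max_attempts=3):
    messages = [{
        "role": "user",
        "content": f"{FORMAT_INSTRUCTION}\n\nText:\n{source_text}\n\nQuestion: {question}",
    }]
    for _ in range(max_attempts):
        reply = client.chat.completions.create(model="local-model", messages=messages)
        content = reply.choices[0].message.content
        try:
            return json.loads(content)
        except json.JSONDecodeError:
            # Feed the failure back and ask again, up to the attempt limit.
            messages.append({"role": "assistant", "content": content})
            messages.append({"role": "user", "content": "That was not valid JSON. " + FORMAT_INSTRUCTION})
    return None  # fall back to "no suggestion" rather than burning more attempts
```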
Balancing completeness and utility: We know we won’t always be able to parse out an answer, but that’s fine as we’re treating these suggestions as a progressive enhancement. LLMs struggle with tables in PDFs and spreadsheets. Do we use a model that’s good at handling these (since we know we receive a lot of them), or do we use other pre-processing? Where’s the balance of what’s worth doing?
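As an example of the “other pre-processing” option, tables can be pulled out of PDFs with a conventional library and flattened to plain text before anything reaches the model. A rough sketch using pdfplumber (one option among several); the file path is whatever attachment we’re processing:

```python
import pdfplumber

def tables_as_text(pdf_path):
    """Extract tables from each page and flatten them to pipe-separated rows,
    which tend to survive a trip through an LLM better than raw PDF layout."""
    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                lines.append(f"Table on page {page_number}:")
                for row in table:
                    lines.append(" | ".join(cell or "" for cell in row))
    return "\n".join(lines)
```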
Done for now
Wider context: Climate Emergency are re-running the scorecards project this year, which includes gathering some of the data through WhatDoTheyKnow and having volunteers extract datasets with Projects. We’d like to help make their volunteers’ time contributions more efficient by using AI to enhance crowdsourcing capabilities in Projects for PJMF.
Kip (a CEUK volunteer) has experimented with this on a subset of the previous requests and shared the output. His toolchain for the experimentation was: Langchain → vector embeddings → add metadata (filename, council, original text, etc.) → OpenAI for the questions.
At a minimum we could build an isolated prototype outside of Alaveteli to explore the mechanics and interface ideas, but as we know, opportunities for progress can be sporadic, so it’s good if we can get some significant building blocks in place. In any case, we’re not aiming for general availability immediately, so anything we do build into Alaveteli can be feature flagged and limited to this specific use case with relatively controlled data.
Projects: Suggestions
Extracting datasets from lots of FOI requests in Projects can be quite laborious and time-consuming. This is where we’ll dip our toes in the water with AI.
Project Suggestions will increase the efficiency of the data extraction process, but won’t remove the human-in-the-loop. It’ll be a progressive enhancement that can help contributors more easily navigate to documents of interest and locate answers.
For each question in the extract page sidebar we’ll add a “Suggestions” element. This will open a popover where we’ll display the AI-generated assistance.
Popover Content
Here we’ll display three main pieces of information.
1. Answer: The answer suggested by the AI. We’ll have a button that the contributor can click to populate the field with that value.
This will be compliant with the requirements of the answer field (i.e. a number for a numeric field, Yes/No for a boolean field); there’s a rough sketch of this check after the list below.
We could explore including some sort of confidence rating, but it’s not essential for a first pass.
2. Excerpt: An excerpt of the input data that contains the Answer. This will help the contributor understand how the Answer was arrived at.
It would be good to have some sanity checking that the Excerpt actually exists in the Source data.
For multi-page attachments it would be great if we could display the specific page number that we think the Answer is on.
3. Source: The details of the specific piece of the request thread that contains the Answer (e.g. “Response on YYYY-MM-DD” or “Attachment with filename Foo.pdf”). We’ll have a button that jumps us directly to the source content.
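A rough sketch of the two checks mentioned in items 1 and 2 above: coercing the suggested Answer to the field type, and confirming the Excerpt actually appears in the Source text. The field type names are illustrative, not Alaveteli’s:

```python
import re

def coerce_answer(value, field_type):
    """Return a value compliant with the answer field, or None if we can't."""
    text = str(value).strip()
    if field_type == "numeric":
        match = re.search(r"-?\d+(\.\d+)?", text.replace(",", ""))
        return match.group(0) if match else None
    if field_type == "boolean":
        lowered = text.lower()
        if lowered in {"yes", "y", "true"}:
            return "Yes"
        if lowered in {"no", "n", "false"}:
            return "No"
        return None
    return text or None

def _normalise(text):
    return " ".join(text.split()).lower()

def excerpt_in_source(excerpt, source_text):
    """Sanity check: only show an Excerpt we can actually find in the Source."""
    return _normalise(excerpt) in _normalise(source_text)
```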
Fallbacks
In some cases we won’t be able to populate all three of the suggestion fields. For example, there might be some attachment types we can’t run through the AI, or the way the authority has answered might not be clear enough.
Where this happens we’ll display the fields we can and show an “Unknown” value for those we can’t.
We shouldn’t ever display an Answer or Excerpt without a Source.
Persistence
We should store the AI-generated suggestions so that we can later compare them against what was submitted by the contributors. If we're using contributors' answers to train a model, we'll need to make people aware we might do that. We can communicate this manually to CEUK volunteers, but for general use we can include a line in the interface that makes this clear.
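A rough sketch of the kind of record we might store per suggestion so that the comparison is possible later; the field names are placeholders rather than a schema proposal:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SuggestionRecord:
    project_id: int
    question_id: int
    source_ref: str                         # e.g. a reference to the message or attachment
    model_name: str                         # which model produced it
    suggested_answer: Optional[str]
    excerpt: Optional[str]
    generated_at: datetime
    submitted_answer: Optional[str] = None  # filled in once the contributor submits
```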
Configuring
Whatever model and interface we use must run locally, to avoid privacy issues, and should ideally be OpenAI API compatible so that the setup and prompt interface require less change should we want to swap models.
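A rough sketch of what “local and OpenAI API compatible” buys us: the standard client is pointed at whichever local server we run (llama.cpp, Ollama, vLLM, etc.), and swapping models becomes a configuration change rather than a code change. The environment variable names, URL and model name are placeholders:

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server,
# so no request data leaves our infrastructure.
client = OpenAI(
    base_url=os.environ.get("SUGGESTIONS_API_BASE", "http://localhost:8000/v1"),
    api_key=os.environ.get("SUGGESTIONS_API_KEY", "not-needed"),
)
MODEL = os.environ.get("SUGGESTIONS_MODEL", "local-model")

# Swapping models is then just:
#   SUGGESTIONS_API_BASE=http://other-host:8000/v1 SUGGESTIONS_MODEL=other-model
```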
We’ll want this to be configurable at theme level for now (there’s a rough sketch at the end of this section). For example:
Projects should not require AI Suggestions to be configured; without them it will work as it does currently, with manual answers.
We should consider that in future we might want to run multiple models to generate the suggestions as a way of sense-checking accuracy.
Project suggestions should be feature flagged per Project.
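A rough sketch of the shape of that configuration, with a per-Project feature flag; the names are placeholders, and in practice this would live in the theme and Alaveteli’s own feature-flag mechanism rather than a Python module:

```python
# Hypothetical theme-level settings.
AI_SUGGESTIONS = {
    "enabled": True,                 # absent/False: Projects behaves exactly as it does today
    "models": ["local-model"],       # could hold several models for cross-checking later
    "enabled_project_ids": {42},     # feature flag: only these Projects get suggestions
}

def suggestions_enabled_for(project_id):
    return (
        AI_SUGGESTIONS.get("enabled", False)
        and project_id in AI_SUGGESTIONS.get("enabled_project_ids", set())
    )
```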