harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/
MIT License

warc-gpt unable to find information #3

Open peterchanws opened 7 months ago

peterchanws commented 7 months ago

Environment: Apple M1 Pro, macOS 14.3.1, Chrome

I initially uploaded 41 WARC files into WARCgpt. Among these files was an email containing titles and links to several papers related to AI. When I queried WARCgpt about the email's content regarding AI, the system responded that the email did not directly mention AI. Instead, it referenced links to product pages on B&H Photo Video's website for various computer components, such as processors and memory, with encoded parameters that specify the products linked. It was unclear how these components were connected to AI. Although I have some WARC files containing emails from B&H Photo Video, they pertain to cameras, video equipment, etc.

Later, I crawled the page https://arxiv.org/html/2303.08774v5 and ingested it into WARCgpt. When I asked the system what the email said regarding transformers, it accurately responded that the emails discussed several transformer models, such as "gpt-j-6b," "gpt-neo," "bloom," and "opt," describing them as large-scale autoregressive language models. Some emails covered aspects like training, deployment, alignment, and human data collection for these models, in addition to contributions to datasets. The emails were authored by individuals ranging from researchers and engineers to product managers at companies including Microsoft, Meta, and Google. The system also provided the correct sources.

I'm puzzled as to why the initial 41 WARC files didn't yield the correct response.

Here's the link to the files I used: https://drive.google.com/drive/folders/1eMilyDZ9Bc3HrTuu429DtMM7oqPnbOnH

warc_old.zip contains the 41 WARC files.

Archive.zip contains the WARC files I crawled using Browsertrix Crawler.

[Screenshot 2024-03-06 at 10 36 35]

[Screenshot 2024-03-06 at 10 43 12]

peterchanws commented 7 months ago

I looked at the terminal when I asked "do the data talk about privacy?" and saw this: [2024-03-06 12:12:02,661] WARNING in api: litellm could not trim messages for ollama/mistral:latest

matteocargnelutti commented 7 months ago

Hi @peterchanws

> I'm puzzled as to why the initial 41 WARC files didn't yield the correct response.

Very interesting! There are a few things that you could look into here:


> I looked at the terminal when I asked "do the data talk about privacy?" and saw this: [2024-03-06 12:12:02,661] WARNING in api: litellm could not trim messages for ollama/mistral:latest

This is thankfully unrelated. This warning comes from the library we use to talk to LLM APIs. In this case it warns us that it doesn't know the maximum context length of the current model, and therefore cannot trim your messages if that becomes necessary.
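To illustrate why an unknown context length prevents trimming, here is a minimal sketch of the general idea. It is not litellm's actual implementation: the `KNOWN_CONTEXT_LENGTHS` table, the word-based `count_tokens` helper, and the `trim_messages` function below are all hypothetical stand-ins.

```python
import logging

logger = logging.getLogger("api")

# Hypothetical lookup table of context lengths (in tokens) per model.
# A real library ships a much larger table; models missing from it
# cannot be trimmed because there is no budget to trim against.
KNOWN_CONTEXT_LENGTHS = {"example/known-model": 50}


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_messages(messages: list, model: str) -> list:
    """Keep the most recent messages that fit in the model's context length.

    If the context length is unknown, log a warning and return the
    messages untouched.
    """
    max_tokens = KNOWN_CONTEXT_LENGTHS.get(model)
    if max_tokens is None:
        logger.warning("could not trim messages for %s", model)
        return messages

    kept = []
    budget = max_tokens
    # Walk from newest to oldest so the latest context is preserved.
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return list(reversed(kept))
```

With a model that is absent from the table, the function only warns and passes everything through, which matches the behavior described above: the warning is informational and does not by itself change the answer.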

peterchanws commented 7 months ago

I suspect the issue I have is caused by how Mailbagit handles WARC files. So, I ingested a few websites archived using ArchiveWeb.page. This time, the answer to my question is correct, but the source quoted is wrong.

Question: is there information on working with email archives in the files?

Model: ollama/mistral:latest

Yes, there is information on working with email archives in the files you provided. Specifically, the Palladium Project has published guides for handling email archives as part of their work to preserve and make accessible digital collections. The project's blog post from February 2023 discusses their progress in this area and provides links to resources related to email archives.

Context: "This project is a stepping stone in our work with email archives, and there is still a great deal to do, but we are pleased to have made significant progress and taken some really positive steps."

Source: https://blog.jgc.org/2023/02/palladium-project-guides-for-email-archives/

The correct source should be https://rylandscollections.com/2022/02/18/palladium-project-guides-for-email-archives/

Somehow, the system merged it with another source I ingested: https://blog.jgc.org/2024/02/the-original-www-proposal-is-word-for.html

Is this an issue with the LLM?
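One way to narrow this down is to compare every URL in the generated answer against the URLs of the sources that were actually retrieved. A URL that appears in the answer but in none of the retrieved sources was most likely composed by the model rather than returned by retrieval. The sketch below assumes the answer text and the list of retrieved source URLs are available as plain strings; `find_unsupported_urls` is a hypothetical helper, not part of WARC-GPT.

```python
import re

# Matches http(s) URLs up to the next whitespace, quote, or bracket.
URL_PATTERN = re.compile(r'https?://[^\s<>")\]]+')


def find_unsupported_urls(answer: str, retrieved_sources: list) -> list:
    """Return URLs cited in the answer that match no retrieved source URL.

    Trailing slashes are ignored when comparing, and trailing sentence
    punctuation is stripped from URLs found in the answer.
    """
    known = {url.rstrip("/") for url in retrieved_sources}
    cited = [match.rstrip(".,;") for match in URL_PATTERN.findall(answer)]
    return [url for url in cited if url.rstrip("/") not in known]


# Usage with the URLs from this report.
retrieved = [
    "https://rylandscollections.com/2022/02/18/palladium-project-guides-for-email-archives/",
    "https://blog.jgc.org/2024/02/the-original-www-proposal-is-word-for.html",
]
answer = (
    "Source: "
    "https://blog.jgc.org/2023/02/palladium-project-guides-for-email-archives/"
)
unsupported = find_unsupported_urls(answer, retrieved)
```

Here `unsupported` contains the blog.jgc.org URL from the answer, since it matches neither ingested page. That would be consistent with the model blending the host of one source with the path of another, although this check alone cannot rule out a problem in how source metadata is stored or passed to the prompt.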