interstellard / chatgpt-advanced

WebChatGPT: A browser extension that augments your ChatGPT prompts with web results.
https://webchatgpt.app
MIT License
6.45k stars, 838 forks

Use mozilla/readability to extract the text content of webpages #83

Closed (tomasgvivo closed this issue 1 year ago)

tomasgvivo commented 1 year ago

I would like the extension to include the websites' content, not only the headlines. The package https://github.com/mozilla/readability lets you extract the readable text from a webpage (this is what Firefox uses for its "reader mode"). This may simplify the process of extracting text from the websites, but you will obviously run into the token limit for a single message.
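To make the idea concrete, here is a minimal sketch of turning a Readability result into plain text for a prompt. It assumes the object returned by `parse()` exposes `title` and `textContent` (or is `null` when extraction fails); the `ReadableArticle` interface and `articleToPromptText` are names I made up for illustration, not anything from the extension.

```typescript
// Hypothetical subset of the object returned by Readability's parse();
// only the two fields used below are declared.
interface ReadableArticle {
  title: string;
  textContent: string;
}

// Turn a parsed article into compact plain text suitable for a prompt.
// Returns an empty string when extraction failed (parse() returned null).
function articleToPromptText(article: ReadableArticle | null): string {
  if (article === null) return "";
  const body = article.textContent
    .replace(/[ \t]+/g, " ")     // collapse runs of spaces and tabs
    .replace(/\s*\n\s*/g, "\n")  // collapse blank lines and trim around line breaks
    .trim();
  const title = article.title.trim();
  return title.length > 0 ? title + "\n\n" + body : body;
}

// In a browser context the input would come from something like:
//   const article = new Readability(document.cloneNode(true)).parse();
// (cloning first, because Readability modifies the DOM it is given).
```

The output still has to be cut down to fit the message limit, which is the harder part.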

After reading other issues, I think that if you split the web-access process into several steps/prompts, you might be able to avoid that limit by sending the results in multiple messages.
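As a rough sketch of what "multiple messages" could look like, the function below splits a long text into chunks that stay under an approximate token budget. The 4-characters-per-token ratio is only a crude assumption for English text, and `splitIntoChunks` is a hypothetical helper, not part of the extension.

```typescript
// Assumed heuristic: roughly 4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

// Split text into chunks whose length stays under maxTokens * CHARS_PER_TOKEN.
// Paragraphs are kept together when they fit; oversized ones are hard-split.
function splitIntoChunks(text: string, maxTokens: number): string[] {
  const maxChars = Math.max(1, maxTokens * CHARS_PER_TOKEN);
  const paragraphs = text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const paragraph of paragraphs) {
    // Walk the paragraph in pieces of at most maxChars characters.
    for (let start = 0; start < paragraph.length; start += maxChars) {
      const piece = paragraph.slice(start, start + maxChars);
      const candidate = current.length === 0 ? piece : current + "\n\n" + piece;
      if (candidate.length <= maxChars) {
        current = candidate; // still fits in the current chunk
      } else {
        chunks.push(current); // current is non-empty here, so flush it
        current = piece;
      }
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

A real implementation would want an actual tokenizer instead of a character count, but the chunking logic would stay the same.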

A couple of days ago, I tried something like this:

You are now in "Text Ingestion Mode".

When I send you a message, reply with '...'.
If I send you the string "EOM", exit "Text Ingestion Mode".

At first it worked and responded with "..." after the first text to ingest, but after the second text it jumped to conclusions and tried to give an opinion about the text I provided. I think this is just a problem with the initial prompt, and after some tweaks it should work most of the time.

So what I imagine is:

User:
{prompt}

GPT:
Welcome to WebGPT [...] How can I help you?

User:
{query}

GPT:
SEARCH: {gpt_generated_query}

User:
Result 1/5
{content of first result's site}

GPT:
Next result.

User:
Result 2/5
{content of second result's site}

GPT:
Next result.

User:
Result 3/5
{content of third result's site}

GPT:
Next result.

User:
Result 4/5
{content of fourth result's site}

GPT:
Next result.

User:
Result 5/5
{content of fifth result's site}

GPT:
{gpt_answer_to_query}
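
The flow above could be driven by something like the following sketch. It builds the "Result i/n" user messages and checks whether a reply is just the expected acknowledgement, so the caller can notice when the model jumps ahead and starts answering early. `SearchResult`, `buildResultMessages`, and `isAcknowledgement` are all hypothetical names, and the exact acknowledgement text depends on whatever the initial prompt asks for.

```typescript
// Hypothetical shape of one fetched search result.
interface SearchResult {
  url: string;
  content: string; // extracted page text
}

// Build one user message per result, labelled "Result i/n".
function buildResultMessages(results: SearchResult[]): string[] {
  const total = results.length;
  return results.map((r, i) => "Result " + (i + 1) + "/" + total + "\n" + r.content);
}

// True when the reply is only the agreed acknowledgement ("Next result."),
// ignoring case, surrounding whitespace, and a missing final period.
function isAcknowledgement(reply: string): boolean {
  return /^\s*next result\.?\s*$/i.test(reply);
}
```

If `isAcknowledgement` returns false for any reply before the last result, the caller could resend the instructions or abort, which would catch the failure I described with "Text Ingestion Mode".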

qunash commented 1 year ago

I'm already testing mozilla/readability for text extraction. If you'd like to try it, check out the serverless branch and build the extension from source. Then type /page:url in the prompt to extract the text from that URL.
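
For anyone curious how a command like this might be recognized in the prompt, here is a minimal sketch. It is not the code on the serverless branch; `parsePageCommand` is a hypothetical helper, and it assumes the URL directly follows `/page:` and uses http or https.

```typescript
// Extract the URL from a "/page:<url>" command at the start of a prompt.
// Returns null when the prompt does not start with a well-formed command.
function parsePageCommand(input: string): string | null {
  const match = /^\s*\/page:(https?:\/\/\S+)/i.exec(input);
  return match ? match[1] : null;
}
```

For example, `parsePageCommand("/page:https://example.com/post summarize this")` yields `"https://example.com/post"`, while an ordinary prompt yields `null` and would be handled as a normal search query.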