
I'm starting to use GitHub for work on my blog. Why not? It's got good communication and collaboration tools. Why not hook it up to a blog?

Seeding a chatbot with the archive of my blog #254

Open scripting opened 1 year ago

scripting commented 1 year ago

I wrote a blog post about my interest in using ChatGPT-like tech to access the knowledge in my blog.

This is a place to comment on that post. ;-)

OleEichhorn commented 1 year ago

As you solve the problem of how to use Scripting News archives as data underlying GPT-based chat tools, consider that yours is not a unique problem. There are many, many websites which have valuable information histories, and how great would it be if you could just point a tool at a website and use GPT-level queries against the information contained in it?

Not only blogs (and there are many, many blogs) but all kinds of other sites store information histories that would be great to use as data. Imagine using archive.org as the back-end database!

Will be watching this with great interest :)

waltzzz commented 1 year ago

I have a WordPress blog that I'd love to see in a corpus. Thx Dave, Ole, good thinking!

sethgodin1 commented 1 year ago

This is one of those things that's definitely going to happen, as surely as primordial creatures are going to evolve over time to develop claws and eyes.

The real question is: who cares enough to build a good one first?

scripting commented 1 year ago

Seth this belongs on your blog!

And I’m so glad we both see this. 😀

endlessforms01 commented 1 year ago

https://twitter.com/dsiroker/status/1638799931891920897?s=61&t=7Yuj_DQUxof8CC-tlS5Epg

craigschmidt commented 1 year ago

It sounds like the ChatGPT plugins will allow this sort of feature. You could try this:

https://www.theverge.com/2023/3/23/23653127/rewind-ai-chatgpt-for-me-chatbot-mac

sethgodin1 commented 1 year ago

Just installed Rewind. It's so clunky. But it'll get there.

Lindy pretends that it's reading your email and such, but it doesn't really.

Here's what I described a few weeks ago:

https://seths.blog/2023/02/the-unaware-snoop/

Not there yet.


scripting commented 1 year ago

I asked a question of ChatGPT:

"I have a blog with archives going back to 1994. I'd like to have that content loaded into ChatGPT so it can be part of the knowledge base. I am a JavaScript programmer. What's the easiest way to get a quick result?"

I understand the code, I think -- but I'm no closer to understanding what to do. How do I come up with the prompts? What if I have no prompts -- what's the result? They say to be careful of the cost; what is the cost based on?

I feel lost, like I did when I first saw what the web could do. I wanted to know: how do I set up a node on the net? The answers I got from those in the know made no sense to me, until they did.

bradbarrish commented 1 year ago

This might help https://github.com/openai/chatgpt-retrieval-plugin

jsavin commented 1 year ago

I learned the other day that OpenAI is opening their plugin framework to external developers gradually, rolling out via a waitlist: https://openai.com/blog/chatgpt-plugins

From what I can tell, plugins will do two things:

  1. For people with data to integrate with ChatGPT, they'll be able to make a plugin that makes that data available to the model. From the broadest perspective, a generalized RSS plugin might make sense here.
  2. On the user side, plugins can be enabled or disabled (again, from what I know so far) to turn different data integrations on or off.

I haven't looked too closely at the developer docs yet, but it may be possible to make an RSS plugin that enables ChatGPT to filter to data from a single feed, or from a set of feeds associated with a domain. And if an RSS plugin is possible, then an OPML plugin is most likely possible too. I imagine both of these are simpler than the Wolfram plugin they demo here: https://twitter.com/OpenAI/status/1638952876281335813

I've been chatting with some folks about the broader implications of connecting LLMs to external applications. My gut tells me this is going to be a very powerful architecture, and that we have no idea what kinds of things might be unlocked by this type of integration (both good and bad IMHO).

Newman5 commented 1 year ago

I think Fireship.io talked about this here - https://www.youtube.com/watch?v=mpnh1YTT66w

jsavin commented 1 year ago

Off-topic warning – feel free to ignore/skip…

Parts of that video are a little much -- for example, the talk of ChatGPT spawning additional LLM child instances to delegate work to. That would violate some safety principles, in that it allows AIs to extend their own power. (There are already examples of AI engines exhibiting non-designed "power-seeking" behavior in order to fulfill their designed goals.)

As I understand it, OpenAI is currently limiting GPT's access to external APIs to read-only requests. It's a little worrying that there could be unintended side effects here: for example, Zapier lets people create functionality that's triggered via GET requests but has the side effect of making a write call to some other system. In theory this is an avenue by which GPT could escape the controls OpenAI has in place, with the help of a human actor (intentional or unintentional).

More broadly, there are scores of LLMs out there, and even if OpenAI does all the right things, it's pretty likely that over the next few years bad actors will do dangerous things with AI models. It's not completely outside the realm of possibility that AIs could cause a lot of damage -- most of it unintentional, but some intentional as well. (Imagine malevolent nation states or terror groups using AIs to generate propaganda, or instructing coding AIs to attack systems.)

jsavin commented 1 year ago

Here's a help article from OpenAI about accessing plugins, with links to developer docs: https://help.openai.com/en/articles/7183286-chatgpt-plugins

BenAtWide commented 1 year ago

If you want an alternative to the official ChatGPT plugin framework, I think the LangChain project could be the most promising set of tools to get this up and running. It aims to connect LLMs to other sorts of data (APIs, documents), with code for creating your own chatbots and Q&A interfaces.

This blog post talks about using it with a corpus of documents:

Ever since ChatGPT came out, people have been building a personalized ChatGPT for their data. We even wrote a tutorial on this, and then ran a competition about this a few months ago. The desire and demand for this highlights an important limitation of ChatGPT - it doesn't know about YOUR data, and most people would find it more useful if it did. So how do you go about building a chatbot that knows about your data?

I have played very briefly with the Python libraries to connect to an API (although not yet on a document corpus); there is a Node/TypeScript implementation as well.
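The pattern all these tools (LangChain, LlamaIndex, the retrieval plugin) implement is retrieval augmentation: chunk the corpus, index it, pull out the chunks most similar to the question, and hand those to the LLM as context. A toy sketch of the retrieval step, with bag-of-words overlap standing in for real embeddings (production systems use embedding vectors and a vector store; the example texts are made up):

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count, punctuation stripped."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]

chunks = [
    "RSS lets readers follow many blogs in one place.",
    "The weather today is sunny and mild.",
]
print(retrieve("Why do people like RSS?", chunks))
```

The retrieved chunks then get pasted into the prompt ahead of the question, which is how the chatbot "knows about your data" without retraining the model.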

scripting commented 1 year ago

Thanks everyone for all the comments! I haven't had a chance to review them yet, but I will.

In the meantime, I just got access to Bard and asked it how I can add my own data to its dataset, and it sounds pretty straightforward.

[image: screenshot of Bard's reply]

scripting commented 1 year ago

Then I asked what if I have thousands of files? This is what it said.

[image: screenshot of Bard's reply]

sethgodin1 commented 1 year ago

that's great news Dave

can't wait to see if it works!


donpark commented 1 year ago

The ChatGPT Plugin API is relevant, but it's not yet generally available.

Meanwhile, LlamaIndex (aka GPT Index) should do what you need. Check out this how-to post: https://medium.datadriveninvestor.com/querying-external-kbs-through-gpt-gpt-index-llamaindex-fd8cbad2a4c

As for loading the blog archive: I don't yet see an OPML loader, but there is an RSS loader. See: https://llamahub.ai/l/web-rss

If you need to go beyond what LlamaIndex does, LangChain should extend your reach at the cost of a small increase in complexity. YMMV.

> what if I have thousands of files?

Point LlamaIndex's SimpleDirectoryReader at the directory containing the files and pass the result to an indexer.

scripting commented 1 year ago

Here's a zipped archive of twelve JSON files covering all of the scripting.com blog for 2022.

https://scripting.com/publicfolder/chatgpt/scriptingNewsSource/2022.zip

I followed the instructions Bard gave me, but they tell you to run a Python app without saying where to get it.

I'll keep looking for an easy way to upload those files. :smile:

donpark commented 1 year ago

I indexed the above JSON files using LlamaIndex as-is, without any OPML-specific processing.

Indexing

code:

```python
from llama_index import GPTTreeIndex, SimpleDirectoryReader

# Load every file in the folder, then build a tree index over the documents
documents = SimpleDirectoryReader('opmltextfiles').load_data()
index = GPTTreeIndex(documents)
index.save_to_disk('index_tree.json')
```

Took nearly 30mins to index 1.4M tokens from your 2022 blog posts.

output:

```
INFO:llama_index.indices.common.tree.base:> Building index from nodes: 346 chunks
INFO:llama_index.indices.common.tree.base:> Building index from nodes: 34 chunks
INFO:llama_index.indices.common.tree.base:> Building index from nodes: 3 chunks
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total LLM token usage: 1391651 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_documents] Total embedding token usage: 0 tokens
```
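That token count is what the cost is based on: OpenAI bills per token. A back-of-envelope sketch, assuming the $0.002-per-1K-token gpt-3.5-turbo rate current at the time (the actual model and rate used for this run aren't stated, so treat the figure as illustrative only):

```python
# Rough cost estimate for the indexing run above.
# ASSUMPTION: $0.002 per 1K tokens (gpt-3.5-turbo pricing, early 2023);
# the actual model and rate used for this run aren't stated in the thread.
LLM_TOKENS = 1_391_651   # from the token_counter log line above
RATE_PER_1K = 0.002      # USD per 1,000 tokens (assumed)

cost = LLM_TOKENS / 1000 * RATE_PER_1K
print(f"Estimated indexing cost: ${cost:.2f}")  # prints: Estimated indexing cost: $2.78
```

A couple of dollars is consistent with the "few cups of coffee" figure below; at the older davinci rate of $0.02 per 1K tokens it would have been roughly ten times that.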

Querying

code:

```python
from llama_index import GPTTreeIndex

# Reload the saved index and run a natural-language query against it
index = GPTTreeIndex.load_from_disk('index_tree.json')
index.query("What's so great about RSS?")
```

output:

```
INFO:llama_index.indices.query.tree.leaf_query:> Starting query: What's so great about RSS?
INFO:llama_index.indices.query.tree.leaf_query:>[Level 0] Selected node: [4]/[4]
INFO:llama_index.indices.query.tree.leaf_query:>[Level 1] Selected node: [3]/[3]
INFO:llama_index.indices.query.tree.leaf_query:>[Level 2] Selected node: [3]/[3]
INFO:llama_index.indices.query.tree.leaf_query:>[Level 3] Selected node: [4]/[4]
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 7827 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens
```

answer:

RSS is a great way to stay up to date with the latest news and information. It allows users to quickly and easily search for topics of interest and sort the results by title, how recently they've been updated, and the number of people subscribing to the feed. This makes it easy to find the most relevant and up-to-date information quickly and easily.

Not bad. You should get better answers with an OPML-aware loader. Total cost of this little ad-hoc exercise was a few cups of coffee and 30 minutes. The index_tree.json file came out to 3.8MB. Given that zero data preparation was done and the loader only saw the files as JSON, not OPML, I don't think the file is of much use -- but let me know and I'll upload it here for you to query against.
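Since an OPML-aware loader doesn't exist yet, one cheap substitute is to flatten each OPML file into indented plain text before indexing, so the model sees the outline structure rather than raw XML. A minimal sketch (the function name, indentation scheme, and sample outline are my own choices for illustration):

```python
import xml.etree.ElementTree as ET

def opml_to_text(opml_source: str) -> str:
    """Flatten an OPML outline to indented plain text.

    Each <outline> node's `text` attribute becomes one line,
    indented two spaces per nesting level.
    """
    root = ET.fromstring(opml_source)
    lines = []

    def walk(node, depth):
        for child in node.findall("outline"):
            lines.append("  " * depth + child.get("text", ""))
            walk(child, depth + 1)

    body = root.find("body")
    if body is not None:
        walk(body, 0)
    return "\n".join(lines)

sample = """<opml version="2.0">
  <head><title>demo</title></head>
  <body>
    <outline text="Morning post">
      <outline text="RSS is still the best way to follow blogs."/>
    </outline>
  </body>
</opml>"""
print(opml_to_text(sample))
```

The flattened text files can then be fed to SimpleDirectoryReader exactly as in the indexing run above.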

UPDATE: Uploaded it anyway.

index_tree.json.zip

donpark commented 1 year ago

There is a bit of complication with setup and install:

  1. You'll need an OpenAI API key, and to set the env vars OPENAI_ORG_ID and OPENAI_API_KEY.
  2. Use Python 3.10.10.
  3. pip install langchain llama-index
  4. Package the query code from above and run it after changing the query text.

scripting commented 1 year ago

@donpark -- thanks!

I'm not sure what to do with the .zip file you uploaded.

Can we work together on this? I'll cover expenses, and your time. Just like we did back in 1988 or whatever. :smile:

I really want to see what it's like to be able to ask questions about all my content going back to 1994. I have all the archives. I can also transform the files to other JSON formats, and am ready to do the work.

We can switch over to private email -- dave.winer@gmail.com, as always.

scripting commented 1 year ago

I uploaded 2022-2018 and the second half of 2017. Here's the complete list of files:

http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml/2023.zip
http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml/2022.zip
http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml/2021.zip
http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml/2020.zip
http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml/2019.zip
http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml/2018.zip
http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml/2017part2.zip

I'll do the rest tomorrow. I have files going back to the mid-90s. Not quite as well organized, but close.

I can transform these to another format, if needed.
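For anyone wanting to pull those archives down locally, a small fetch sketch; the base URL and year names are taken from the list above (2017 is the odd one out with its "part2" suffix), and the download step needs network access:

```python
import io
import urllib.request
import zipfile
from pathlib import Path

# Base URL and year names come from the archive list above
BASE = "http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml"

def archive_urls(names):
    """Build the archive URLs for the given year/part names."""
    return [f"{BASE}/{name}.zip" for name in names]

def fetch_and_extract(url, dest="archives"):
    """Download one zip archive and unpack it under `dest`."""
    with urllib.request.urlopen(url) as resp:
        zipfile.ZipFile(io.BytesIO(resp.read())).extractall(Path(dest))

urls = archive_urls(["2023", "2022", "2021", "2020", "2019", "2018", "2017part2"])
print(urls[0])  # prints: http://scripting.com/publicfolder/chatgpt/scriptingNewsSource/opml/2023.zip

# The network step itself, not run here:
# for url in urls:
#     fetch_and_extract(url)
```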

donpark commented 1 year ago

Re working together: unfortunately, work and ongoing side projects leave no time to spare. The above took only a few minutes to set up, and I let it run while doing other stuff. As for API cost, it's minor, and jumbled in with my own usage. Sorry about the unnecessary technical details -- I thought you wanted to build something to add to your servers. When and if you want to do that, LangChain.js should help; it does what LlamaIndex does and more, in JavaScript.

If Q&A against your blog content is all you need, then you should use third-party services. They're popping up like rabbits. Here's one that I saw just the other day: https://relevanceai.com/blog/use-all-of-your-blog-posts-and-docs-as-context-for-chatgpt-style-questions-and-answers

I'm of course available async if you have questions.

UPDATE: BTW, Simon Willison (@simonw on Twitter) has been doing a lot of work with ChatGPT, so he's likely the best person to work with. Here's his latest post: https://simonwillison.net/2023/Mar/24/datasette-chatgpt-plugin/

scripting commented 1 year ago

I've been working with Automattic on a new chatbot for Scripting News.

https://github.com/scripting/Scripting-News/issues/267