Closed simonw closed 1 year ago
That demo is deployed from https://github.com/simonw/simonwillisonblog-backup
For the first version I'll use the DB I already uploaded to https://gist.githubusercontent.com/simonw/30954443717f770d5cb9c0219bee3d9b/raw/b449c12b6cf146721afb23762316dbe5c42c11c0/blog.db (24MB). I'll download it again and use sqlite-utils to copy the embeddings over to the simonwillisonblog.db database.
This creates the table:
sqlite-utils create-table simonwillisonblog.db blog_entry_embeddings \
id integer embedding blob --pk id --ignore
And this populates it:
sqlite-utils simonwillisonblog.db --attach embeddings blog.db \
'replace into blog_entry_embeddings select cast(id as integer), embedding from embeddings.embeddings'
Using replace into means it won't throw an error if a row with that id already exists; the existing row is overwritten instead.
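A rough sketch of why the copy step is safe to re-run, using Python's built-in sqlite3 module instead of the sqlite-utils CLI. Both databases are in-memory stand-ins, the rows are made up, and the text type of the source id column is an assumption suggested by the cast:

```python
import sqlite3

# Main database: an in-memory stand-in for simonwillisonblog.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table if not exists blog_entry_embeddings "
    "(id integer primary key, embedding blob)"
)

# Attached database: an in-memory stand-in for the downloaded blog.db.
conn.execute("attach database ':memory:' as embeddings")
conn.execute("create table embeddings.embeddings (id text, embedding blob)")
conn.executemany(
    "insert into embeddings.embeddings values (?, ?)",
    [("1", b"\x00\x01"), ("2", b"\x02\x03")],  # made-up rows for illustration
)

copy_sql = (
    "replace into blog_entry_embeddings "
    "select cast(id as integer), embedding from embeddings.embeddings"
)

# Running the copy twice neither raises nor duplicates rows, because
# "replace into" overwrites any row whose primary key already exists.
for _ in range(2):
    conn.execute(copy_sql)
conn.commit()

row_count = conn.execute(
    "select count(*) from blog_entry_embeddings"
).fetchone()[0]
print(row_count)
```

With a plain insert into, the second pass would fail on the primary key constraint.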
I had to click this: https://github.com/simonw/simonwillisonblog-backup/actions/workflows/backup.yml
It works!
Needs a real OpenAI token for that to return results though.
Here's a demo query that shows top related entries based on a blog entry ID: https://datasette.simonwillison.net/simonwillisonblog?sql=with+original+as+(%0D%0A++select+embedding+from+blog_entry_embeddings+where+id+%3D+%3Aid%0D%0A)%2C%0D%0Atop_10+as+(%0D%0A++select+id%2C%0D%0A++openai_embedding_similarity(original.embedding%2C+blog_entry_embeddings.embedding)+as+score+%0D%0A++from+blog_entry_embeddings%2C+original%0D%0A++where+id+!%3D+%3Aid%0D%0A++order+by+score+desc%0D%0A++limit+10%0D%0A)%0D%0Aselect+top_10.score%2C+blog_entry.*%0D%0Afrom+top_10+join+blog_entry+on+top_10.id+%3D+blog_entry.id&id=8000
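For readers without access to the plugin function, here is a minimal sketch of the same query shape run locally. It assumes the embedding blobs are packed little-endian float32 values and that similarity means cosine similarity. The `cosine_similarity` SQL function, the `encode`/`decode` helpers and the toy two-dimensional vectors are all hypothetical stand-ins, not the actual `openai_embedding_similarity` implementation:

```python
import math
import sqlite3
import struct


def encode(vector):
    """Pack floats as little-endian float32 (the blob layout assumed here)."""
    return struct.pack("<" + "f" * len(vector), *vector)


def decode(blob):
    """Unpack a blob of little-endian float32 values into a tuple of floats."""
    return struct.unpack("<" + "f" * (len(blob) // 4), blob)


def cosine_similarity(blob_a, blob_b):
    """Cosine similarity between two encoded vectors; 0.0 if either is all zeros."""
    a, b = decode(blob_a), decode(blob_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the plugin's similarity function.
conn.create_function("cosine_similarity", 2, cosine_similarity)
conn.execute(
    "create table blog_entry_embeddings (id integer primary key, embedding blob)"
)
conn.executemany(
    "insert into blog_entry_embeddings values (?, ?)",
    [
        (1, encode([1.0, 0.0])),  # toy vectors, not real embeddings
        (2, encode([0.9, 0.1])),
        (3, encode([0.0, 1.0])),
    ],
)

# Same structure as the demo query: look up one entry's embedding, then
# score every other entry against it and keep the highest scores.
related = conn.execute(
    """
    with original as (
      select embedding from blog_entry_embeddings where id = :id
    )
    select id,
      cosine_similarity(original.embedding, blog_entry_embeddings.embedding) as score
    from blog_entry_embeddings, original
    where id != :id
    order by score desc
    limit 10
    """,
    {"id": 1},
).fetchall()

for entry_id, score in related:
    print(entry_id, round(score, 4))
```

Entry 2 points in nearly the same direction as entry 1, so it ranks above entry 3, which is orthogonal to it.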
Here's a demo that doesn't need an API key - the first page runs a regular FTS search for a term and returns the top items with links to another page:
This query totally worked too:
with query as (
  select
    openai_embedding(:query, :token) as q
),
top_n as (
  select
    id,
    openai_embedding_similarity(query.q, embedding) as score
  from
    blog_entry_embeddings, query
  order by
    score desc
  limit
    5
),
content as (
  select
    blog_entry.id,
    blog_entry.title,
    substr(blog_entry.body, 0, 3000) as content,
    top_n.score
  from
    blog_entry
    join top_n on blog_entry.id = top_n.id
  order by
    score desc
)
select openai_davinci(group_concat(content, ' ') || '
----
Given the above content, answer the following question: ' || :query, 256, 0.7, :token)
as response from content
It takes the top 5 results by semantic similarity, concatenates together their content (the first 3000 characters of each entry) into a chunk of text, then adds that prompt at the end - and it can then answer questions based on the content of my blog.
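The prompt assembly step described above can be sketched in plain Python. This is an illustrative reconstruction only: `build_prompt` and its parameters are hypothetical names, and it assumes the entries arrive already ranked by similarity. It mirrors the SQL by joining the truncated bodies with a single space and appending the same instruction text:

```python
PROMPT_SUFFIX = "\n----\nGiven the above content, answer the following question: "


def build_prompt(entry_bodies, question, max_chars=3000):
    """Join truncated entry bodies and append the question, as the SQL does.

    entry_bodies: entry texts, assumed to be sorted by similarity already.
    max_chars: how much of each entry to keep (3000 in the query above).
    """
    chunks = [body[:max_chars] for body in entry_bodies]
    context = " ".join(chunks)  # group_concat(content, ' ') in the SQL
    return context + PROMPT_SUFFIX + question


# Made-up entries and question, for illustration only.
example_prompt = build_prompt(
    ["First entry body. " * 300, "Second entry body."],
    "What is the second entry about?",
)
print(len(example_prompt))
```

Truncating each entry before concatenation keeps the combined context bounded, which matters because the completion model accepts a limited amount of input text.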
I'm going to deploy a live demo to my https://datasette.simonwillison.net/ instance, along with a copy of the embeddings table I built while trying this out: