Deploy a live demo - Githubissues

simonw commented 1 year ago

I'm going to deploy a live demo to my https://datasette.simonwillison.net/ instance, along with a copy of the embeddings table I built while trying this out:

https://github.com/simonw/openai-to-sqlite/issues/1

simonw commented 1 year ago

That demo is deployed from https://github.com/simonw/simonwillisonblog-backup

simonw commented 1 year ago

First version I'll use the DB I already uploaded to https://gist.githubusercontent.com/simonw/30954443717f770d5cb9c0219bee3d9b/raw/b449c12b6cf146721afb23762316dbe5c42c11c0/blog.db (24MB) - I'll download it again and use sqlite-utils to copy the embeddings over to the simonwillisonblog.db database.

simonw commented 1 year ago

This creates the table:

sqlite-utils create-table simonwillisonblog.db blog_entry_embeddings \
  id integer embedding blob --pk id --ignore

And this populates it:

sqlite-utils simonwillisonblog.db --attach embeddings blog.db \
  'replace into blog_entry_embeddings select cast(id as integer), embedding from embeddings.embeddings'

The replace into means it won't throw an error if the row already exists.

simonw commented 1 year ago

I had to click this: https://github.com/simonw/simonwillisonblog-backup/actions/workflows/backup.yml

simonw commented 1 year ago

It works!

https://datasette.simonwillison.net/simonwillisonblog?sql=with+query+as+%28%0D%0A++select%0D%0A++++openai_embedding%28%3Aquery%2C+%3Atoken%29+as+q%0D%0A%29%2C%0D%0Atop_n+as+%28%0D%0A++select%0D%0A++++id%2C%0D%0A++++openai_embedding_similarity%28query.q%2C+embedding%29+as+score%0D%0A++from%0D%0A++++blog_entry_embeddings%2C+query%0D%0A++order+by%0D%0A++++score+desc%0D%0A++limit%0D%0A++++10%0D%0A%29%0D%0Aselect%0D%0A++blog_entry.id%2C%0D%0A++blog_entry.title%2C%0D%0A++top_n.score%0D%0Afrom%0D%0A++blog_entry%0D%0A++join+top_n+on+blog_entry.id+%3D+top_n.id%0D%0Aorder+by%0D%0A++score+desc%0D%0Alimit+10&query=prompt+injection&token=sk-

Needs a real OpenAI token for that to return results though.

simonw commented 1 year ago

Here's a demo query that shows top related entries based on a blog entry ID: https://datasette.simonwillison.net/simonwillisonblog?sql=with+original+as+(%0D%0A++select+embedding+from+blog_entry_embeddings+where+id+%3D+%3Aid%0D%0A)%2C%0D%0Atop_10+as+(%0D%0A++select+id%2C%0D%0A++openai_embedding_similarity(original.embedding%2C+blog_entry_embeddings.embedding)+as+score+%0D%0A++from+blog_entry_embeddings%2C+original%0D%0A++where+id+!%3D+%3Aid%0D%0A++order+by+score+desc%0D%0A++limit+10%0D%0A)%0D%0Aselect+top_10.score%2C+blog_entry.*%0D%0Afrom+top_10+join+blog_entry+on+top_10.id+%3D+blog_entry.id&id=8000

simonw commented 1 year ago

Here's a demo that doesn't need an API key - the first page runs a regular FTS search for a term and returns the top items with links to another page:

https://datasette.simonwillison.net/simonwillisonblog?sql=select%0D%0A++blog_entry.id%2C%0D%0A++blog_entry.title%2C%0D%0A++%27https%3A%2F%2Fdatasette.simonwillison.net%2Fsimonwillisonblog%3Fsql%3Dwith%2Boriginal%2Bas%2B%2528%250D%250A%2B%2Bselect%250D%250A%2B%2B%2B%2Bembedding%250D%250A%2B%2Bfrom%250D%250A%2B%2B%2B%2Bblog_entry_embeddings%250D%250A%2B%2Bwhere%250D%250A%2B%2B%2B%2Bid%2B%253D%2B%253Arelated_to_entry_id%250D%250A%2529%252C%250D%250Atop_10%2Bas%2B%2528%250D%250A%2B%2Bselect%250D%250A%2B%2B%2B%2Bid%252C%250D%250A%2B%2B%2B%2Bopenai_embedding_similarity%2528%250D%250A%2B%2B%2B%2B%2B%2Boriginal.embedding%252C%250D%250A%2B%2B%2B%2B%2B%2Bblog_entry_embeddings.embedding%250D%250A%2B%2B%2B%2B%2529%2Bas%2Bscore%250D%250A%2B%2Bfrom%250D%250A%2B%2B%2B%2Bblog_entry_embeddings%252C%250D%250A%2B%2B%2B%2Boriginal%250D%250A%2B%2Bwhere%250D%250A%2B%2B%2B%2Bid%2B%2521%253D%2B%253Arelated_to_entry_id%250D%250A%2B%2Border%2Bby%250D%250A%2B%2B%2B%2Bscore%2Bdesc%250D%250A%2B%2Blimit%250D%250A%2B%2B%2B%2B10%250D%250A%2529%250D%250Aselect%250D%250A%2B%2Btop_10.score%252C%250D%250A%2B%2Bblog_entry.title%252C%250D%250A%2B%2Bsubstr%2528blog_entry.body%252C%2B0%252C%2B300%2529%2Bas%2Bextract%252C%250D%250A%2B%2B%2527https%253A%252F%252Fsimonwillison.net%252Fe%252F%2527%2B%257C%257C%2Bblog_entry.id%2Bas%2Burl%250D%250Afrom%250D%250A%2B%2Btop_10%250D%250A%2B%2Bjoin%2Bblog_entry%2Bon%2Btop_10.id%2B%253D%2Bblog_entry.id%26_hide_sql%3D1%26related_to_entry_id%3D%27+%7C%7C+blog_entry.id+as+related%0D%0Afrom%0D%0A++blog_entry%0D%0A++join+blog_entry_fts+on+blog_entry.rowid+%3D+blog_entry_fts.rowid%0D%0Awhere%0D%0A++blog_entry_fts+match+%3Aq&q=prompt+injection&_hide_sql=1

That other page then shows the OpenAI embedding similarity top 10 related entries, e.g.: https://datasette.simonwillison.net/simonwillisonblog?sql=with+original+as+%28%0D%0A++select%0D%0A++++embedding%0D%0A++from%0D%0A++++blog_entry_embeddings%0D%0A++where%0D%0A++++id+%3D+%3Arelated_to_entry_id%0D%0A%29%2C%0D%0Atop_10+as+%28%0D%0A++select%0D%0A++++id%2C%0D%0A++++openai_embedding_similarity%28%0D%0A++++++original.embedding%2C%0D%0A++++++blog_entry_embeddings.embedding%0D%0A++++%29+as+score%0D%0A++from%0D%0A++++blog_entry_embeddings%2C%0D%0A++++original%0D%0A++where%0D%0A++++id+%21%3D+%3Arelated_to_entry_id%0D%0A++order+by%0D%0A++++score+desc%0D%0A++limit%0D%0A++++10%0D%0A%29%0D%0Aselect%0D%0A++top_10.score%2C%0D%0A++blog_entry.title%2C%0D%0A++substr%28blog_entry.body%2C+0%2C+300%29+as+extract%2C%0D%0A++%27https%3A%2F%2Fsimonwillison.net%2Fe%2F%27+%7C%7C+blog_entry.id+as+url%0D%0Afrom%0D%0A++top_10%0D%0A++join+blog_entry+on+top_10.id+%3D+blog_entry.id&_hide_sql=1&related_to_entry_id=8189

simonw commented 1 year ago

This query totally worked too:

with query as (
  select
    openai_embedding(:query, :token) as q
),
top_n as (
  select
    id,
    openai_embedding_similarity(query.q, embedding) as score
  from
    blog_entry_embeddings, query
  order by
    score desc
  limit
    5
),
content as (select
  blog_entry.id,
  blog_entry.title,
  substr(blog_entry.body, 0, 3000) as content,
  top_n.score
from
  blog_entry
  join top_n on blog_entry.id = top_n.id
order by
  score desc
)
select openai_davinci(group_concat(content, ' ') || '
----
Given the above content, answer the following question: ' || :query, 256, 0.7, :token)
as response from content

It takes the top 5 results by semantic similarity, concatenates together their content (the first 3000 characters of each entry) into a chunk of text, then adds that prompt at the end - and it can then answer questions based on the content of my blog.

simonw / datasette-openai

Deploy a live demo #2