jasonjmcghee / rem

An open source approach to locally record and enable searching everything you view on your Mac.
https://rem.ing
MIT License

Add embedding search #17

Open · jasonjmcghee opened this issue 10 months ago

jasonjmcghee commented 10 months ago

rem should index all text via an embedding store.

We could use something like https://github.com/asg017/sqlite-vss

If we go this route we should fork / open a PR to add the extension https://github.com/stephencelis/SQLite.swift/tree/3d25271a74098d30f3936d84ec1004d6b785d6cd/Sources/SQLite/Extensions

This way we can search without needing verbatim matches.

We'll need to see what the RAM footprint and insertion time are.
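For a sense of what that could look like end to end, here's a minimal sketch through SQLite.swift, assuming extension loading gets exposed (which is exactly what the fork / PR would add). The table name, the 384-dim size, and the blob encoding are placeholders, not anything sqlite-vss or rem prescribes:

```swift
import Foundation
import SQLite

// Sketch only: assumes SQLite.swift gains an API for loading runtime
// extensions (the subject of the fork / PR mentioned above).
let db = try Connection("rem.sqlite")
// try db.loadExtension("vss0")  // hypothetical; not in SQLite.swift today

// One row per frame of OCR'd text; 384 dims would match e.g. gte-small.
try db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS vss_allText USING vss0(embedding(384))")

// sqlite-vss accepts vectors as raw little-endian float32 blobs.
func blob(_ vector: [Float]) -> SQLite.Blob {
    vector.withUnsafeBufferPointer { SQLite.Blob(bytes: [UInt8](Data(buffer: $0))) }
}

func insert(_ embedding: [Float], forFrame rowid: Int64) throws {
    try db.run("INSERT INTO vss_allText(rowid, embedding) VALUES (?, ?)",
               rowid, blob(embedding))
}

// Nearest-neighbor search: no verbatim text match required.
func nearest(to query: [Float], k: Int = 10) throws -> [(rowid: Int64, distance: Double)] {
    try db.prepare("SELECT rowid, distance FROM vss_allText WHERE vss_search(embedding, ?) LIMIT \(k)",
                   blob(query))
        .map { (rowid: $0[0] as! Int64, distance: $0[1] as! Double) }
}
```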


More out-of-the-box solutions appear to be available now:

https://github.com/ashvardanian/SwiftSemanticSearch

We'd need to see how long insertion / index updates take, but this seems super promising.

jasonjmcghee commented 10 months ago

Logged an issue over in the SQLite.swift repo, but that doesn't mean we can't fork it / add support / open a PR there to fulfill the issue! https://github.com/stephencelis/SQLite.swift/issues/1232

jasonjmcghee commented 10 months ago

Exploring embedding generation from Swift… a good candidate seems to be using candle (Rust) with a sentence transformer and building a binary that takes in text and outputs embeddings.

Or explore CoreML and look into transformers / ONNX conversion.

jasonjmcghee commented 10 months ago

I'm really bad at C bindings stuff, but I tried to put together a candle text -> embeddings binary that we can talk to via FFI:

https://github.com/jasonjmcghee/rust_embedding_lib
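For anyone following along, the Swift side of that FFI could look roughly like the sketch below. The symbol name and signature are guesses, not rust_embedding_lib's actual exported API:

```swift
import Foundation

// Hypothetical C ABI exported by the Rust cdylib (the real export in
// rust_embedding_lib may differ): caller passes UTF-8 text plus a
// pre-allocated float buffer, and gets back the number of floats written.
@_silgen_name("embed_text")
func embed_text(_ text: UnsafePointer<CChar>,
                _ out: UnsafeMutablePointer<Float>,
                _ capacity: Int32) -> Int32

func embedding(for text: String, dimensions: Int = 384) -> [Float]? {
    var out = [Float](repeating: 0, count: dimensions)
    let written = text.withCString { cText in
        out.withUnsafeMutableBufferPointer { buffer in
            embed_text(cText, buffer.baseAddress!, Int32(dimensions))
        }
    }
    return written == Int32(dimensions) ? out : nil
}
```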

roblg commented 10 months ago

From rust_embedding_lib's README.md:

Am I crazy not to use https://github.com/huggingface/swift-transformers?

You might be. :) (edit: although it doesn't seem like there's a ton actually present in that library right now.) I was noodling on this and was prepared to try to embed a Python interpreter into this binary to get access to the whole ecosystem of Python modules there; I didn't realize Swift was an option. (Also, the idea of embedding a Python interpreter into something seems kind of insane, so I just wanted to try it.)

Do you have an idea of which embedding model you want to use for search? I've played with a couple of other projects that defaulted to bge-small-en-v1.5 (#15) or all-mpnet-base-v2 (#45) on the HF leaderboard: https://huggingface.co/spaces/mteb/leaderboard

Both are pretty small, and "seem" good for RAG based on the limited poking I've done with them. I've never tried to use them outside of Python, though.

edit: n/m, I see gte-small in the rust project. That's #22 on the leaderboard!

jasonjmcghee commented 10 months ago

gte-small feels like a good balance between quality and size from manual experimentation, but I'm totally open to suggestions and/or to making it so people can use whatever they want.

roblg commented 10 months ago

It looks like somebody already posted a CoreML conversion of gte-small: https://huggingface.co/thenlper/gte-small/tree/main/coreml/feature-extraction/float32_model.mlpackage

I have no experience w/ this, so I don't know if that's a format we can use, but I found it while researching conversion options.

I also found https://github.com/huggingface/exporters, but they don't appear to support embedding models (plus, I tried to do the conversion using their tool and it fails a validation step because some math comes up with NaN).
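In case it's useful, running that .mlpackage from Swift could look roughly like the sketch below. The feature names (input_ids, attention_mask, last_hidden_state) are assumptions about the converted model, tokenization is out of scope, and the .mlpackage would first need compiling to .mlmodelc (Xcode does this automatically):

```swift
import CoreML

// Sketch only: feature names are guesses about the converted model.
func gteSmallEmbedding(inputIds: [Int32], attentionMask: [Int32]) throws -> [Float] {
    let url = Bundle.main.url(forResource: "float32_model", withExtension: "mlmodelc")!
    let model = try MLModel(contentsOf: url)

    let n = inputIds.count
    let ids = try MLMultiArray(shape: [1, NSNumber(value: n)], dataType: .int32)
    let mask = try MLMultiArray(shape: [1, NSNumber(value: n)], dataType: .int32)
    for i in 0..<n {
        ids[i] = NSNumber(value: inputIds[i])
        mask[i] = NSNumber(value: attentionMask[i])
    }

    let input = try MLDictionaryFeatureProvider(dictionary: [
        "input_ids": MLFeatureValue(multiArray: ids),
        "attention_mask": MLFeatureValue(multiArray: mask),
    ])
    let hidden = try model.prediction(from: input)
        .featureValue(for: "last_hidden_state")!.multiArrayValue!

    // Mean-pool the per-token vectors into one sentence embedding.
    let dims = hidden.shape.last!.intValue
    var pooled = [Float](repeating: 0, count: dims)
    var tokens: Float = 0
    for t in 0..<n where attentionMask[t] == 1 {
        tokens += 1
        for d in 0..<dims {
            pooled[d] += hidden[[0, NSNumber(value: t), NSNumber(value: d)]].floatValue
        }
    }
    return pooled.map { $0 / max(tokens, 1) }
}
```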

jasonjmcghee commented 10 months ago

Theoretically, what I built should work; we just need to build the Swift framework.

roblg commented 10 months ago

I guess that's a question I should have asked initially: is the FFI bridge + Rust lib the way you'd prefer to go? Or something more native, like CoreML?

jasonjmcghee commented 10 months ago

😅 The Rust embeddings approach means any safetensors model with a config and tokenizer should work, which feels like a very good thing. But if you can get CoreML working, that's awesome. I did notice the CoreML models were strangely large, like double the size for gte-small.

roblg commented 10 months ago

The Rust embeddings approach means any safetensors model with a config and tokenizer should work

Agreed. The "run anything on the internet" aspect was one of the reasons I felt like my awful embed-Python approach could almost be justifiable. I'm agnostic re: Rust lib vs CoreML, just having fun soaking all this stuff up. For my own entertainment I'll probably throw up a branch on my fork illustrating the CoreML approach, but I've got no attachment to it. I've just never played w/ CoreML before.

jasonjmcghee commented 10 months ago

Please! That would be awesome! Thank you- I can't wait.

roblg commented 10 months ago

Not having great luck with prebuilt coreml model. Will post more later on that.

re: rust/candle, I did notice that candle doesn't support Metal acceleration yet, only the Accelerate framework. I'm not sure if that's a concern for the embedding part, but I can imagine it will be for local LLMs.

jasonjmcghee commented 10 months ago

Not having great luck with prebuilt coreml model. Will post more later on that.

You got this!

candle doesn't support Metal acceleration yet

Problem for another day. Don't need the best solution, just need one that works for now.

vkehfdl1 commented 10 months ago

Hi @jasonjmcghee, I am making RAGchain, a specialized framework for RAG. I know you are interested in building RAG in a local Apple Silicon environment, but I think it would be super cool to get data from rem, ingest it through RAGchain, and talk with an LLM about my memories. What do you think? Do you prefer "no internet connection" for this project?

jasonjmcghee commented 10 months ago

Update (repo here: https://github.com/jasonjmcghee/ragpipe):

This script:


```
$ ./askRem "Which GitHub issues have I read recently?" <(sqlite3 db 'select text from allText order by frameId desc limit 1000')
Batches: 100%|███████████████████████████████| 19/19 [00:11<00:00,  1.65it/s]
You have recently read issues: #3 (dark mode icons), #9 (login item - Rem will run on boot), and #11 (icon looks kinda weird when active in dark mode).
total duration:       26.622822625s
load duration:        5.327591125s
prompt eval count:    1933 token(s)
prompt eval duration: 17.73078s
prompt eval rate:     109.02 tokens/s
eval count:           41 token(s)
eval duration:        3.554184s
eval rate:            11.54 tokens/s
```

jasonjmcghee commented 10 months ago

@vkehfdl1 - definitely want to make it easy to ingest from rem. You can query the sqlite file right now, which will give you the path to the ffmpeg file + frame offset too, so you can get the text and image.

I'd love to simplify this though / make it easy to just ask rem somehow / use it as a datasource
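As a concrete starting point, reading that data from outside rem could be as simple as the sketch below. Only the allText table with text/frameId columns appears earlier in this thread; the columns holding the ffmpeg chunk path and frame offset would need the full schema:

```swift
import SQLite

// Read the most recent OCR'd text straight out of rem's sqlite file.
// Table/column names are taken from the sqlite3 query earlier in the
// thread; joining to the ffmpeg chunk path is left out here.
let db = try Connection("rem.sqlite", readonly: true)
for row in try db.prepare("SELECT frameId, text FROM allText ORDER BY frameId DESC LIMIT 1000") {
    let frameId = row[0] as! Int64
    let text = row[1] as! String
    print(frameId, text)
}
```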

vkehfdl1 commented 10 months ago

@jasonjmcghee Great! I'd love to make a data loader from rem for RAGchain and use rem as a datasource. I'll let you know my progress.

vkehfdl1 commented 10 months ago

@jasonjmcghee I made a loader for RAGchain that is compatible with Langchain. It loads texts from the sqlite3 file and converts them to the Langchain Document schema. You can see the PR here.

Now I'll try to make some kind of demo using rem and RAGchain together.

seletz commented 10 months ago

@vkehfdl1 that looks very cool! Not knowing too much about RAGchain, how would the data extractor pipeline be run? Would it be beneficial if the extractor pipeline were triggered by rem at some fixed interval?

vkehfdl1 commented 10 months ago

@vkehfdl1 that looks very cool! Not knowing too much about RAGchain, how would the data extractor pipeline be run? Would it be beneficial if the extractor pipeline were triggered by rem at some fixed interval?

@seletz I just made a simple example running RAGchain and rem (repo here: https://github.com/vkehfdl1/rem-RAGchain). I think it would be super cool to trigger the ingest pipeline whenever a new rem record is added. For now, you can run ingest.py with crontab, as shown below: it runs my ingest Python script every x minutes, so new records are automatically ingested into new embeddings you can use for talking with the LLM!
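For example, a crontab entry along these lines (the interval and paths are hypothetical):

```
# run the rem-RAGchain ingest every 5 minutes
*/5 * * * * /usr/bin/python3 /path/to/rem-RAGchain/ingest.py
```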

vkehfdl1 commented 10 months ago

@jasonjmcghee @seletz Plus, here is a sample image of me running RAGchain with rem. I was viewing this issue tab while rem recording was turned on 😁

[Screenshot 2023-12-31 at 9:49:27 PM]

jasonjmcghee commented 10 months ago

Cool!

However, answer quality is not good enough.

Did you try writing a custom prompt for the use-case?

jasonjmcghee commented 10 months ago

Would it be beneficial if the extractor pipeline were triggered by rem at some fixed interval?

Could be reading into this the wrong way, but I'd want to make sure it's a client-agnostic approach and, ideally, that rem isn't facilitating outside applications consuming its data.

One of my concerns right now, though, is network-access-related stuff. The smart way (from an eng arch perspective) seems to be to have an API for providing access to data and for talking to agents.

But that unlocks "network access" stuff in App Sandbox, which... idk, I feel many folks would feel better with an "absolutely no network access" approach.

Maybe there could be 2 builds? One with network access entitlements and one without?
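For what it's worth, the difference between those two builds would mostly come down to a single App Sandbox entitlement; a sketch using the standard Apple keys:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>com.apple.security.app-sandbox</key>
    <true/>
    <!-- present only in the networked build; the "absolutely no network
         access" build omits com.apple.security.network.client entirely -->
    <key>com.apple.security.network.client</key>
    <true/>
</dict>
</plist>
```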

seletz commented 10 months ago

@jasonjmcghee @vkehfdl1 I think a "no network connection" policy is very cool. We could use triggers as mentioned in #14 for this. Maybe it would be OK for now to just call a user-provided script which gets the path to the SQLite DB as an argument? The DB tables would be the API, then...
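A minimal sketch of that hook from rem's side (the script path would be user-configured; none of this exists in rem today):

```swift
import Foundation

// Hypothetical trigger: after writing a batch of frames, invoke a
// user-provided script with the path to the SQLite DB as its argument.
// Nothing leaves the machine unless the user's own script sends it.
func runUserHook(scriptPath: String, dbPath: String) throws {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: scriptPath)
    process.arguments = [dbPath]
    try process.run()
    process.waitUntilExit()
}
```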

vkehfdl1 commented 10 months ago

@jasonjmcghee

Did you try writing a custom prompt for the use-case?

I will try your great prompt! Plus, I will try some experiments for improving answer quality. First, it would be good to use hybrid retrieval, meaning a vector DB and BM25 together; searching for a specific word, like a person's name, is probably a common use-case. Second, I want to delete duplicated texts: rem captures the screen often, so the same text shows up many times and needs to be compressed somehow. I plan to try various strategies for this. Third, use a custom prompt. Fourth, use a multi-modal model, though that may take some time to build... (a rough sketch of the hybrid scoring idea is below).
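To illustrate the hybrid-retrieval idea (this is just the scoring shape, not RAGchain's actual API): normalize each retriever's scores, then blend them with a weight:

```swift
// Sketch of hybrid retrieval scoring: blend a lexical (BM25) score with
// a dense vector similarity, both keyed by document/frame id.
func hybridScores(bm25: [Int64: Double],
                  vector: [Int64: Double],
                  alpha: Double = 0.5) -> [(id: Int64, score: Double)] {
    // Min-max normalize each retriever's scores so the weight is meaningful.
    func normalized(_ scores: [Int64: Double]) -> [Int64: Double] {
        guard let lo = scores.values.min(), let hi = scores.values.max(), hi > lo else {
            return scores.mapValues { _ in 0 }
        }
        return scores.mapValues { ($0 - lo) / (hi - lo) }
    }
    let b = normalized(bm25)
    let v = normalized(vector)
    // alpha = 1.0 is pure BM25 (exact words, e.g. a person's name);
    // alpha = 0.0 is pure vector similarity.
    return Set(b.keys).union(v.keys)
        .map { (id: $0, score: alpha * (b[$0] ?? 0) + (1 - alpha) * (v[$0] ?? 0)) }
        .sorted { $0.score > $1.score }
}
```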

vkehfdl1 commented 10 months ago

@seletz That will be cool! I agree it's great for rem to keep "no network connection" as the default, with users always able to access their data easily via hooks or triggers. That looks like the fastest way to build RAG with rem for now. In the future, though, it would be cool for rem to have its own totally local RAG pipeline, using a local embedding model and LLM.

vkehfdl1 commented 10 months ago

@jasonjmcghee I tried your custom prompt here and the results are actually promising. Here are some examples I tried (I recorded the rem issue and repo pages):

Question : Where rem should index all data?
Answer : Rem should index all data in the "allText_content" table in the "main" database.

Question : What is the rem approach for building embedding search and RAG?
Answer : The rem approach for building embedding search and RAG involves indexing all text via an embedding store and using a SQLite extension like sqlite-vss.

But I tried this with only about 2 minutes of recording. I'm now recording a few hours for real use-cases.

vkehfdl1 commented 10 months ago

Update: ingestion now skips duplicated documents. I used token F1 score to calculate similarity. And I use hybrid retrieval plus WeightedTimeReranker to favor the latest information. This is my PR here. Try it!

However, the raw passages (OCR results) are pretty unprocessed, so the LLM can't recognize and extract information easily. That can be a real challenge for high-quality embedding search and QA with rem. There is no silver bullet for now; hopefully OCR quality will increase, or we can use multi-modal models, i.e. models that truly understand GUIs.
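For reference, token F1 between two OCR'd passages is just precision/recall over shared tokens; a self-contained sketch (the tokenizer and the duplicate threshold are assumptions, not what the PR does exactly):

```swift
// Token-level F1 between two OCR'd passages, usable to skip near-duplicate
// frames before ingestion. 1.0 = identical token multisets, 0.0 = disjoint.
func tokenF1(_ a: String, _ b: String) -> Double {
    let tokensA = a.lowercased().split(whereSeparator: { !$0.isLetter && !$0.isNumber })
    let tokensB = b.lowercased().split(whereSeparator: { !$0.isLetter && !$0.isNumber })
    guard !tokensA.isEmpty, !tokensB.isEmpty else { return 0 }

    // Multiset intersection of the two token lists.
    var counts: [Substring: Int] = [:]
    for t in tokensA { counts[t, default: 0] += 1 }
    var overlap = 0
    for t in tokensB where counts[t, default: 0] > 0 {
        counts[t]! -= 1
        overlap += 1
    }
    guard overlap > 0 else { return 0 }

    let precision = Double(overlap) / Double(tokensB.count)
    let recall = Double(overlap) / Double(tokensA.count)
    return 2 * precision * recall / (precision + recall)
}

// e.g. treat frames with F1 above ~0.9 as duplicates (threshold is a guess).
```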

jasonjmcghee commented 6 months ago

I think this looks super promising:

https://github.com/ashvardanian/SwiftSemanticSearch