hwchase17 / notion-qa


Understanding embeddings #1

Closed. aavetis closed this issue 11 months ago

aavetis commented 1 year ago

Can we add general info to the readme about how embeddings work here? Or if you can point me to something I can read up on, I'm happy to update it and PR. Questions coming to mind:

- The files that are output as a result (docs.index and faiss_store.pkl) - what does each file represent? Do we know if OpenAI keeps a copy of all embeddings made? Or are those effectively "things you own locally"?
- Is there a way to use multiple vector embeddings in a QA prompt? And would the system decide where to look?

Thanks for any insight.

jasielmacedo commented 1 year ago

Hey, he's using both OpenAI and Faiss to answer the user.

It essentially takes all of the data and calls the OpenAI embedding API to get vectors for these texts, then uses the Faiss library to compare those vectors and find the most similar ones.

Faiss is a vector similarity-search library from Facebook (Meta) AI Research.
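
A minimal sketch of that flow (not this repo's exact code; it assumes the classic pre-1.0 `openai` Python client, an `OPENAI_API_KEY` in the environment, and `faiss` plus `numpy` installed):

```python
import numpy as np
import faiss
import openai  # classic pre-1.0 client; reads OPENAI_API_KEY from the environment

texts = ["Notion page about onboarding", "Notion page about expense reports"]

def embed(batch):
    # text-embedding-ada-002 returns 1536-dimensional vectors
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=batch)
    return np.array([d["embedding"] for d in resp["data"]], dtype="float32")

vectors = embed(texts)
index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over the stored vectors
index.add(vectors)

# Embed a question and find the most similar stored text
query_vec = embed(["How do I file an expense report?"])
distances, ids = index.search(query_vec, k=1)
print(texts[ids[0][0]], distances[0][0])
```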

So, as soon as you run the ingest step, all of that data is sent to the OpenAI API, which is billed to the key's owner. Usage is paid per input token, at $0.0004 per 1,000 tokens, or around 3,000 pages per US dollar (assuming 800 tokens per page). See the OpenAI Guide - Embedding Models.
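
If you want a rough cost estimate before ingesting, you can count tokens locally. This sketch assumes the `tiktoken` tokenizer (cl100k_base is the encoding used by text-embedding-ada-002) and the $0.0004 per 1K-token price quoted above; check current pricing:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by text-embedding-ada-002

def embedding_cost(texts, price_per_1k_tokens=0.0004):
    total_tokens = sum(len(enc.encode(t)) for t in texts)
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens

tokens, usd = embedding_cost(["...your Notion page contents..."])
print(f"{tokens} tokens -> ~${usd:.4f}")
```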

The files that are outputted as a result (docs.index and faiss_store.pkl) - what does each file represent? Do we know if OpenAI keeps a copy of all embeddings made? Or are those effectively "things you own locally"?

The local files are needed to store the received vectors so you can compare them locally (using Faiss in this case) and measure their relatedness. You can return the matched result directly to the user, or use it to compose a prompt when calling the OpenAI Completion API.
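
Roughly, the ingest step persists two things: docs.index is the raw Faiss index (the embedding vectors themselves), and faiss_store.pkl is a pickled wrapper that maps index positions back to the original text chunks and their metadata. A sketch of that step, based on how LangChain's FAISS vector store is typically saved (see ingest.py in this repo for the exact code):

```python
import pickle
import faiss
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

docs = ["Notion page text chunk 1", "Notion page text chunk 2"]   # placeholder chunks
metadatas = [{"source": "Page-1"}, {"source": "Page-2"}]          # e.g. which page each chunk came from

# Embed every chunk via the OpenAI API and build the Faiss index
store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)

# docs.index: the raw Faiss index holding the embedding vectors
faiss.write_index(store.index, "docs.index")

# faiss_store.pkl: the pickled wrapper mapping index positions back to the
# original chunks/metadata (the index is detached so the pickle stays small)
store.index = None
with open("faiss_store.pkl", "wb") as f:
    pickle.dump(store, f)
```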

Take a look at the documentation example: OpenAI Embeddings - Use cases.

Is there a way to use multiple vector embeddings in a QA prompt? And the system would decide where to look?

And I think this example answers your last question: OpenAI's example of question answering using embeddings.
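
The core loop of that pattern, as a hedged sketch: embed the question, pull the most similar chunks out of the local index, and stuff them into a completion prompt. Here `index` and `chunks` are just placeholders for the Faiss index and text chunks built at ingest time, and it again assumes the classic pre-1.0 `openai` client and a completion model such as text-davinci-003:

```python
import numpy as np
import openai
import faiss

def answer(question, index, chunks, k=3):
    # Embed the question with the same model used at ingest time
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=[question])
    qvec = np.array([resp["data"][0]["embedding"]], dtype="float32")

    # Retrieve the k most similar chunks from the local Faiss index
    _, ids = index.search(qvec, k)
    context = "\n\n".join(chunks[i] for i in ids[0])

    # Compose a prompt that grounds the completion in the retrieved context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    completion = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=256
    )
    return completion["choices"][0]["text"].strip()
```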

Guaderxx commented 11 months ago

Hello, bro, I found this after running the OpenAI tutorial. I kind of understand what a QA system and embeddings are.

But I don't quite understand what job LangChain does here.

Thanks for any insight.