upstash / wikipedia-semantic-search

Semantic Search on Wikipedia with Upstash Vector
https://wikipedia-semantic-search.vercel.app/
MIT License
429 stars 34 forks

Can you provide a documentation? How to call this work? How to use this work in your own python program? Thank you! #10

Open willson113 opened 3 months ago

willson113 commented 3 months ago

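I have a dataset of more than 10,000 question-answer pairs (built with Wikipedia as the background knowledge base) and would like to use your work for RAG. The idea is that a question is first used to search Wikipedia, the relevant paragraphs are returned, and those paragraphs are then passed to different large language models (the models I plan to experiment with are ChatGLM, LLaMA, GLM, and Wenxin). Can your work do this? How should I go about it? Looking forward to your reply, thank you!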

willson113 commented 3 months ago

I have a dataset containing more than 10,000 questions and answers (built with Wikipedia as the background knowledge base). I want to use your work to do RAG. First, a question is used to retrieve relevant paragraphs from Wikipedia, and those paragraphs are then provided to different large language models (the models I want to experiment with are ChatGLM, LLaMA, GLM, and Wenxin). Can your work support this? How should I set it up? Looking forward to your reply, thank you!

ytkimirti commented 3 months ago

We use our own RAG Chat package in this repo, which does most of the heavy lifting for us when doing RAG. You can look at the examples in the RAG Chat repository and this project as a reference.

From what I understand, you just need to upsert your questions and answers as context to the vector database; that should be enough.
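
Since the question was about Python, here is a minimal sketch of that upsert-then-retrieve flow. It is not the code used in this repo (which relies on the RAG Chat package). It assumes the `upstash-vector` Python SDK, whose `Index.upsert` and `Index.query` calls are used below, and an index created with a built-in embedding model so that raw text can be upserted. The helper names (`build_records`, `upsert_in_batches`, `retrieve_context`, `build_prompt`), the `question`/`answer` keys, the id scheme, and the batch size are all hypothetical choices made for illustration.

```python
from typing import Any, Dict, List, Tuple

# (id, raw text to embed, metadata) -- assumes an index with a built-in embedding model
Record = Tuple[str, str, Dict[str, Any]]


def build_records(qa_pairs: List[Dict[str, str]]) -> List[Record]:
    """Turn question/answer dicts into (id, text, metadata) tuples for upserting."""
    records: List[Record] = []
    for i, pair in enumerate(qa_pairs):
        text = f"Q: {pair['question']}\nA: {pair['answer']}"
        metadata = {"question": pair["question"], "answer": pair["answer"]}
        records.append((f"qa-{i}", text, metadata))
    return records


def upsert_in_batches(index: Any, records: List[Record], batch_size: int = 100) -> int:
    """Upsert records in fixed-size batches; returns the number of records sent."""
    for start in range(0, len(records), batch_size):
        index.upsert(vectors=records[start:start + batch_size])
    return len(records)


def retrieve_context(index: Any, question: str, top_k: int = 3) -> List[str]:
    """Query the index with the raw question text and format the hits as passages."""
    results = index.query(data=question, top_k=top_k, include_metadata=True)
    return [f"Q: {r.metadata['question']}\nA: {r.metadata['answer']}" for r in results]


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a plain-text prompt that can be sent to any chat or completion model."""
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Requires `pip install upstash-vector` and real credentials (placeholders here).
    from upstash_vector import Index

    index = Index(url="UPSTASH_VECTOR_REST_URL", token="UPSTASH_VECTOR_REST_TOKEN")
    pairs = [{"question": "What is the capital of France?", "answer": "Paris"}]
    upsert_in_batches(index, build_records(pairs))
    question = "Which city is France's capital?"
    print(build_prompt(question, retrieve_context(index, question)))
```

The resulting prompt is plain text, so the same retrieved context can be passed to each model under comparison through whatever client that model provides.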

willson113 commented 3 months ago

Chinese Wikipedia is a very large dataset. Can the free tier of Upstash Vector store that much data? Could we get access to the 1.44 vectors you have already generated? I am only learning and would use them locally, not for anything commercial.

ytkimirti commented 3 months ago

We use a single Upstash Vector database for this project. For technical details, you can refer to this blog post. You should be able to implement a similar setup; however, given the large dataset, the free tier may not be sufficient. You can review the pricing page for more information on the available plans.

Please note that we are using this index exclusively for this project and currently do not plan to make it publicly accessible.