quran / quran.com-frontend-next

Frontend build on next.js
https://quran.com

[feature]: Vector / Similarity Search for English #2057

Open icepaq opened 7 months ago

icepaq commented 7 months ago

Is there an existing issue for this feature?

Summary

The ability to search for Quran verses based on a question, topic, or best recollection of a verse.

Tarteel.ai already offers a similar feature. However, to my understanding, you must know the verse (or most of it) in Arabic. The proposed feature would help people find verses, or answers to their questions, in English.

Your proposed solution for this feature

Salam Alaikum,

How It Works

A few weeks ago, as a side project, I built an OpenAI-powered search tool for the Quran and Hadith. I turned each Quran verse and each hadith from Bukhari into an embedding with OpenAI's API. The embeddings are stored in ChromaDB, but any vector database would work (possibly including Elasticsearch). Each time a user inputs a phrase, that phrase is converted into an embedding using the same model, and the stored embeddings closest to it are returned as results. Over the last 2-3 weeks, this side project has helped me find verses from vague questions better than a typical search algorithm does.
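To make the embed-then-compare flow concrete, here is a minimal sketch. It is not the actual implementation: `ToyVectorIndex` and its word-count `embed` method are hypothetical stand-ins for the OpenAI embedding call and ChromaDB, and the passages are placeholder strings, not verse text. A word-count vector only captures word overlap, whereas a real embedding model also captures meaning, but the indexing and nearest-neighbour steps have the same shape.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for an embedding model plus a vector store."""

    def __init__(self, passages):
        # Vocabulary built from the indexed passages; a real model needs no such step.
        words = sorted({w for text in passages.values() for w in tokenize(text)})
        self.vocab = {w: i for i, w in enumerate(words)}
        # Embed every passage once, up front.
        self.vectors = {key: self.embed(text) for key, text in passages.items()}

    def embed(self, text):
        """Toy embedding: unit-length word-count vector (assumption, not OpenAI's API)."""
        vec = [0.0] * len(self.vocab)
        for w in tokenize(text):
            if w in self.vocab:
                vec[self.vocab[w]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

    def search(self, query, k=3):
        """Embed the query the same way, then rank passages by cosine similarity."""
        q = self.embed(query)
        scored = sorted(
            ((sum(a * b for a, b in zip(q, v)), key) for key, v in self.vectors.items()),
            reverse=True,
        )
        return [key for _, key in scored[:k]]


# Placeholder passages for illustration only.
PASSAGES = {
    "p1": "patience during hardship and loss",
    "p2": "charity and kindness toward orphans",
    "p3": "honesty in trade and fair measure",
}

index = ToyVectorIndex(PASSAGES)
print(index.search("being patient during hardship", k=1))  # ['p1']
```

In a real setup, the passage embeddings would be computed once and persisted, so each search costs only one embedding call for the query.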

Proposed UI / UX

image

When a user searches for a phrase in English, by default they will have a regular search and an option for a vector (labeled AI) search.

Tech and Costs

I have my API running via Flask on a Docker container hosted in Google Cloud Run. Any sort of serverless platform would work given that there is a persistent file system.

OpenAI's embedding API is very cheap: it costs $0.0001 / 1K tokens. Assuming each search contains 15 tokens, that works out to about $0.0000015 per search, or about $1.50 per million searches.
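As a quick check of the arithmetic, the sketch below multiplies out the quoted price of $0.0001 per 1K tokens with the assumed 15 tokens per search. The function names are made up for illustration, and the per-search token count is only the estimate used above.

```python
PRICE_PER_1K_TOKENS_USD = 0.0001  # embedding price quoted above
TOKENS_PER_SEARCH = 15            # assumed average query length


def cost_per_search(tokens=TOKENS_PER_SEARCH, price_per_1k=PRICE_PER_1K_TOKENS_USD):
    """Cost in USD of embedding one query."""
    return tokens / 1000 * price_per_1k


def cost_for_searches(n_searches, tokens=TOKENS_PER_SEARCH,
                      price_per_1k=PRICE_PER_1K_TOKENS_USD):
    """Cost in USD of embedding n_searches queries."""
    return n_searches * cost_per_search(tokens, price_per_1k)


print(f"{cost_per_search():.7f}")             # 0.0000015
print(f"{cost_for_searches(1_000_000):.2f}")  # 1.50
```

This covers only the per-query embedding calls. Embedding the verses and hadith is a one-time cost, and hosting is separate.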

Final Notes

I am opening this issue here to facilitate a discussion and am open to sharing/bouncing ideas. If anyone believes that this would cause more overhead than provide benefit, I am also open to hearing that out.

ashaltu commented 4 months ago

Assalamu alaykum warahmatullahi wabarakatu, Bismillah wassalatu wassalam ala Rasulillah,

I recently did a project similar to @icepaq's, but tried two different embedding models: an OpenAI embedding model and the Instructor model. My own searches showed mixed results, so I'd like a more formal comparison, with a fixed list of questions and a test group evaluating the query responses.

I definitely agree with @icepaq's proposal and assessment of costs: it is very cheap and should scale well cost-wise. Costs could be reduced further depending on the embedding model used (e.g. Instructor can be self-hosted, but requires infra maintenance). Other costs I would consider include building out an API around this (which I assume would be free to use) and maintaining it.

I was planning to build this out as a separate website experience, but since I saw this issue raised I figured it'd be worth mentioning it here first. A few extra suggestions where this could be added:

sharabash commented 1 month ago

Can you DM me on Discord so that I can learn more about your implementation? Discord username nsharabash