opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples such as ChatQnA and Copilot, which illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

Proposal and feedback requested: Wikipedia RAG GenAIExamples #603

endomorphosis opened this issue 4 weeks ago

endomorphosis commented 4 weeks ago

I am mentoring some college students with LAION. One of the students is working on embeddings for Wikipedia; it's not yet ready to be pushed to OPEA, but I want to collect feedback on an issue we discussed.

Do you prefer to have the entire article text in the vector DB, or only the article abstract? Also, I asked him to follow the example in the Hugging Face datasets documentation with regard to using the Hugging Face FaissIndex and Elasticsearch index, but I want to confirm that this is the method that works best for you.
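For reference, the pattern I pointed him at is roughly the one from the Hugging Face datasets docs; the dataset, column names, and embedding model below are just placeholders, not his final pipeline:

```python
# Rough pattern from the Hugging Face datasets docs for Faiss-based retrieval.
# Dataset name, column names, and the embedding model are placeholders.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
ds = load_dataset("wikipedia", "20220301.simple", split="train[:1000]")

# Embed the article text (full text or abstract, whichever we settle on).
ds = ds.map(lambda batch: {"embeddings": model.encode(batch["text"])}, batched=True)

# Build an in-memory Faiss index over the embedding column.
# (datasets also offers add_elasticsearch_index for the Elasticsearch path.)
ds.add_faiss_index(column="embeddings")

# Retrieve the nearest articles for a query.
query = model.encode("What is the Open Platform for Enterprise AI?")
scores, retrieved = ds.get_nearest_examples("embeddings", query, k=5)
print(retrieved["title"])
```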

@sleepingcat4 is the college student. His WIP repository is located here: https://github.com/sleepingcat4/wikidataset and his WIP dataset is here: https://huggingface.co/datasets/laion/Wikipedia_11_23_BGE-M3_Embeddings (but I have told him he needs to rework both of these, so be aware that this information is going to change).

kevinintel commented 2 weeks ago

I prefer the entire text. By the way, OPEA is microservice-based, so please think about how to contribute to OPEA.

christophschuhmann commented 2 weeks ago

Okay, then the entire text :)

sleepingcat4 commented 2 weeks ago

@kevinintel what do you think of using late chunking for generating our full-text wiki embeddings?

https://colab.research.google.com/drive/1IIAHEomlhUAisIz1NJTdVdtq-2A0SSCS?usp=sharing

It's specifically designed to leverage the larger context windows made available by recent embedding models, and it makes it easy to capture the semantic relationships between sentences in different chunks.

I was thinking we could use this method to generate our full-text embeddings (it hasn't been applied to building big datasets yet).
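Roughly, the idea is to encode the whole article once and pool token embeddings per chunk afterwards, so each chunk vector still carries document-level context. A minimal sketch, where the model name and chunk size are illustrative rather than what we've settled on:

```python
# Rough sketch of "late chunking": encode the whole article once with a
# long-context embedding model, then mean-pool the token embeddings inside
# each chunk span. Model name and chunk size are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def late_chunk(text: str, chunk_tokens: int = 256) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    # Pool contiguous token spans into one embedding per chunk.
    chunks = []
    for start in range(0, token_embeddings.size(0), chunk_tokens):
        span = token_embeddings[start:start + chunk_tokens]
        chunks.append(span.mean(dim=0))
    return torch.stack(chunks)  # (num_chunks, dim)
```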

endomorphosis commented 2 weeks ago

https://github.com/HabanaAI/vllm-fork/pull/144

This week I am going to continue attempting to get Llama 405B working with speculative decoding via Llama 8B, and to process some Wikipedia datasets and embeddings. I am going to use recursive summarization and sliding-window embeddings on article text where it exceeds the embedding window. Ammar is producing the abstract embeddings right now.
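For the articles that don't fit in the embedding window, the sliding-window step would look roughly like this (window and stride sizes are placeholders, not final values):

```python
# Hypothetical sketch of the sliding-window fallback for articles longer than
# the embedding model's context window: overlapping token windows are each
# embedded separately. Window/stride sizes are illustrative.
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = SentenceTransformer("BAAI/bge-m3")

def sliding_window_embeddings(text: str, window: int = 512, stride: int = 384):
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    # Overlapping windows cover the whole article; the last windows may be shorter.
    windows = [
        tokenizer.decode(token_ids[i:i + window])
        for i in range(0, len(token_ids), stride)
    ]
    return model.encode(windows)  # one vector per overlapping window
```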

sleepingcat4 commented 1 week ago

https://huggingface.co/datasets/laion/Wikipedia-X https://huggingface.co/datasets/laion/Wikipedia-X-Full

The datasets for both the abstracts and the full text of Wikipedia in 17 different languages have been created. Embeddings are being run on a 3090 server. (My repo has the updated code to compile the dataset.)

kevinintel commented 1 day ago

Please try to create a PR first. Late chunking may not be better than the current embedding approach, but you are welcome to expand the functionalities.

endomorphosis commented 1 day ago

The abstract embeddings are still running. https://huggingface.co/datasets/laion/Wikipedia-M3

I made a repository for searching through the embeddings, but I am still working on the embedding generation scripts that recursively summarize and chunk the articles.

https://github.com/endomorphosis/laion-embeddings/tree/master
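By recursive summarization I mean something along these lines; the summarization model and length thresholds below are placeholders, not what the scripts will actually use:

```python
# Rough outline of the recursive summarization step: summarize fixed-size
# pieces of an over-long article and repeat until the result fits the
# embedding window. Model and thresholds are placeholders.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def recursive_summarize(text: str, max_words: int = 500) -> str:
    words = text.split()
    if len(words) <= max_words:
        return text
    # Summarize each piece, join the summaries, and recurse until it fits.
    pieces = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    summaries = [
        summarizer(p, max_length=150, min_length=30)[0]["summary_text"] for p in pieces
    ]
    return recursive_summarize(" ".join(summaries), max_words)
```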

sleepingcat4 commented 1 day ago

@endomorphosis @kevinintel

https://huggingface.co/datasets/laion/Wikipedia-M3

Wikipedia M3 is done. In this dataset, I generated embeddings of the abstracts for the 10 most widely spoken languages and languages of active research groups.

These languages are:

  1. English
  2. German
  3. Polish
  4. French
  5. Spanish
  6. Portuguese
  7. Italian
  8. Russian
  9. Hebrew
  10. Chinese

The initial focus for the embeddings was North American, South American, and European languages, the exception being Chinese. We plan to expand to Japanese and Korean in our next iteration with a different, more advanced embedding model, such as the Jina AI (8K context) model from Germany or the Jina ColBERT embedding models.